esmero / format_strawberryfield

Set of Display formatters, extractors and utils to make Strawberry field data glow
GNU Lesser General Public License v3.0
6 stars 10 forks source link

Bookreader Navigation of search results within Creative Work Series objects has incorrect sequence #470

Open patdunlavey opened 1 month ago

patdunlavey commented 1 month ago

We have a object type that we call "CWSBook", which is basically a Creative Work Series (digital object collection) object with a series of "Page" objects as children that have the individual page image files. We are able to display this in IA Bookreader using the v2.1 IIIF Manifest. OCR is indexed for these pages with both sequence_id and parent_sequence_id - where the latter records the order of the Pages in the CWSBook.

The problem we are having is that when a search is performed in IAB, the search results are returned sorted by relevance, and not by the page sequence, and this results in the previous/next links in the search results area to jump the user seemingly wildly back and forth.

This is because the sequence of the pages being displayed in IAB corresponds to (comes from?) the parent_sequence_id value indexed in solr, not the sequence_id, whereas \Drupal\strawberryfield\Controller\StrawberryfieldFlavorDatasourceSearchController::flavorfromSolrIndex sorts only by sequence_id, when present.

I'm able to force the correct sequence in the results and correct IAB behavior in my testing with this simple change

      if (isset($allfields_translated_to_solr['parent_sequence_id'])) {
        $fields_to_retrieve['parent_sequence_id'] = $allfields_translated_to_solr['parent_sequence_id'];
        // Creative Work Series need to first be sorted by parent_sequence_id.
        $query->sort('parent_sequence_id', QueryInterface::SORT_ASC);
      }

However, I'm not sure if there are cases where this flavorfromSolrIndex method may be called where this sort is not the correct behavior. I also wonder if sorting by sequence_id - the current behavior - is always correct? The flavorfromSolrIndex method seems to be intended to be used pretty broadly. Are there cases where only sort by relevance is correct?

patdunlavey commented 1 month ago

From @DiegoPino on slack: Diego Pino 4:39 PM So there is one issue (why i avoided it...)

if (isset($allfields_translated_to_solr['parent_sequence_id']) &&
  isset($extradata_from_item['search_api_solr_document'][$allfields_translated_to_solr['parent_sequence_id']]) && !$result_is_top_media) {
  $sequence_number = (array) $extradata_from_item['search_api_solr_document'][$allfields_translated_to_solr['parent_sequence_id']];
  if (isset($sequence_number[0]) && !empty($sequence_number[0]) && ($sequence_number[0] != 0)) {
    // We do all this checks to avoid adding a strange offset e.g a collection instead of a CWS
    $page_number_by_id[$extradata_from_item['search_api_solr_document']['id']] = $sequence_number[0];
  }
}
patdunlavey commented 1 month ago

@DiegoPino - in https://github.com/esmero/strawberryfield/blob/1.5.0/src/Controller/StrawberryfieldFlavorDatasourceSearchController.php#L112-L122 we have...

    //search for IAB highlight
    // if format=json, page=all and q not null
    if (($input = $request->query->get('q')) && ($format == 'json') && ($page == 'all')) {

      $snippets = $this->flavorfromSolrIndex(
        $input,
        $node->id(),
        $processor,
        $fileuuid,
        $indexes
      );

Do I understand the "search for IAB highlight" comment to mean that the conditions there (non-empty 'q' param, result format = 'json', page = 'all') indicate that we are searching for and intend to return ocr highlights? If so (and assuming the if condition is valid for this purpose), then can we use this as the test to determine if the search results should be sorted by sequence ids - both sequence_id and parent_sequence_id?

patdunlavey commented 1 month ago

Hi @DiegoPino, I've ascertained that, for my project, the solution I proposed here is safe/valid: I have no cases where adding sorting by parent_sequence_id would throw off the desired sorting.

For my purposes, for now, I think I'll just do a composer patch so that I'm not blocked on this.

But I know we need to think more expansively and identify cases where this assumption is not valid. What are the cases where parent_sequence_id would have a value in a strawberryflavor record where it would invalidate the sorting? How can we programmatically identify those cases, either directly inside ::flavorFromSolrIndex, or inside ::search (passing a flag into ::flavorFromSolrIndex), so that it knows not to sort on the parent sequence id)?

Closely related question: if sorting on parent_sequence_id is sometimes not correct, is always sorting on sequence_id correct? Are there times when ::flavorfromSolrIndex is called when the search result relevance actually is the correct sort?

Finally, do we handle missing values sort correctly for this case (missing values should sort last in an ASC sort, I think - we would not want to sort by parent_sequence_id and have all the records with no parent_sequence_id come out first)?