Check title participation in all field search

ryan-jacobs commented 2 years ago

It's possible that the title string is not included correctly in all-field searches. We need to check this.

Here's an example that currently reproduces this:

A search for "Annual report of the Kahuku Plantation Company" against "All Fields" does not lead to a hit, but searching specifically for this string using "Title" does generate a hit.

ryan-jacobs commented 2 years ago

This may have something to do with stemming or how title variations are indexed differently for "title" vs "all field" searches.

In searchspecs.yaml I can see the following is used to process title searches:

Title:
  DismaxFields:
    - title_short^500
    - title_full_unstemmed^450
    - title_full^400
    - title^300
    - title_alt^200
    - title_new^100
    - title_old
    - series^100
    - series2
  DismaxHandler: edismax

But what's used to process all field searches is slightly different:

AllFields:
  DismaxFields:
    - title_short^750
    - title_full_unstemmed^600
    - title_full^400
    - title^500
    - title_alt^200
    - title_new^100
    - series^50
    - series2^30
    [other non-title values...]
  DismaxHandler: edismax

As far as I can see the differences only relate to boost values, but perhaps we need to consider if these rules need to be brought closer into alignment.
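
For anyone who wants to double-check that comparison, here is a minimal Python sketch that diffs the two DismaxFields lists exactly as quoted above. The helper name `parse_fields` is made up for illustration, and the AllFields list only includes the title/series entries shown in this comment (the elided non-title values are left out).

```python
def parse_fields(entries):
    """Map each 'field^boost' entry to its boost (None when no boost is given)."""
    parsed = {}
    for entry in entries:
        name, _, boost = entry.partition("^")
        parsed[name] = int(boost) if boost else None
    return parsed


# Field lists copied from the Title and AllFields snippets quoted above.
title = parse_fields([
    "title_short^500", "title_full_unstemmed^450", "title_full^400",
    "title^300", "title_alt^200", "title_new^100", "title_old",
    "series^100", "series2",
])
all_fields = parse_fields([
    "title_short^750", "title_full_unstemmed^600", "title_full^400",
    "title^500", "title_alt^200", "title_new^100",
    "series^50", "series2^30",
])

# Fields listed for Title searches but absent from the quoted AllFields entries.
only_in_title = sorted(set(title) - set(all_fields))

# Fields present in both lists whose boosts differ: (Title boost, AllFields boost).
boost_changes = {
    name: (title[name], all_fields[name])
    for name in title
    if name in all_fields and title[name] != all_fields[name]
}

print("Only in Title:", only_in_title)
print("Boost differences (Title, AllFields):", boost_changes)
```

Run against the snippets as quoted, this also reports title_old as present only in the Title list, which may or may not matter here.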

I could see discrepancies like this playing out in a number of other cases that are not specific to just title searches. We should also get some guidance from EBSCO on the right strategies for tuning keyword searches like this.

ryan-jacobs commented 2 years ago

Upon further analysis this seems to be a regression introduced by #16 as I can eliminate this problem by removing the oclc_num field from the AllFields group:

AllFields:
  DismaxFields:
    - title_short^750
    - title_full_unstemmed^600
    - title_full^400
    - title^500
    - title_alt^200
    - title_new^100
    - series^50
    - series2^30
    - author^300
    - contents^10
    - topic_unstemmed^550
    - topic^500
    - geographic^300
    - genre^300
    - allfields_unstemmed^10
    - fulltext_unstemmed^10
    - allfields
    - fulltext
    - description
    - isbn
    - issn
    - long_lat_display
#    - oclc_num
  DismaxHandler: edismax

It looks like all the other fields in this list are of a "text" or "isn" type, both of which perform tokenization. oclc_num, on the other hand, is of type "string", which does not perform tokenization. My best guess is that adding a string-type field to the mix disables tokenization for the whole query, which is of course very bad. The fact that the "Annual report of the Kahuku Plantation Company" string contains "of the" makes me think that some tokenization is going on when the query is pre-processed, and that this is somehow incompatible with the string-type field. This could explain why the problem only shows up for a subset of queries.
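
To illustrate the difference between the field types, here is a toy Python model. It is not Solr's actual analysis chain: it simply assumes a text field is lowercased and split on whitespace, while a string field keeps the whole value as a single term. The function names and the sample OCLC number are made up for illustration.

```python
def text_terms(value):
    """Toy stand-in for a tokenized text field: lowercase, split on whitespace."""
    return value.lower().split()


def string_terms(value):
    """Toy stand-in for a string field: the entire value is one indexed term."""
    return [value]


query = "Annual report of the Kahuku Plantation Company"

# Tokenized field: the query yields one term per word.
print(text_terms(query))

# String field: the query stays a single term, so it can only match a
# stored value that is identical to the whole query.
print(string_terms(query))

stored_oclc = "12345678"  # hypothetical value, for illustration only
print(string_terms(query) == string_terms(stored_oclc))  # False
```

This only shows that the same query produces a different number of terms depending on the field type. Whether that mismatch is what actually breaks the edismax query here is still the guess described above.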

We could:

1. Simply remove oclc_num from the "All fields" query.
2. Create an oclc_num_text field, copy oclc_num into it, and then use oclc_num_text in the AllFields list.

Option 1 seems like bad UX as long as we are also exposing OCLC num as a targeted query option, so option 2 feels better. We just have to be careful about preventing partial hits against OCLC numbers in queries.

ryan-jacobs commented 2 years ago

I'll also mark this high priority as this really feels like something that could get in the way of other search behavior evaluations that are happening in other active issues right now.

ryan-jacobs commented 2 years ago

It seems that just ensuring the OCLC number is copied (Solr copyField) into the allfields field should be enough to make things work, without the need to create a new oclc_num_text field.

Center-for-Research-Libraries / vufind

Check title participation in all field search #40