Closed ryan-jacobs closed 2 years ago
This may have something to do with stemming or how title variations are indexed differently for "title" vs "all field" searches.
In searchspecs.yaml I can see the following is used to process title searches:
Title:
  DismaxFields:
    - title_short^500
    - title_full_unstemmed^450
    - title_full^400
    - title^300
    - title_alt^200
    - title_new^100
    - title_old
    - series^100
    - series2
  DismaxHandler: edismax
But what's used to process all field searches is slightly different:
AllFields:
  DismaxFields:
    - title_short^750
    - title_full_unstemmed^600
    - title_full^400
    - title^500
    - title_alt^200
    - title_new^100
    - series^50
    - series2^30
    [other non-title values...]
  DismaxHandler: edismax
As far as I can see the differences mostly relate to boost values (title_old is also missing from the AllFields list), but perhaps we should consider whether these rules need to be brought into closer alignment.
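To make the comparison less error-prone than reading the two lists side by side, here is a minimal sketch that diffs them. The field lists are copied from the snippets above; the helper names (`parse_boosts`, `diff_boosts`) are made up for illustration, and a field without a `^boost` suffix is assumed to carry the default boost of 1.

```python
# Field lists copied from the searchspecs.yaml snippets quoted above.
TITLE = [
    "title_short^500", "title_full_unstemmed^450", "title_full^400",
    "title^300", "title_alt^200", "title_new^100", "title_old",
    "series^100", "series2",
]
ALL_FIELDS_TITLE_PART = [
    "title_short^750", "title_full_unstemmed^600", "title_full^400",
    "title^500", "title_alt^200", "title_new^100",
    "series^50", "series2^30",
]


def parse_boosts(entries):
    """Map field name -> boost; assumes 1.0 when no ^boost suffix is present."""
    boosts = {}
    for entry in entries:
        name, _, boost = entry.partition("^")
        boosts[name] = float(boost) if boost else 1.0
    return boosts


def diff_boosts(left, right):
    """Return (fields only in left, fields only in right, changed boosts)."""
    a, b = parse_boosts(left), parse_boosts(right)
    only_left = sorted(set(a) - set(b))
    only_right = sorted(set(b) - set(a))
    changed = {f: (a[f], b[f]) for f in a if f in b and a[f] != b[f]}
    return only_left, only_right, changed


if __name__ == "__main__":
    only_title, only_all, changed = diff_boosts(TITLE, ALL_FIELDS_TITLE_PART)
    print("Only in Title:", only_title)
    print("Only in AllFields:", only_all)
    for field, (title_boost, all_boost) in sorted(changed.items()):
        print(f"{field}: Title={title_boost} AllFields={all_boost}")
```

This only compares the title-related part of AllFields; the remaining AllFields entries have no counterpart in the Title spec.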
I could see discrepancies like this playing out in a number of other cases that are not specific to just title searches. We should also get some guidance from EBSCO on the right strategies for tuning keyword searches like this.
Upon further analysis this seems to be a regression introduced by #16 as I can eliminate this problem by removing the oclc_num field from the AllFields group:
AllFields:
  DismaxFields:
    - title_short^750
    - title_full_unstemmed^600
    - title_full^400
    - title^500
    - title_alt^200
    - title_new^100
    - series^50
    - series2^30
    - author^300
    - contents^10
    - topic_unstemmed^550
    - topic^500
    - geographic^300
    - genre^300
    - allfields_unstemmed^10
    - fulltext_unstemmed^10
    - allfields
    - fulltext
    - description
    - isbn
    - issn
    - long_lat_display
#    - oclc_num
  DismaxHandler: edismax
It looks like all the other fields in this list are of a "text" or "isn" type, both of which perform tokenization. oclc_num, on the other hand, is of type "string", which does not perform tokenization. My best guess is that adding a string-type field to the mix disables tokenization for the whole query, which is of course very bad. The fact that the "Annual report of the Kahuku Plantation Company" string contains "of the" makes me think that some tokenization happens when the query is pre-processed, and that this is somehow incompatible with the string-type field. That would explain why the problem only shows up for a subset of queries.
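One way to check this guess would be to ask Solr how it parses the query with and without oclc_num in the field list. The sketch below only builds the two request URLs; `q`, `defType`, `qf`, `rows`, `wt` and `debugQuery` are standard Solr request parameters, but the host, port and core name (`localhost:8983`, `biblio`) are assumptions about the local setup, and `debug_url` is a hypothetical helper. It also does not replicate any other parameters the search layer may send.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named "biblio".
SOLR_SELECT = "http://localhost:8983/solr/biblio/select"

QUERY = "Annual report of the Kahuku Plantation Company"

# AllFields list as quoted above, without oclc_num.
QF_WITHOUT_OCLC = [
    "title_short^750", "title_full_unstemmed^600", "title_full^400",
    "title^500", "title_alt^200", "title_new^100", "series^50",
    "series2^30", "author^300", "contents^10", "topic_unstemmed^550",
    "topic^500", "geographic^300", "genre^300", "allfields_unstemmed^10",
    "fulltext_unstemmed^10", "allfields", "fulltext", "description",
    "isbn", "issn", "long_lat_display",
]
QF_WITH_OCLC = QF_WITHOUT_OCLC + ["oclc_num"]


def debug_url(qf_fields, query=QUERY, base=SOLR_SELECT):
    """Build an edismax request URL that asks Solr for debug output."""
    params = {
        "q": query,
        "defType": "edismax",
        "qf": " ".join(qf_fields),
        "rows": "0",           # only the debug section is of interest
        "wt": "json",
        "debugQuery": "true",  # adds the parsed query to the response
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print("With oclc_num:\n", debug_url(QF_WITH_OCLC))
    print("Without oclc_num:\n", debug_url(QF_WITHOUT_OCLC))
```

Comparing the parsed query in the two debug responses should show whether the string field changes how the terms are split and combined.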
We could:
Option 1 seems like bad UX as long as we are also exposing OCLC num as a targeted query option, so option 2 feels better. We just have to be careful about preventing partial hits against OCLC numbers in queries.
I'll also mark this high priority as this really feels like something that could get in the way of other search behavior evaluations that are happening in other active issues right now.
It seems that just ensuring the OCLC number is copied (Solr copyField) into the allfields field should be enough to make things work, without the need to create a new oclc_num_text field.
It's possible that the title string is not included correctly in all-field searches. We need to check this.
Here's an example that currently reproduces this:
A search for "Annual report of the Kahuku Plantation Company" against "All Fields" does not lead to a hit, but searching specifically for this string using "Title" does generate a hit.