Spike: Explore the possibility of boosting the precise title field to get better results with exact title searching

tclayton33 commented 1 year ago

I have several examples from the Library Search Committee of title searches (mostly of short, exact titles) that are producing unsatisfactory results. I'd like to have a discussion with the developers to explore what could happen if the precise title fields is boosted.

Here is the example file

rotated8 commented 1 year ago

Here's how I reproduced one of the examples- the "Central Station" one.

This is a title search, so lets look at the definition of a title search.

First, look at the fields title search queries.
Next, look at the definition of the title search itself.

From this we know how to replicate the search in Solr.

Go to the Solr interface for the environment we want to try this query in, and select the collection with the appropriate data.
On the query page, set q to be our query, "Central Station", select the checkboxes for debugQuery and edismax, and finally, concatenate the list of title fields from above (do this programmatically unless you like removing all the quotes) and set that as qf.
Click the 'Execute Query' button.

The results should mostly match what was reported by the committee, as long as the data in the selected collection is the same as prod.

At the bottom of the results, Solr will provide a 'debug' section, which includes an 'explain' subsection. This shows the math that creates the scores that provide our relevancy. The two most important explanations are for the result the committee thinks should be higher (id '990000746450302486', score 264.82336), and for the result above it (id '990022621180302486', score 270.9209). Full and truncated explanations are in Sharepoint.

The main takeaway is that the lower ranking result has more text in its description than the higher ranking one, and so "Central Station" is less of that item's description than it is for the higher ranking result.

rotated8 commented 1 year ago

@abelemlih I am surprised to see all_text_timv affecting the score of a title search. Can you look into that? I should note that I cannot get similar results from Solr if pf='' is set, as in the search definition.

rotated8 commented 1 year ago

@tclayton33 I must warn you and the committee: changes in relevance will have knock-on effects. We cannot boost one term without effectively de-boosting all the rest. If the committee is happy with results generally, there is no way to change boosts without changing those other results.

tclayton33 commented 1 year ago

@rotated8 @abelemlih The committee did discuss that any changes we make in this area could have undesirable consequences, and we do want to prevent that. But, a considerable number of members are also dissatisfied with some of the results for short, exact titles. I just added a new example that came in from a faculty member last week (no.7). I've also run some comparable searches in Stanford's catalog and added those links to the example document. I'm not sure what Stanford is doing (it may be a lot more complicated than boosting the one field the committee was proposing), but I think their title search results for al-Khaṣāʼiṣ , JAMA, Radiographics, and Traditio are more in line with the behavior our users are expecting.

It's hard to talk about possible consequences in the abstract. We were hoping to ultimately alleviate the concern of creating unfavorable consequences by conducting thorough testing in the blackcat-test environment. Because production and blackcat-test are using the same indexes, the committee members would be able to run side by side comparisons to make sure the results are acceptable...or not.

tclayton33 commented 1 year ago

@abelemlih and @rotated8 Pardon my newbie question, but that score that Ayoub assigned is just to complete the research for the spike, correct?

emory-libraries / blacklight-catalog

Spike: Explore the possibility of boosting the precise title field to get better results with exact title searching #1346