Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

As a user, I want my search results from the Title field to prioritize unstemmed matches and boost title over subtitle. #221

Closed mnaydan closed 5 years ago

mnaydan commented 5 years ago

Notes for testing

mnaydan commented 5 years ago
  1. Title field searches on title and subtitle now.
  2. Title matches are generally boosted over subtitle matches, but not consistently. Is this just the way Solr works, or does it need fixing? For instance, searching "elements" yields a subtitle match as result number 1, despite many main title matches. (Note: that subtitle result has 82 internal full-text keyword matches, as opposed to 5 for result number 2, so maybe that's why its relevance score is higher? But searching "lays of ancient" also ranks a subtitle match above a title match, and there the relevance ranking doesn't correspond to more full-text matches, so there goes that theory.) Searching "prosody" and "treatise" generally yields main title results first, but again, some subtitle matches appear among the title matches.
  3. Unstemmed matches are boosted over stemmed ones (tested with "element").
  4. Multiple terms, quoted phrases, and Boolean operators are working (tested with "art of English" with and without quotes, and with "lays of ancient").

I'll let you tell me whether what I've found for number 2 sounds about right or is a bug.

rlskoeser commented 5 years ago

@mnaydan I saw some similar behavior, with subtitle matches sometimes showing before title matches, but I wasn't sure how often it was happening and wanted to get your eyes on it. Title should be boosted higher than subtitle based on the config I've got, but I looked again and the boost values aren't all that different. I will try changing them to really emphasize title over subtitle and see if that makes a difference in this behavior.

rlskoeser commented 5 years ago

@mnaydan I updated the test site with different boosting; see what you think. One of my test search terms was "lives", and it does seem to be working better, although now a stemmed title match is beating out an unstemmed subtitle match. I thought maybe the unstemmed matches were causing the behavior we saw before, with subtitles showing first, and adjusted the boost levels accordingly. Maybe that's ok?
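
For illustration, here is a minimal sketch of the kind of boosting described above, written as Solr eDisMax query parameters built in Python. The field names (`title_nostem`, `title`, `subtitle_nostem`, `subtitle`) and the boost values are assumptions made up for this example, not the project's actual configuration. The only point is the ordering: every title field is weighted above every subtitle field, and within each, the unstemmed variant is weighted above the stemmed one.

```python
# Hypothetical field names and boost values, for illustration only;
# these are NOT taken from the project's Solr configuration.
TITLE_BOOSTS = {
    "title_nostem": 10.0,    # unstemmed match in the main title
    "title": 6.0,            # stemmed match in the main title
    "subtitle_nostem": 2.0,  # unstemmed match in the subtitle
    "subtitle": 1.0,         # stemmed match in the subtitle
}


def build_qf(boosts):
    """Format field boosts as an eDisMax ``qf`` value, e.g. 'a^2 b^1'."""
    return " ".join(f"{field}^{boost:g}" for field, boost in boosts.items())


def title_search_params(query, boosts=TITLE_BOOSTS):
    """Return Solr request parameters for a boosted title/subtitle search."""
    return {
        "defType": "edismax",    # use the eDisMax query parser
        "q": query,              # user-entered search terms
        "qf": build_qf(boosts),  # fields to search, with per-field boosts
    }


params = title_search_params("lives")
print(params["qf"])
```

Because a boost only multiplies each field's relevance score, which also depends on term frequency and field length, a small gap between title and subtitle boosts can still let a subtitle match outrank a title match. Widening the gap, as in the sketch, makes the title preference far more consistent, at the cost that a stemmed title match can outrank an unstemmed subtitle match.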

mnaydan commented 5 years ago

@rlskoeser this does seem way better. I redid the searches I used before, plus the "lives" one, and it's definitely prioritizing title over subtitle now. I think I'm fine with stemmed title matches appearing before unstemmed subtitle matches. There's the question of whether quoted single terms (like "lives") should be yielding stemmed matches at all, but it is generally prioritizing unstemmed matches, so I think it's ok if you do.

rlskoeser commented 5 years ago

@mnaydan great, thanks for the careful testing. I think what we have is working pretty well (certainly better than the previous functionality), and we can always revisit later if we want to refine.