MrDys / blacklight

Blacklight Plugin
http://projectblacklight.org/
Other
1 stars 1 forks source link

Remove leading space and + in LocalParams query string in BL plugin and advanced search plugin #439

Closed MrDys closed 12 years ago

MrDys commented 12 years ago

CODEBASE-281: Original report is as follows (from Molly):

After switching to using solr LocalParams for my development version of VIRGO, I noticed that relevancy for fielded searching was not working as well as it had in my older version. For example, in my old version, a title search for "Time" would produce results where items with the title "Time" would float to the top, over, say items like "The Pennsylvania almanack [electronic resource], for the year ... 1790, ... By Mark Time." That's what I would expect, but using version of Blacklight that relies on LocalParams, "The Pennsylvania almanack [electronic resource], for the year ... 1790, ... By Mark Time" was my top hit for a title search on "time."

By examining my solr logs, I could see that the query that was generated looked like this:

sort=score+desc,+year_multisort_i+desc&q={!qf=$qf_title+pf=$pf_title}+time&qt=search

If I removed the plus before "time", everything worked as I would expect, relevancy-wise.

I noticed a similar thing with Advanced Search, where I would see the following from doing a title search:

sort=score+desc,+year_multisort_i+desc&q=query:"{!dismax+mm=1+qf=$qf_title+pf=$pf_title}%2Btime"&qt=search&defType=lucene

If I removed the %2B before "time", everything worked as I would expect, relevancy-wise.

So, I think that those spaces are causing problems, at least for the way my requestHandlers are configured. I located where those spaces are added in blacklight/lib/blacklight/solr_helper.rb and blacklight_advanced_search/lib/blacklight_advanced_search/dismax_query_parser.rb, so I'd be happy to change them and commit them to core, if this seems like the right thing to do to you all, but I figured I'd check in first. Anyone have any thoughts?


Follow-up from Erik Hatcher:

In IRC Molly and I chatted about this a bit further the other day. Turns out the space becomes quite significant when you've got a "string" field in your qf. The space is taken literally (since the analysis, or not in this case, is taken into account). Molly provided an example that parsed like this:

+DisjunctionMaxQuery((uniform_title_text:time | title_added_entry_text:time | title_text:time^20.0 | notes_text:time | subtitle_text:time^15.0 | series_title_text:time | alternate_form_title_text:time)~0.01) DisjunctionMaxQuery((author_unstem_text:time^20.0 | text:time^15.0 | title_facet: time^100.0 | title_unstem_text:time^20.0 | subject_unstem_text:time^20.0)~0.01)

Note the "title_facet: time^100.0", which likely doesn't match anything. Maybe it does match without the space though? (is title_facet lowercased such that Time magazine matches, maybe?)

MrDys commented 12 years ago

Original reporter: darthmolly

MrDys commented 12 years ago

jrochkind: I understand and can fix the issue in core Blacklight, which is just a space (that gets URI-encoded to "+").

Once I've done that, fi the problem still remains in advanced search plugin, please file another ticket, that's something else, I don't entirely understand.

MrDys commented 12 years ago

jrochkind: Oh, and if you want to change it yourslef, I say go ahead with the change to remove the space in blacklight/lib/blacklight/solr_helper.rb .

And add test to confirm that no space is added.

I don't think you can do the same in advanced search plugin though. I think the advanced search plugin may require the "+" for things to work properly -- it's possible the design of the advanced search plugin is fundamentally incompatible with a non-dismax field and/or a non-tokenized field. But I still don't entirely understand why the advanced search example query causes problems, I don't think the "+" should cause any problems, maybe we need Erik to help us figure it out again.

MrDys commented 12 years ago

darthmolly: On my local copy of advanced search, I modified this line:

text << value.gsub(/(|)/,"").gsub("AND","").split.collect{|w| (w.strip[0,1] == "+" or w.strip[0,1] == "-") ? w : "+#{w}"}.join("

to remove the last plus, as follows:

text << value.gsub(/(|)/,"").gsub("AND","").split.collect{|w| (w.strip[0,1] == "+" or w.strip[0,1] == "-") ? w : "#{w}"}.join("

and then it works as expected.

MrDys commented 12 years ago

jrochkind: Does it still pass all the tests? It makes it work as expected in your case, but I think it will break other cases. I think the "+" for "mandatory" is actually required in order to get the proper semantics from the advanced search plugin. But Jessie/Naomi would have to be talked to, they're the ones that understand what's supposed to be going on there -- or if your change makes it fails existing tests, that's a clue.

I believe that if you remove that "+" from the advanced search plugin, you will break it for other cases, while fixing it (for reasons we don't entirely understand) for yours.

On 1/25/2011 5:24 PM, Molly Pickral (JIRA) wrote:

 [ http://jira.projectblacklight.org/jira/browse/CODEBASE-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11571#action_11571 ]

Molly Pickral commented on CODEBASE-281:

On my local copy of advanced search, I modified this line:

text<< value.gsub(/(|)/,"").gsub("AND","").split.collect{|w| (w.strip[0,1] == "+" or w.strip[0,1] == "-") ? w : "+#{w}"}.join("

to remove the last plus, as follows:

text<< value.gsub(/(|)/,"").gsub("AND","").split.collect{|w| (w.strip[0,1] == "+" or w.strip[0,1] == "-") ? w : "#{w}"}.join("

and then it works as expected.

Remove leading space and + in LocalParams query string in BL plugin and advanced search plugin

             Key: CODEBASE-281
             URL: http://jira.projectblacklight.org/jira/browse/CODEBASE-281
         Project: Blacklight Plugin
      Issue Type: Bug
        Reporter: Molly Pickral
        Assignee: Jonathan Rochkind
         Fix For: 2.9

Original report is as follows (from Molly): After switching to using solr LocalParams for my development version of VIRGO, I noticed that relevancy for fielded searching was not working as well as it had in my older version. For example, in my old version, a title search for "Time" would produce results where items with the title "Time" would float to the top, over, say items like "The Pennsylvania almanack [electronic resource], for the year ... 1790, ... By Mark Time." That's what I would expect, but using version of Blacklight that relies on LocalParams, "The Pennsylvania almanack [electronic resource], for the year ... 1790, ... By Mark Time" was my top hit for a title search on "time." By examining my solr logs, I could see that the query that was generated looked like this: sort=score+desc,+year_multisort_i+desc&q={!qf=$qf_title+pf=$pf_title}+time&qt=search If I removed the plus before "time", everything worked as I would expect, relevancy-wise. I noticed a similar thing with Advanced Search, where I would see the following from doing a title search: sort=score+desc,+year_multisort_i+desc&q=query:"{!dismax+mm=1+qf=$qf_title+pf=$pf_title}%2Btime"&qt=search&defType=lucene If I removed the %2B before "time", everything worked as I would expect, relevancy-wise. So, I think that those spaces are causing problems, at least for the way my requestHandlers are configured. I located where those spaces are added in blacklight/lib/blacklight/solr_helper.rb and blacklight_advanced_search/lib/blacklight_advanced_search/dismax_query_parser.rb, so I'd be happy to change them and commit them to core, if this seems like the right thing to do to you all, but I figured I'd check in

first. Anyone have any thoughts?

Follow-up from Erik Hatcher: In IRC Molly and I chatted about this a bit further the other day. Turns out the space becomes quite significant when you've got a "string" field in your qf. The space is taken literally (since the analysis, or not in this case, is taken into account). Molly provided an example that parsed like this: +DisjunctionMaxQuery((uniform_title_text:time | title_added_entry_text:time | title_text:time^20.0 | notes_text:time | subtitle_text:time^15.0 | series_title_text:time | alternate_form_title_text:time)~0.01) DisjunctionMaxQuery((author_unstem_text:time^20.0 | text:time^15.0 | title_facet: time^100.0 | title_unstem_text:time^20.0 | subject_unstem_text:time^20.0)~0.01) Note the "title_facet: time^100.0", which likely doesn't match anything. Maybe it does match without the space though? (is title_facet lowercased such that Time magazine matches, maybe?)

MrDys commented 12 years ago

darthmolly: Yeah, I see where the pluses are relevant for certain kinds of searching. I'm also seeing some strange stuff with the query that gets generated. Here's the query string from my solr logs:

sort=score+desc,+year_multisort_i+desc&q=query:%22{!dismax+mm%3D1+qf%3D$qf_journal_title+pf%3D$pf_journal_title}%2Btime%22&qt=search&defType=lucene

If I use that directly in solr with debugQuery=true, I see this as the generated query:

+(+DisjunctionMaxQuery((journal_title_text:time)~0.01)) DisjunctionMaxQuery((author_unstem_text:time^20.0 | journal_title_facet:+time^100.0 | text:time^15.0 | title_unstem_text:time^20.0 | subject_unstem_text:time^20.0)~0.01)

Notice that it's "time" for most of the fields, but "+time" for journal_title_facet

MrDys commented 12 years ago

jrochkind: I think most of that code, which is kind of crazy and hard to debug/modify, is about transforming user-entered "AND" and "OR" into something dismax likes. I wonder if we (or all users) really need to support "AND" and "OR" and "NOT" anyway -- instead we could just tell users to use "+" and "-". That would allow for much simpler code, and eliminate the code that's putting the "+" in, perhaps erroneously.

The advanced search plugin already has a great architecture for plug-in-able parsers, we could write one that didn't do anything with "AND"/"OR", but just took the individual dismax parts direct and nests them in a lucene query.

I'm still a bit nervous about the fact that we don't really understand what's going on -- I don't entirely understand why the "+" is messing up the query, it doens't seem like it should hurt the query, I'd feel better about implementing a solution if I understood the problem, which I don't really. But making a simpler parser that doens't try to translate AND/OR/NOT probably can't hurt.

But as far as the FIRST problem with core (I think it's actually two unrelated problems), not advanced, we should make sure to commit a test with a fix. You want to do that part, or should I try to find time?

MrDys commented 12 years ago

darthmolly: I agree with Jonathan that we should treat BL plugin issue and BL advanced search plugin issue as separate.

I've committed code and tests to resolve the BL plugin issue.