blakearchive / erdman

exact string searching issues #56

Open ba001 opened 7 years ago

ba001 commented 7 years ago

@queryluke if you have time, could you have a look at these holdover issues?

if you search on erdman "execution is the chariot of genius", with double quotes, you get no results. if you search execution is the chariot of genius, you get a correct result, but then a bunch of other results from the same page that contain some of the words

i think we'd like to be able to search exact strings using double quotes, and--@joeafletch correct me if i'm wrong--a hit should only get returned if all the words (unquoted) from a query match

ba001 commented 7 years ago

@queryluke ah, wait, i think the second of those issues is not actually an issue. i think what's happening is that if a poem or work contains all the words (execution, is, the, chariot, of, genius), then all the lines with all or some of those words will get returned, which is how it should work. so it's only the first issue concerning the exact string in quotes that's a problem
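A rough sketch of the behaviour described above, in plain Python rather than Solr: a work only qualifies if it contains every query word, and then any line sharing at least one of those words is shown. The function name `matching_lines` and the sample lines are made up for illustration; this is not how the erdman search is actually implemented.

```python
import re


def matching_lines(lines, query):
    """Return the lines to show for an unquoted query (illustrative only)."""
    words = set(query.lower().split())
    # Tokenize each line into a set of lowercase words.
    line_tokens = [set(re.findall(r"[a-z']+", line.lower())) for line in lines]
    work_tokens = set().union(*line_tokens)
    # Work-level test: every query word must occur somewhere in the work.
    if not words <= work_tokens:
        return []
    # Line-level test: keep any line that shares at least one query word.
    return [line for line, tokens in zip(lines, line_tokens) if tokens & words]


# Hypothetical three-line "work" used only as test data.
work = [
    "Execution is the chariot of genius",
    "The road of excess",
    "Sooner murder an infant",
]
print(matching_lines(work, "execution is the chariot of genius"))
print(matching_lines(work, "chariot zebra"))
```

The first call returns the first two lines (one full match, one partial match on "the" and "of"); the second returns nothing because "zebra" never occurs in the work.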

queryluke commented 7 years ago

I think the problem is spaces in the query not being url encoded. When I use the solr admin console to run the query, this is what the query string looks like: q=text_contents:"execution%20is%20the%20chariot%20of%20genius"

But I don't see that conversion anywhere in the scripts. There are several places you could do the encoding, using JavaScript's built-in encodeURIComponent.

The form is in https://github.com/blakearchive/erdman/blob/master/client/src/components/search-form.js so you could URL-encode the query there, before it's sent to the main controller.

When you submit, the contents of the form get sent to the main controller at https://github.com/blakearchive/erdman/blob/master/client/src/erdman.controller.js#L47 so you could URL-encode the query there instead.

Line 52 runs the search, which happens in https://github.com/blakearchive/erdman/blob/master/client/src/services.js#L19 so that's another option.

That file passes the query to the Python scripts: https://github.com/blakearchive/erdman/blob/master/server/erdman/service.py#L25 I'd say you could do it there too. Python handles URL encoding through urllib, which ships with the standard library, so it should already be available.
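For the Python option, here is a minimal sketch of percent-encoding a quoted phrase so that it matches the admin-console string quoted earlier. It assumes Python 3, where the function lives in `urllib.parse` (in Python 2 it was `urllib.quote`). The helper name `solr_phrase_param` is hypothetical, and nothing here is taken from service.py.

```python
from urllib.parse import quote


def solr_phrase_param(field, phrase):
    """Build a percent-encoded q= parameter for an exact-phrase search."""
    raw = '{}:"{}"'.format(field, phrase)
    # Keep ':' and '"' readable, as in the admin-console string;
    # spaces and other unsafe characters are percent-encoded.
    return "q=" + quote(raw, safe=':"')


print(solr_phrase_param("text_contents", "execution is the chariot of genius"))
# q=text_contents:"execution%20is%20the%20chariot%20of%20genius"
```

This is only needed if the code builds the URL by hand. If an HTTP or Solr client library builds the request, it may already do this encoding.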

ba001 commented 7 years ago

@queryluke hmm, when i tried using encodeURIComponent() and searching "keep it" (with quotes) i got this error on the site:

No results found for %22keep%20it%22
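One possible reading of that message is double encoding: if the HTTP layer already percent-encodes request parameters (an assumption, not confirmed in this thread), then encoding the query in the form as well means the server decodes only one layer and searches for the literal text `%22keep%20it%22`. The sketch below imitates that round trip with Python's `urllib.parse`, standing in for `encodeURIComponent` and the browser's own encoding.

```python
from urllib.parse import quote, unquote

user_input = '"keep it"'

# Step 1: manual encoding in the form (what encodeURIComponent would produce).
encoded_in_form = quote(user_input)

# Step 2: assumed second encoding when the HTTP layer builds the request URL.
encoded_by_http = quote(encoded_in_form)

# Step 3: the server decodes the request parameter once.
received = unquote(encoded_by_http)

print(encoded_in_form)   # %22keep%20it%22
print(encoded_by_http)   # %2522keep%2520it%2522
print(received)          # %22keep%20it%22
```

If that is what is happening, the query that reaches Solr is the percent-encoded text itself rather than the quoted phrase, which would explain both the empty result and the wording of the message.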

ba001 commented 7 years ago

tried putting it in search-form.js:

```js
onSubmit() {
    console.log('submitting');
    // Percent-encode the query before handing it to the search callback
    this.query = encodeURIComponent(this.query);
    this.onSearch({query: this.query});
}
```

queryluke commented 7 years ago

hmm, that probably means you'll have to tinker with the python scripts.

I know a lot about how php and ruby query solr, but I'm not familiar with python. For example, in php, if you curl something like http://localhost:8983/solr/core/select?q=title:%22aquatic%20plant%22&wt=json it works just fine.

But it looks like this python library (pysolr) sends a literal string? I don't know, seems odd. Maybe ask nathan about it?
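If pysolr does take the raw query string and handle URL encoding itself when it builds the request (worth checking against its documentation), then the server only needs to hand it a well-formed phrase query with the double quotes intact. A minimal sketch under that assumption follows. The helper `phrase_query` is hypothetical, and the commented-out call reuses the example URL above rather than anything from the erdman code.

```python
def phrase_query(field, phrase):
    """Build a raw (not percent-encoded) Solr exact-phrase query."""
    # Escape backslashes first, then embedded double quotes, so the
    # phrase cannot terminate the quoted string early.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return '{}:"{}"'.format(field, escaped)


print(phrase_query("title", "aquatic plant"))  # title:"aquatic plant"

# Hypothetical usage with pysolr (needs a running Solr core, so not run here):
#   import pysolr
#   solr = pysolr.Solr("http://localhost:8983/solr/core")
#   results = solr.search(phrase_query("title", "aquatic plant"))
```

The point of the sketch is that the quotes have to survive all the way to Solr unescaped and unencoded by the application code, with any encoding left to the library.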

ba001 commented 7 years ago

Thanks, Luke. @nathan-rice do you have any insight into this one?

ba001 commented 7 years ago

@nathan-rice just wanted to ask you about this one again. at the moment, we can't do exact string searches on erdman

nathan-rice commented 7 years ago

I pushed a fix for this a while ago.

ba001 commented 7 years ago

@nathan-rice ok, just deployed. looks good except for one thing: search "well well" and look at the results for page 15. one line there is split into two results. the actual line in the document begins "Well well...", so page 15 should just show that one line.

nathan-rice commented 7 years ago

The problem here is that solr is returning multiple "highlight" snippets for that search (each only containing a single "well") - not sure why, seems like a bug to me. I could fix this case by just merging all highlight snippets for any given poem, but that would almost certainly break a ton of other cases.
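A narrower alternative to merging every snippet for a poem would be to merge two consecutive snippets only when, joined together and with the highlight tags stripped, they appear verbatim in the source text. The sketch below assumes the snippets use `<em>` markers and keep their boundary whitespace; neither is confirmed for this setup. `merge_adjacent_snippets` and the sample page are made up for illustration.

```python
import re

EM_TAG = re.compile(r"</?em>")


def strip_tags(snippet):
    """Remove <em> highlight markers from a snippet."""
    return EM_TAG.sub("", snippet)


def merge_adjacent_snippets(snippets, source_text):
    """Merge consecutive snippets that are contiguous in source_text."""
    merged = []
    for snippet in snippets:
        if merged:
            candidate = merged[-1] + snippet
            # Only merge when the joined text really occurs in the source.
            if strip_tags(candidate) in source_text:
                merged[-1] = candidate
                continue
        merged.append(snippet)
    return merged


# Hypothetical page text and snippets, used only as test data.
page = "Well well the day is done\nDraw water from the well"
snippets = ["<em>Well</em> ", "<em>well</em> the day", "from the <em>well</em>"]
print(merge_adjacent_snippets(snippets, page))
```

Here the first two snippets are joined into one result, while the third stays separate because the joined text does not occur in the page. Snippets that Solr trims or reorders would defeat this check, so it is a starting point rather than a fix.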