CottageLabs / idfind

An identifier identifier

Certain identifiers cause an ES exception #4

Open emanuil-tolev opened 12 years ago

emanuil-tolev commented 12 years ago

Try identifying "-" (a dash, without the quotes) on the web front-end. ElasticSearch tries to parse the dash as query syntax and throws an exception. The query needs to be escaped in some way.

Only relevant thing I found was this ElasticSearch issue: https://github.com/elasticsearch/elasticsearch/issues/41

Which means that ES should allow escaping. As far as I understand, that would work by adding a field to the query_string JSON object: "escape": true or "escape": 1. The trouble is that I'm not sure how to modify the query_string JSON object. It seems to me that dao.DomainObject.query() just takes q="string" and gives it to pyes, which turns it into a query_string object with the "query" field set to "string". I just can't quite grasp how to add "escape": true in this flow.

Any help? Looking at the pyes pydoc didn't yield an unexpected revelation...
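For reference, one way around the question of how to set an option on the query_string object would be to escape the string on the client before it reaches pyes at all. This is only a minimal sketch, not what idfind does: the helper name escape_query_string is made up for illustration, and the character list is the set of special characters documented for the Lucene query parser.

```python
import re

# Characters (and the two-character operators && and ||) that the Lucene
# query parser treats as syntax rather than as literal text.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_query_string(text):
    """Backslash-escape Lucene query syntax so the text is taken literally.

    Hypothetical helper, not part of idfind or pyes.
    """
    return _LUCENE_SPECIAL.sub(r'\\\1', text)


if __name__ == "__main__":
    for raw in ["-", "10.1000/182", "(Forenames Surname|xxx@somewhere.ac.uk)"]:
        print(repr(raw), "->", repr(escape_query_string(raw)))
```

The escaped string could then be passed as q to dao.DomainObject.query() in place of the raw input. Whether an escaped "-" then matches anything useful will still depend on how the field is analyzed, so this would only address the parse exception.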

emanuil-tolev commented 12 years ago

We might look at that for dev8d, but decided that we don't necessarily care enough about this edge case to fix it right now. It's not critical.

emanuil-tolev commented 12 years ago

UPDATE: the problem described in this comment turned out to be a separate issue and was resolved accordingly. The last comment below should give you the current status of the original issue.

Interesting, another problem: trying to identify the string "car insurance systems" fails via both GET and POST with:

pyes.urllib3.connectionpool.MaxRetryError MaxRetryError: Max retries exceeded for url: /idfind/uidentifier/car insurance system

I'm thinking the spaces have something to do with it.

emanuil-tolev commented 12 years ago

Another one: trying to identify "(Forenames Surname|forenames.surname@gmail.com|xxx@somewhere.ac.uk)" via both GET and POST gives:

pyes.urllib3.connectionpool.MaxRetryError MaxRetryError: Max retries exceeded for url: /idfind/uidentifier/(Forenames Surname|forenames.surname@gmail.com|xxx@somewhere.ac.uk)

emanuil-tolev commented 12 years ago

Okay, so all those errors were caused by commit 8f9746c which tried to prevent duplicates in the unknown identifiers (document type uidentifier) by assigning the identifier string as the document id. That doesn't sit very well with spaces and other "special" characters.

All fixed now, but we now get a record in the index for every unsuccessful attempt to identify something. Note: every attempt (so if I try 'lalala' 5 times, we get 5 docs in the index). We don't really want duplicates, since we want to run some sort of background processing to identify those unknowns using newly submitted tests in the future, but we'll have to solve this some other way.
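One possible "other way", sketched here purely as an assumption and not as anything that was implemented: keep the idea of a deterministic document id, but derive it from a hash of the identifier string instead of using the raw string. The id then contains only hex digits, so spaces and other "special" characters never end up in the URL, and repeated attempts map to the same document. The function name uidentifier_doc_id is hypothetical.

```python
import hashlib


def uidentifier_doc_id(identifier):
    """Return a deterministic, URL-safe document id for an identifier string.

    Hypothetical helper: the SHA-1 hex digest is 40 characters from
    [0-9a-f], so it is safe to use in a path like /idfind/uidentifier/<id>.
    """
    return hashlib.sha1(identifier.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for raw in ["lalala", "lalala", "car insurance systems"]:
        print(repr(raw), "->", uidentifier_doc_id(raw))
```

Indexing with this id would overwrite the existing document on a repeated attempt rather than create a new one, so trying 'lalala' 5 times would leave one doc. The original string would still need to be stored as a field in the document, since it cannot be recovered from the hash.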

emanuil-tolev commented 12 years ago

This issue stays open, as the original ES exception is still not fixed AFAIK.

emanuil-tolev commented 12 years ago

MaxRetryErrors above have been fixed (by ef930d2be36e8303ee732bebf09fd38e798bc576 I suspect).