Open seanharr11 opened 4 years ago
@seanharr11 My Lucene is a little rusty, but I believe Levenshtein is baked in by default. In Elasticsearch, I think you do need to explicitly perform a fuzzy query, and I suspect that is not happening here. In the case of 'mushroo', my guess is that no results are returned because the stemmer doesn't resolve a root. You could use a wildcard, e.g. 'mushroo*', which returns the same result count as 'mushroom'.
Interestingly, if I send a fuzzy query for 'mushroo' against the FDC index, I get 1,537 hits, as opposed to 1,534 hits with the wildcard query and 0 hits without either. I assume there are 3 documents that contain some misspelling of mushroom, which would account for the difference between fuzzy and wildcard.
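To make the fuzzy-versus-wildcard difference concrete, here is a minimal Python sketch. It uses plain Levenshtein distance and a simple prefix check; the function names and the sample terms are made up for illustration, and a real engine may use a different edit-distance variant (for example, one that also counts transpositions as a single edit).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def wildcard_match(prefix: str, term: str) -> bool:
    """Behaves like 'prefix*': the term must start with the prefix."""
    return term.startswith(prefix)


def fuzzy_match(query: str, term: str, max_edits: int = 2) -> bool:
    """Matches when the term is within max_edits edits of the query."""
    return levenshtein(query, term) <= max_edits


# Hypothetical indexed terms, including two misspellings.
terms = ["mushroom", "mushrooms", "mushrom", "mushroum", "mushy"]
for term in terms:
    print(term, wildcard_match("mushroo", term), fuzzy_match("mushroo", term))
```

Misspellings such as "mushrom" are within the edit-distance limit but do not start with "mushroo", which is one way a fuzzy query could return a few more hits than the wildcard query.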
My personal opinion is that "fuzziness" should be used for default queries, since users do make typos. For my demo site, which uses a search engine similar to Lucene with a "fuzziness" distance of 2 by default, this request:
curl -XPOST -H "Content-type:application/json" https://go.littlebunch.com/v1/foods/search -d '{ "q":"mushroo","searchfield":"foodDescription","page":0,"max":10}'
returns the same result set as searches on "mushroom", "mushroum", "mishroom", etc.
Ok great - wasn't sure what the search service was under the hood. That said, it appears that once we throw a wildcard '*' in there (e.g. `mushroo*`), we get the results back in alpha order, as opposed to being sorted by relevance. Any insight there?
And I agree on your "fuzziness" default, I've tried combinations of "Some fuzzy results" and "same exact results" and had decent luck as well!
My guess is that wildcard queries are functionally equivalent to a "match-all" query. Just guessing again but to get scoring you'd have to design some sort of boolean query to add a boost for the term for which you're really looking, e.g. 'mushroo* and boost documents with mushroom'. These kinds of machinations are hard to do for a public API. I believe similarity queries are scored. Interesting ....
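As a rough sketch of the "wildcard plus boost" idea, the following Python builds a request body in Elasticsearch's query DSL: a `bool` query whose `must` clause is the wildcard and whose `should` clause boosts documents matching the full term. This assumes an Elasticsearch-style backend that accepts raw DSL, which the public FDC API may not; the field name `description` and the helper name are hypothetical.

```python
import json


def boosted_prefix_query(field: str, prefix: str, preferred: str,
                         boost: float = 2.0) -> dict:
    """Build a bool query: the wildcard selects, the boosted match ranks."""
    return {
        "query": {
            "bool": {
                # Every hit must match the wildcard pattern.
                "must": [{"wildcard": {field: {"value": prefix + "*"}}}],
                # Hits that also match the full term score higher.
                "should": [{"match": {field: {"query": preferred,
                                              "boost": boost}}}],
            }
        }
    }


# "description" is an assumed field name, not taken from the FDC schema.
body = boosted_prefix_query("description", "mushroo", "mushroom")
print(json.dumps(body, indent=2))
```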
I would like to put in a vote for at least a substring match and better yet a fuzzy search. Most of my users use a phone as the interface to the FDC database, so errors in typing are frequent. A fuzzy search would be ideal in this case.
Another thing about the search functionality that puzzles me is that even when searching in "match any word" mode the order of the words you give matters. How can "ground chicken" yield a lot more matches than "chicken ground"? Maybe I'm missing something here?
For me, `{"query":"ground chicken","pageNumber":1,"pageSize":1}` returns the same number of results as `{"query":"chicken ground","pageNumber":1,"pageSize":1}`: 19,807. However, `{"query":"\"chicken ground\"","pageNumber":1,"pageSize":1}` returns a different result set than `{"query":"\"ground chicken\"","pageNumber":1,"pageSize":1}` because the additional quotes indicate the terms should be searched as a phrase, and so word order matters.
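The distinction between any-word matching and phrase matching can be sketched in a few lines of Python. This is only an illustration of the concept, not how the FDC search is implemented; the tokenizer and the sample description are assumptions.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_any_word(query: str, doc: str) -> bool:
    """True if at least one query token occurs in the document (order ignored)."""
    doc_tokens = set(tokens(doc))
    return any(t in doc_tokens for t in tokens(query))


def matches_phrase(query: str, doc: str) -> bool:
    """True only if the query tokens occur consecutively, in the same order."""
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc = "Chicken, ground, raw"  # made-up food description
for query in ("ground chicken", "chicken ground"):
    print(query, matches_any_word(query, doc), matches_phrase(query, doc))
```

Both word orders match in any-word mode, but only "chicken ground" matches this description as a phrase.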
I just decided to add a wildcard to my searches in the database. When I enter apple butter in my app, I send `apple* butter*` to the API. This does not return any hits with the words "apple butter". It does return hits with applebutter, applewood, applesauce, buttery, etc. If I enter appl butte, the search returns results with apple butter. Of course this is not what I expected; it seems the wildcard does not include a space as a match. I can't use wildcards in my search unless I remove the last character of each word and replace it with an `*`, like this: `appl* butte*`.
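For anyone wanting to try the same workaround client-side, it could be sketched like this in Python. The helper name is hypothetical, and whether trimming a character is desirable at all depends on how the API actually treats the wildcard.

```python
def trim_to_wildcards(user_input: str) -> str:
    """Drop the last character of each word and append a wildcard."""
    parts = []
    for word in user_input.split():
        # Keep single-character words intact so a stem is never empty.
        stem = word[:-1] if len(word) > 1 else word
        parts.append(stem + "*")
    return " ".join(parts)


print(trim_to_wildcards("apple butter"))
```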
I found another issue, it seems the FDC search does not find foods with "chocolate," in the name. For example, when you just search for "chocolate" you only get foods with "chocolate ". If you don't know this while searching, you may just ignore commas in food names. I don't have a good solution to this for my users.
This may be a bit of an opinion - but if I send `generalSearchInput` of `mushroo` to the Food Search API, I get 0 results back. However, sending `mushroom` to the API yields many, many results. Is there a similarity algorithm being used here that yields 0 results? In my experience, a case-insensitive substring match typically does the trick for most searches... especially when it is supplemented by something more clever (like trigram, Levenshtein, etc.).
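As a sketch of what "substring match supplemented by trigrams" might look like, the Python below treats a case-insensitive substring hit as a full match and otherwise falls back to trigram (Jaccard) similarity against each word of the description. The padding scheme, the threshold, and the sample descriptions are assumptions chosen for illustration.

```python
def trigrams(s: str) -> set[str]:
    """Character trigrams of the lowercased string, padded at both ends."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def search(query: str, descriptions: list[str],
           threshold: float = 0.3) -> list[str]:
    """Substring hits score 1.0; otherwise use the best per-word trigram score."""
    q = query.lower()
    scored = []
    for desc in descriptions:
        if q in desc.lower():
            score = 1.0
        else:
            score = max((trigram_similarity(q, w) for w in desc.split()),
                        default=0.0)
        if score >= threshold:
            scored.append((score, desc))
    scored.sort(key=lambda pair: -pair[0])  # stable sort, best score first
    return [desc for _, desc in scored]


# Made-up descriptions for illustration.
foods = ["Mushrooms, white, raw", "Soup, cream of mushroom", "Chicken, ground, raw"]
print(search("mushroo", foods))  # found by the substring match
print(search("mushrom", foods))  # found by the trigram fallback
```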