USDA / USDA-APIs

Do you have feedback, ideas, or questions for USDA APIs? Use this repository's Issue Tracker to join the discussion.
www.usda.gov/developer

Search algorithm is not inclusive enough #82

Open seanharr11 opened 4 years ago

seanharr11 commented 4 years ago

This may be a bit of an opinion - but if I send generalSearchInput of mushroo to the Food Search API, I get 0 results back. However, sending mushroom to the API yields many, many results.

Is there a similarity algorithm being used here that yields 0 results? In my experience, a case-insensitive substring match typically does the trick for most searches, especially when it is supplemented by something more clever (like trigram or Levenshtein matching).
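To make the suggestion concrete, here is a minimal sketch of a case-insensitive substring match with an edit-distance fallback. It is only an illustration of the idea, not how the Food Search API works; the `search` helper, the `max_distance` default, and the sample food names are all assumptions made up for the example.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def search(query: str, descriptions: list[str], max_distance: int = 2) -> list[str]:
    """Match by case-insensitive substring first, then by per-word edit distance."""
    q = query.lower()
    hits = []
    for text in descriptions:
        lowered = text.lower()
        words = re.findall(r"[a-z]+", lowered)  # drop punctuation such as commas
        if q in lowered:
            hits.append(text)
        elif any(levenshtein(q, word) <= max_distance for word in words):
            hits.append(text)
    return hits


# Hypothetical sample data, not taken from the FDC database.
foods = ["Mushrooms, white, raw", "Mushroom soup", "Chicken, ground, raw"]
print(search("mushroo", foods))   # substring path
print(search("mishroom", foods))  # edit-distance path
```

With this sample data, both the truncated query and the misspelled one find the two mushroom entries and skip the chicken one.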

littlebunch commented 4 years ago

@seanharr11 My Lucene is a little rusty but I believe Levenshtein is baked in by default. In Elasticsearch, I think you do need to explicitly perform a fuzzy query. I suspect this is not happening here. In the case of 'mushroo' my guess is no results are returned because the stemmer doesn't resolve a root. You could use a wildcard, e.g. 'mushroo*', which returns the same result count as 'mushroom'.

Kind of interesting: if I send a fuzzy query for 'mushroo' against the FDC index I get 1,537 hits, as opposed to 1,534 hits with the wildcard query and 0 hits without either. I assume there are 3 documents that contain some misspelling of mushroom, which would account for the difference between fuzzy and wildcard.
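For reference, the two kinds of request being compared could look roughly like this in Elasticsearch Query DSL. This is a sketch under assumptions: the thread does not confirm what engine backs the FDC search, and the field name `description` is a placeholder rather than the real index mapping.

```python
import json

FIELD = "description"  # hypothetical field name; the real mapping is not known


def fuzzy_query(term: str, fuzziness: int = 2) -> dict:
    """Term-level fuzzy query: matches terms within the given edit distance."""
    return {"query": {"fuzzy": {FIELD: {"value": term, "fuzziness": fuzziness}}}}


def wildcard_query(prefix: str) -> dict:
    """Wildcard query: matches any term that starts with the prefix."""
    return {"query": {"wildcard": {FIELD: {"value": prefix + "*"}}}}


# Print the request bodies that would be POSTed to a _search endpoint.
print(json.dumps(fuzzy_query("mushroo"), indent=2))
print(json.dumps(wildcard_query("mushroo"), indent=2))
```

The fuzzy form can pick up misspelled terms that do not share the prefix, which fits the small difference in hit counts above.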

My personal opinion is that "fuzziness" should be used for default queries since users do make typos. For my demo site, which uses a search engine similar to Lucene with "fuzziness" distance of 2 by default, this request:

curl -XPOST -H "Content-type:application/json" https://go.littlebunch.com/v1/foods/search -d '{ "q":"mushroo","searchfield":"foodDescription","page":0,"max":10}'

returns the same result set as searches on "mushroom" or "mushroum" or "mishroom" etc.

seanharr11 commented 4 years ago

Ok great - wasn't sure what the search service was under-the-hood. That said, it appears that once we throw a wildcard '*' in there (e.g. `mushroo*`), we get the results back in alpha order, as opposed to being sorted by relevance. Any insight there?

And I agree on your "fuzziness" default, I've tried combinations of "Some fuzzy results" and "same exact results" and had decent luck as well!

littlebunch commented 4 years ago

My guess is that wildcard queries are functionally equivalent to a "match-all" query. Just guessing again but to get scoring you'd have to design some sort of boolean query to add a boost for the term for which you're really looking, e.g. 'mushroo* and boost documents with mushroom'. These kinds of machinations are hard to do for a public API. I believe similarity queries are scored. Interesting ....
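A sketch of that kind of boolean query, again assuming an Elasticsearch-style backend, which the thread does not confirm. The wildcard clause selects the documents and an optional `should` clause adds score for the full term. As far as I know, Lucene-based engines rewrite wildcard queries to constant-score queries by default, which would also explain results coming back unranked. The field name `description` and the boost value are placeholders.

```python
import json


def boosted_prefix_query(prefix: str, preferred_term: str,
                         field: str = "description",  # hypothetical field name
                         boost: float = 2.0) -> dict:
    """Require a prefix match, and score documents with the full term higher."""
    return {
        "query": {
            "bool": {
                # Every hit must contain a term starting with the prefix.
                "must": [{"wildcard": {field: {"value": prefix + "*"}}}],
                # Optional clause: only affects ranking, not which docs match.
                "should": [{"match": {field: {"query": preferred_term,
                                              "boost": boost}}}],
            }
        }
    }


print(json.dumps(boosted_prefix_query("mushroo", "mushroom"), indent=2))
```

The catch for a public API is the one already mentioned: the caller has to know which full term to boost, which is not obvious from a partial input.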

SteveCEms commented 4 years ago

I would like to put in a vote for at least a substring match and better yet a fuzzy search. Most of my users use a phone as the interface to the FDC database, so errors in typing are frequent. A fuzzy search would be ideal in this case.

SlySy1 commented 4 years ago

Another thing about the search functionality that puzzles me is that even when searching in "match any word" mode the order of the words you give matters. How can "ground chicken" yield a lot more matches than "chicken ground"? Maybe I'm missing something here?

littlebunch commented 4 years ago

For me {"query":"ground chicken","pageNumber":1,"pageSize":1} returns the same number of results as {"query":"chicken ground","pageNumber":1,"pageSize":1} -- 19,807. However, {"query":"\"chicken ground\"","pageNumber":1,"pageSize":1} returns a different result set than {"query":"\"ground chicken\"","pageNumber":1,"pageSize":1} because the additional quotes indicate the terms should be searched as a phrase and so word order matters.
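The difference between the quoted and unquoted forms can be illustrated with a small sketch. This is a simplified model of "any word" versus phrase matching, not the FDC implementation, and the food names are made up for the example.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_any_word(query: str, text: str) -> bool:
    """Unquoted query: a hit if any query word appears, in any order."""
    return bool(set(tokens(query)) & set(tokens(text)))


def matches_phrase(phrase: str, text: str) -> bool:
    """Quoted query: the words must appear adjacent and in the same order."""
    p, t = tokens(phrase), tokens(text)
    if not p:
        return False
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


# Hypothetical sample data.
foods = ["Chicken, ground, raw", "Ground chicken patties", "Beef, ground"]

for q in ("ground chicken", "chicken ground"):
    any_hits = [f for f in foods if matches_any_word(q, f)]
    phrase_hits = [f for f in foods if matches_phrase(q, f)]
    print(q, "| any word:", len(any_hits), "| phrase:", phrase_hits)
```

In this model the any-word counts are identical for both word orders, while each phrase query matches a different entry.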

SteveCEms commented 2 years ago

Just decided to add a wildcard in my searches in the database. When I enter apple butter in my app I send apple* butter* to the api. This does not return any hits with the words "apple butter". It does return hits with applebutter, applewood, applesauce, buttery, etc. If I enter appl* butte* the search returns results with apple butter. Of course this is not what I expected; it seems the wildcard does not include a space as a match. I can't use wildcards in my search unless I remove the last character of each word and replace it with an * like this: appl* butte*.
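The workaround described above (drop the last character of each word and append a wildcard) can be sketched as a small client-side helper. The function name is made up for illustration, and the behavior it works around is only what was observed in this comment, not documented API behavior.

```python
def truncate_to_wildcards(user_input: str) -> str:
    """Replace the last character of each word with a '*' wildcard."""
    terms = []
    for word in user_input.split():
        # Keep single-character words intact so the term is never just "*".
        stem = word[:-1] if len(word) > 1 else word
        terms.append(stem + "*")
    return " ".join(terms)


print(truncate_to_wildcards("apple butter"))
```

Note that this loosens the match: a term like `appl*` would also match other words sharing that stem, so results may need filtering on the client.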

SteveCEms commented 2 years ago

I found another issue: it seems the FDC search does not find foods with "chocolate," in the name. For example, when you just search for "chocolate" you only get foods with "chocolate ". If you don't know this while searching, you may just ignore commas in food names. I don't have a good solution to this for my users.
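One possible explanation, offered purely as a guess since the FDC analyzer is not documented in this thread, is that the index splits names on whitespace only, so the comma stays attached to the word. The sketch below shows how that would make "chocolate," a different token from "chocolate". The food name is a made-up example.

```python
import re


def whitespace_tokens(name: str) -> list[str]:
    """Split on whitespace only; punctuation stays attached to words."""
    return name.lower().split()


def word_tokens(name: str) -> list[str]:
    """Split on anything that is not a letter or digit, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", name.lower())


name = "Candies, milk chocolate, with almonds"  # hypothetical food name

print("chocolate" in whitespace_tokens(name))
print("chocolate" in word_tokens(name))
```

If this is what is happening, the fix would have to be on the indexing side; a client can only approximate it, for example by also sending a wildcard form of the term.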