Closed johartl closed 8 years ago
@johartl - The first requirement is quite straight-forward, and can be done using regular expressions in MongoDB queries. I tried a sample query and it works.
The second one, regarding partial/fuzzy string match, however seems to need something like ElasticSearch. I am looking into other alternatives of achieving this, but just in case, would like to know if it is fine to use this. @sacdallago ??
If you're only suggesting for like ~150 pokemon, then I suggest you forget about using elastic search - just compute the hamming distances in memory, or think of a simple matching algo. ES is a pretty big dependency!
On Sep 10, 2016 10:38 PM, "Swathi S Sunder" notifications@github.com wrote:
@johartl https://github.com/johartl - The first requirement is quite straight-forward, and can be done using regular expressions in MongoDB queries. I tried a sample query and it works.
The second one, regarding partial/fuzzy string match, however seems to need something like ElasticSearch https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html . I am looking into other alternatives of achieving this, but just in case, would like to know if it is fine to use this. @sacdallago https://github.com/sacdallago ??
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/PokemonGoers/PokeData/issues/136#issuecomment-246137967, or mute the thread https://github.com/notifications/unsubscribe-auth/AA1hd9Ih7871clExRrCTMUTjt4HkQZ1uks5qoxU0gaJpZM4J5xgI .
I guess we could as well just retrieve all Pokemon once and do the search on the mobile device. No need to call the API at all?
Agree with @phdowling , ES is a bit too much. I know that mongo doesn't have an out of the box solution for it, and running a brute force alg with substitutions for each character is pretty demanding.
But the hamming distance idea sounds like a good one, especially because we only have 151 names and computing it against this list should be quick.
An additional thing to keep in mind for the simple distance (which would penalise length mismatch (aka 'bulbs' vs 'bulbssaur')) is to put some regularising term to make bulbs
match to bulbasour
although clearly it will have a distance shorter than bulbssour
. Or maybe It's simple enough to say "always suggest the one with the shortest distance" and start computing distances only if input > 3 chars + no match from mongo.
And, and, I would computer the distance from the first-to-first... Who would write "sour" instead of "bulba".
@johartl So are you guys implementing Hamming distance algorithm after fetching all of the pokemon names Or should we implement it on ur API side?
@vivek-sethia It would be nice if you guys could take care that. This also seems more like a backend topic to me. But let me know if you need help :)
Agree: it's a backend issue :)
@sacdallago
Using simple Hamming distances, bulbs
and bulbssaur
both result in a distance of 1 when matched against bulbasaur
Quoting what you said,
simple distance (which would penalise length mismatch (aka 'bulbs' vs 'bulbssaur'))
Applying this, bulbs
would be penalized and hence results in a distance of 5, whereas bulbssaur
would have a distance of 1.
clearly it(
bulbs
) will have a distance shorter thanbulbssour
This implies that bulbs
should have a shorter distance, which is contrary to what was derived above. This is what got me confused.
Could you please tell me where I am going wrong?
@vivek-sethia i think it makes more sense to compute hamming distance only for strings of equal length. So if we compare two strings of different length (e.g bulbs vs. bulbasaur) one could take the first few characters only, which means stripping the last characters of the longer string (bulbasaur -> bulba) and then compute the distance
@jonas-he - I agree. But the comments above seemed a bit confusing to me too.
@swathi-ssunder my rationale is not the case you have a longer word than the one you try to match, but the other way around (although, now that you mention it, this is also a concern). I was only trying to explain a rationale, which I think hamming is lacking. Reading wiki:
What I think this implies, thus, is that bulbs
will have a lower chance than bulbssour
to be matched to bulbasour
, although, technically, it should have the same chance.
An idea might be to only consider strings of the same length, aka: bulbs
--> len: 5 --> consider first 5 chars of every name --> compute distance --> highest probability = match
If the length is longer, aka bulbasourd
, then whatever. It would be too computationally expensive to consider every error, which in this case you could address by: take the input and slice off the last 2-3 characters, perform hamming, and then of the overall (keep the string, -2 chars, -3 chars) select the best match.
In the app we would like to display a search bar which lets you query for Pokemon names. For this we will send a request to the
getPokemonByName
route and show the results in a list. Ideally the search bar already shows suggestions as-you-type. The current API, however, will only return results if the entire Pokemon name matches exactly. Would it be possible to also return matches if the search query only partially matches?Another feature request would be to correct small typos, e.g. if the user only misspelled the name by one character ("Balbasaur" instead of "Bulbasaur"). I don't know if that would be a lot of work. Maybe MongoDB already comes with a function for that?