PokemonGoers / PokeData

In this project you will scrape as much data as you can get about the *actual* sightings of Pokemons. As it turns out, players all around the world started reporting sightings of Pokemons and are logging them into a central repository (i.e. a database). We want to get this data so we can train our machine learning models. You will of course need to come up with other data sources not only for sightings but also for other relevant details that can be used later on as features for our machine learning algorithm (see Project B). Additional features could be air temperature during the given timestamp of sighting, location close to water, buildings or parks. Consult with Pokemon Go expert if you have such around you and come up with as many features as possible that describe a place, time and name of a sighted Pokemon. Another feature that you will implement is a twitter listener: You will use the twitter streaming API (https://dev.twitter.com/streaming/public) to listen on a specific topic (for example, the #foundPokemon hashtag). When a new tweet with that hashtag is written, an event will be fired in your application checking the details of the tweet, e.g. location, user, time stamp. Additionally, you will try to parse formatted text from the tweets to construct a new “seen” record that consequently will be added to the database. Some of the attributes of the record will be the Pokemon's name, location and the time stamp. Additional data sources (here is one: https://pkmngowiki.com/wiki/Pok%C3%A9mon) will also need to be integrated to give us more information about Pokemons e.g. what they are, what’s their relationship, what they can transform into, which attacks they can perform etc.
Apache License 2.0
9 stars 6 forks source link

getPokemonByName: Partial search #136

Closed johartl closed 8 years ago

johartl commented 8 years ago

In the app we would like to display a search bar which lets you query for Pokemon names. For this we will send a request to the getPokemonByName route and show the results in a list. Ideally the search bar already shows suggestions as-you-type. The current API, however, will only return results if the entire Pokemon name matches exactly. Would it be possible to also return matches if the search query only partially matches?

"bulbasaur" -> [Bulbasaur]
"bulba" -> [Bulbasaur]
"basa" -> [Bulbasaur]

Another feature request would be to correct small typos, e.g. if the user only misspelled the name by one character ("Balbasaur" instead of "Bulbasaur"). I don't know if that would be a lot of work. Maybe MongoDB already comes with a function for that?

swathi-ssunder commented 8 years ago

@johartl - The first requirement is quite straight-forward, and can be done using regular expressions in MongoDB queries. I tried a sample query and it works.

The second one, regarding partial/fuzzy string match, however seems to need something like ElasticSearch. I am looking into other alternatives of achieving this, but just in case, would like to know if it is fine to use this. @sacdallago ??

phdowling commented 8 years ago

If you're only suggesting for like ~150 pokemon, then I suggest you forget about using elastic search - just compute the hamming distances in memory, or think of a simple matching algo. ES is a pretty big dependency!

On Sep 10, 2016 10:38 PM, "Swathi S Sunder" notifications@github.com wrote:

@johartl https://github.com/johartl - The first requirement is quite straight-forward, and can be done using regular expressions in MongoDB queries. I tried a sample query and it works.

The second one, regarding partial/fuzzy string match, however seems to need something like ElasticSearch https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html . I am looking into other alternatives of achieving this, but just in case, would like to know if it is fine to use this. @sacdallago https://github.com/sacdallago ??

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/PokemonGoers/PokeData/issues/136#issuecomment-246137967, or mute the thread https://github.com/notifications/unsubscribe-auth/AA1hd9Ih7871clExRrCTMUTjt4HkQZ1uks5qoxU0gaJpZM4J5xgI .

johartl commented 8 years ago

I guess we could as well just retrieve all Pokemon once and do the search on the mobile device. No need to call the API at all?

sacdallago commented 8 years ago

Agree with @phdowling , ES is a bit too much. I know that mongo doesn't have an out of the box solution for it, and running a brute force alg with substitutions for each character is pretty demanding. But the hamming distance idea sounds like a good one, especially because we only have 151 names and computing it against this list should be quick. An additional thing to keep in mind for the simple distance (which would penalise length mismatch (aka 'bulbs' vs 'bulbssaur')) is to put some regularising term to make bulbs match to bulbasour although clearly it will have a distance shorter than bulbssour. Or maybe It's simple enough to say "always suggest the one with the shortest distance" and start computing distances only if input > 3 chars + no match from mongo.

And, and, I would computer the distance from the first-to-first... Who would write "sour" instead of "bulba".

vivek-sethia commented 8 years ago

@johartl So are you guys implementing Hamming distance algorithm after fetching all of the pokemon names Or should we implement it on ur API side?

johartl commented 8 years ago

@vivek-sethia It would be nice if you guys could take care that. This also seems more like a backend topic to me. But let me know if you need help :)

sacdallago commented 8 years ago

Agree: it's a backend issue :)

vivek-sethia commented 8 years ago

@sacdallago

Using simple Hamming distances, bulbs and bulbssaur both result in a distance of 1 when matched against bulbasaur

Quoting what you said,

simple distance (which would penalise length mismatch (aka 'bulbs' vs 'bulbssaur'))

Applying this, bulbs would be penalized and hence results in a distance of 5, whereas bulbssaur would have a distance of 1.

clearly it(bulbs) will have a distance shorter than bulbssour

This implies that bulbs should have a shorter distance, which is contrary to what was derived above. This is what got me confused.

Could you please tell me where I am going wrong?

jonas-he commented 8 years ago

@vivek-sethia i think it makes more sense to compute hamming distance only for strings of equal length. So if we compare two strings of different length (e.g bulbs vs. bulbasaur) one could take the first few characters only, which means stripping the last characters of the longer string (bulbasaur -> bulba) and then compute the distance

swathi-ssunder commented 8 years ago

@jonas-he - I agree. But the comments above seemed a bit confusing to me too.

sacdallago commented 8 years ago

@swathi-ssunder my rationale is not the case you have a longer word than the one you try to match, but the other way around (although, now that you mention it, this is also a concern). I was only trying to explain a rationale, which I think hamming is lacking. Reading wiki:

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.

What I think this implies, thus, is that bulbs will have a lower chance than bulbssour to be matched to bulbasour, although, technically, it should have the same chance.

An idea might be to only consider strings of the same length, aka: bulbs --> len: 5 --> consider first 5 chars of every name --> compute distance --> highest probability = match

If the length is longer, aka bulbasourd, then whatever. It would be too computationally expensive to consider every error, which in this case you could address by: take the input and slice off the last 2-3 characters, perform hamming, and then of the overall (keep the string, -2 chars, -3 chars) select the best match.