PokemonGoers / PokeData

In this project you will scrape as much data as you can get about *actual* Pokemon sightings. As it turns out, players all around the world have started reporting sightings and logging them in a central repository (i.e. a database). We want this data so we can train our machine learning models. You will of course also need to come up with other data sources, not only for sightings but also for other relevant details that can later be used as features for our machine learning algorithm (see Project B). Additional features could be the air temperature at the timestamp of the sighting, or proximity to water, buildings or parks. Consult a Pokemon Go expert if you have one around, and come up with as many features as possible that describe the place, time and name of a sighted Pokemon. Another feature that you will implement is a Twitter listener: you will use the Twitter streaming API (https://dev.twitter.com/streaming/public) to listen on a specific topic (for example, the #foundPokemon hashtag). When a new tweet with that hashtag is posted, an event will fire in your application that checks the details of the tweet, e.g. location, user and timestamp. Additionally, you will try to parse formatted text from the tweets to construct a new “seen” record that will then be added to the database. Attributes of the record will include the Pokemon's name, location and timestamp. Additional data sources (here is one: https://pkmngowiki.com/wiki/Pok%C3%A9mon) will also need to be integrated to give us more information about Pokemon, e.g. what they are, how they are related, what they can transform into and which attacks they can perform.
Apache License 2.0

Filter for identical sightings #175

Closed MatthiasBaur closed 8 years ago

MatthiasBaur commented 8 years ago

Hi! Do you have a filter implemented to weed out identical datapoints? Team Predict

jonas-he commented 8 years ago

@MatthiasBaur In the current data provided through the demo there could be duplicates. But for the data which was gathered on the rostlab machines there should be no duplicates.

MatthiasBaur commented 8 years ago

Thanks for the info.

goldbergtatyana commented 7 years ago

Is a sighting of the same Pokemon at the same location and at the same time, but reported by two different people, considered a duplicate, @jonas-he?

vivek-sethia commented 7 years ago

@goldbergtatyana - If they are from the same data source, they are considered duplicates and only the first one is kept. If they are from different data sources, both are retained in the db.

goldbergtatyana commented 7 years ago

Thanks @vivek-sethia, so we might have duplicates in that sense. What about sightings of the same Pokemon at the same time, but at two different locations (a few meters apart)? Those would also be duplicate entries that would probably end up in the database?

vivek-sethia commented 7 years ago

@goldbergtatyana You are right, those will also be contained in the database, since exact locations (lat, long) are compared.

goldbergtatyana commented 7 years ago

FYI @PokemonGoers/predictpokemon-2

jonas-he commented 7 years ago

@vivek-sethia I thought that the data source does not matter, only place, time and pokemonId, as this is what is in /app/models/pokemonSighting.js: pokemonSighting.index({"appearedOn": -1, "pokemonId": 1, "location": 1}, {"unique": true});
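
To illustrate what a unique index on (appearedOn, pokemonId, location) implies, here is a minimal sketch in plain JavaScript. It does not use the project's database code: the helper names (`sightingKey`, `dedupeSightings`), the `source` field and the GeoJSON-style `location` shape are assumptions made for the example. Note also that a real unique index rejects the second insert, whereas this sketch simply filters an in-memory list.

```javascript
// Sketch only: emulate a unique compound index on
// (appearedOn, pokemonId, location) with a Set of composite keys.
// The record shape below is an assumption, not the project's schema.
function sightingKey(s) {
  // Exact coordinates are part of the key, so any difference in
  // longitude or latitude produces a different key.
  const [lng, lat] = s.location.coordinates;
  return [new Date(s.appearedOn).getTime(), s.pokemonId, lng, lat].join("|");
}

function dedupeSightings(sightings) {
  const seen = new Set();
  const kept = [];
  for (const s of sightings) {
    const key = sightingKey(s);
    if (seen.has(key)) continue; // same time, Pokemon and place: keep the first
    seen.add(key);
    kept.push(s);
  }
  return kept;
}

const sample = [
  { source: "A", pokemonId: 16, appearedOn: "2016-09-01T12:00:00Z",
    location: { type: "Point", coordinates: [11.5755, 48.1372] } },
  // Same sighting reported through a different (hypothetical) source.
  { source: "B", pokemonId: 16, appearedOn: "2016-09-01T12:00:00Z",
    location: { type: "Point", coordinates: [11.5755, 48.1372] } },
  // Same time and Pokemon, but a slightly different longitude.
  { source: "B", pokemonId: 16, appearedOn: "2016-09-01T12:00:00Z",
    location: { type: "Point", coordinates: [11.57551, 48.1372] } },
];

const unique = dedupeSightings(sample);
console.log(unique.length); // 2
```

Because the source is not part of the key, the second record is dropped even though it comes from a different source, while the third survives because its coordinates differ.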

vivek-sethia commented 7 years ago

@goldbergtatyana Got a bit confused. It only checks time, location and pokemon id, so there won't be duplicates from different sources. Thanks for the correction @jonas-he. However, the current data from mlab does not have this index, so there will be duplicates. Once we have the data coming from the Rostlab server, it should be fine.

goldbergtatyana commented 7 years ago

Ok, so if time and pokemon_id are the same for two entries but their locations differ slightly (say, 1 meter apart), then these are the other kind of duplicates that we will have. That is ok, though!
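
If such near-duplicates ever needed to be detected, one possible approach is a distance threshold instead of exact coordinate equality. The sketch below is not part of the project: the function names, the flat `lat`/`lng` fields and the 5 meter default threshold are all assumptions chosen for illustration.

```javascript
// Sketch only: flag two sightings as near-duplicates when they share
// pokemonId and timestamp and lie within a small distance of each other.
// Field names and the threshold are assumptions for this example.
const EARTH_RADIUS_M = 6371000; // mean Earth radius in meters

function haversineMeters(lat1, lng1, lat2, lng2) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

function isNearDuplicate(a, b, maxMeters = 5) {
  return (
    a.pokemonId === b.pokemonId &&
    new Date(a.appearedOn).getTime() === new Date(b.appearedOn).getTime() &&
    haversineMeters(a.lat, a.lng, b.lat, b.lng) <= maxMeters
  );
}

const first = { pokemonId: 16, appearedOn: "2016-09-01T12:00:00Z", lat: 48.1372, lng: 11.5755 };
// About 1 meter further north.
const close = { pokemonId: 16, appearedOn: "2016-09-01T12:00:00Z", lat: 48.13721, lng: 11.5755 };
// About 111 meters further north.
const far = { pokemonId: 16, appearedOn: "2016-09-01T12:00:00Z", lat: 48.1382, lng: 11.5755 };

console.log(isNearDuplicate(first, close)); // true
console.log(isNearDuplicate(first, far));   // false
```

A threshold like this trades false negatives for false positives: two genuinely different spawns of the same Pokemon a few meters apart would be merged.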

The @PokemonGoers/predictpokemon-2 team will randomly draw 10K sightings from the data of the last few days. With this approach it is very unlikely that we will catch duplicates.
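
A random draw like this can be sketched as sampling without replacement via a partial Fisher-Yates shuffle. This is only an illustration under assumptions: `sampleWithoutReplacement` is a hypothetical helper, and the array of integers stands in for the real sighting records.

```javascript
// Sketch only: draw k items uniformly at random, without replacement,
// using a partial Fisher-Yates shuffle on a copy of the input.
function sampleWithoutReplacement(items, k, random = Math.random) {
  const pool = items.slice(); // leave the caller's array untouched
  const n = Math.min(k, pool.length);
  for (let i = 0; i < n; i++) {
    // Pick a random index from the not-yet-chosen tail [i, pool.length).
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

// Stand-in for recent sightings: 100000 placeholder ids (an assumption).
const recentIds = Array.from({ length: 100000 }, (_, i) => i);
const drawn = sampleWithoutReplacement(recentIds, 10000);
console.log(drawn.length); // 10000
```

Sampling without replacement guarantees that no single record is drawn twice. It does not remove duplicates stored as separate records; it only makes it less likely that both copies end up in the same sample.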