PokemonGoers / PokeData

In this project you will scrape as much data as you can about *actual* sightings of Pokémon. As it turns out, players all around the world have started reporting sightings of Pokémon and logging them in a central repository (i.e. a database). We want to get this data so we can train our machine learning models. You will of course need to come up with other data sources, not only for sightings but also for other relevant details that can be used later on as features for our machine learning algorithm (see Project B). Additional features could be the air temperature at the timestamp of a sighting, or whether the location is close to water, buildings or parks. Consult a Pokémon Go expert if you have one around you and come up with as many features as possible that describe the place, time and name of a sighted Pokémon.

Another feature that you will implement is a Twitter listener: you will use the Twitter streaming API (https://dev.twitter.com/streaming/public) to listen on a specific topic (for example, the #foundPokemon hashtag). When a new tweet with that hashtag is posted, an event will be fired in your application that checks the details of the tweet, e.g. location, user and timestamp. Additionally, you will try to parse formatted text from the tweet to construct a new "seen" record, which will then be added to the database. Some of the attributes of the record will be the Pokémon's name, location and timestamp; a sketch of such a listener is given below.

Additional data sources (here is one: https://pkmngowiki.com/wiki/Pok%C3%A9mon) will also need to be integrated to give us more information about Pokémon, e.g. what they are, how they relate to each other, what they can evolve into, and which attacks they can perform.
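As a rough illustration, a listener along these lines could be built with the `twit` npm package (an assumption; any streaming client would do). The expected tweet format and the `saveSighting` persistence helper are hypothetical:

```javascript
var Twit = require('twit');

// Credentials are placeholders; read them from the environment.
var twitter = new Twit({
  consumer_key: process.env.TWITTER_CONSUMER_KEY,
  consumer_secret: process.env.TWITTER_CONSUMER_SECRET,
  access_token: process.env.TWITTER_ACCESS_TOKEN,
  access_token_secret: process.env.TWITTER_ACCESS_TOKEN_SECRET
});

// Listen on the public stream, filtered to the #foundPokemon hashtag.
var stream = twitter.stream('statuses/filter', { track: '#foundPokemon' });

stream.on('tweet', function (tweet) {
  // Assume a formatted text such as "Pikachu #foundPokemon" plus geo data.
  var match = tweet.text.match(/^(\w+)\s+#foundPokemon/i);
  if (!match || !tweet.coordinates) {
    return; // skip tweets we cannot parse into a sighting
  }
  // Construct the "seen" record: name, location and timestamp.
  saveSighting({
    pokemonName: match[1],
    location: tweet.coordinates, // GeoJSON Point, [longitude, latitude]
    timestamp: new Date(tweet.created_at),
    user: tweet.user.screen_name,
    source: 'twitter'
  });
});
```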

How the additional features should be stored #130

Closed jonas-he closed 8 years ago

jonas-he commented 8 years ago

I have implemented two additional features: relativeTime (which gives the time of day at a given location for a given UTC time) and environment (which classifies the environment at a given location according to this scheme: http://glcf.umd.edu/data/lc/). For that I have written two functions in /app/services/common.js. Now the question is whether we want to expand our database schema and compute this additional information for every sighting upon insertion into the database, OR compute it upon calling our API, if needed. My opinion would be to take the second approach (sketched below), because the first approach would add redundancy to the DB. However, I heard that the machine learners would probably get a DB dump and won't "squeeze out" our API, so maybe the first approach is better?
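For illustration, the second approach could look roughly like this; the Express route, the `Sighting` model and the exact signatures of `relativeTime`/`environment` are assumptions, not the actual project code:

```javascript
var common = require('../services/common');

// Option two: derive the extra features when the API is called,
// so nothing redundant is persisted in the DB.
app.get('/api/sightings/:id', function (req, res) {
  Sighting.findById(req.params.id, function (err, sighting) {
    if (err || !sighting) {
      return res.status(404).json({ message: 'sighting not found' });
    }
    res.json({
      pokemonId: sighting.pokemonId,
      location: sighting.location,
      timestamp: sighting.timestamp,
      // computed on the fly from the stored fields
      relativeTime: common.relativeTime(sighting.location, sighting.timestamp),
      environment: common.environment(sighting.location)
    });
  });
});
```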

bensLine commented 8 years ago

@jonas-he nice job, thanks! :) Looking forward to using the new features for our data!

semioniy commented 8 years ago

@jonas-he the environment feature is really cool. Thanks!

jonas-he commented 8 years ago

@sacdallago any thoughts on this?

sacdallago commented 8 years ago

Sorry. I read through this the first time but somehow forgot to answer. I'm getting old.

The idea is great, but I worry that this is out of scope.

If you want to implement it, definitely go with option two. But since this data is necessary only for the predictions, it should be encapsulated in their projects (as in: they associate a lat/long with terrain features internally, maybe by having a lookup table / file-based DB, without performing further API calls; see the sketch at the end of this comment).

By implementing this, your API could potentially end up being used for a completely different purpose, which is not nice :)
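A rough sketch of that encapsulation, assuming a precomputed file-based lookup table inside the prediction project (file name, grid resolution and class codes are made up):

```javascript
var fs = require('fs');

// Precomputed table keyed by a coarse lat/lng grid, e.g.
// { "48.26,11.67": 13, ... } with land cover classes from glcf.umd.edu.
var terrainTable = JSON.parse(fs.readFileSync('terrain.json', 'utf8'));

function terrainFeature(lat, lng) {
  // Snap the coordinates to the grid used when the table was built.
  var key = lat.toFixed(2) + ',' + lng.toFixed(2);
  // Sentinel value for cells that were never sampled.
  return terrainTable.hasOwnProperty(key) ? terrainTable[key] : -1;
}
```

This keeps the terrain lookups local to the prediction code and avoids any further API calls.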

sacdallago commented 8 years ago

Oh, the above applies to the environment data, but for the relative time I believe you can nest it in the original objects when you return them (so, option 1); see the example below.
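Concretely, a returned sighting could then look like this (field names and values are assumptions):

```javascript
// A sighting as returned by the API, with relativeTime nested in the object:
{
  pokemonId: 25,
  location: { type: 'Point', coordinates: [11.67, 48.26] },
  timestamp: '2016-08-20T18:42:00Z',
  relativeTime: 'evening' // local time of day derived from location + UTC time
}
```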

jonas-he commented 8 years ago

Relative time is implemented (see #158), and the environment feature code has been adapted by the prediction group.