PokemonGoers / PokeData

In this project you will scrape as much data as you can about the *actual* sightings of Pokemon. As it turns out, players all around the world have started reporting Pokemon sightings and are logging them in a central repository (i.e. a database). We want this data so we can train our machine learning models. You will of course need to come up with other data sources, not only for sightings but also for other relevant details that can later be used as features for our machine learning algorithm (see Project B). Additional features could be the air temperature at the time of the sighting, or proximity to water, buildings or parks. Consult a Pokemon Go expert if you have one around you, and come up with as many features as possible that describe the place, time and name of a sighted Pokemon. Another feature that you will implement is a Twitter listener: you will use the Twitter streaming API (https://dev.twitter.com/streaming/public) to listen on a specific topic (for example, the #foundPokemon hashtag). When a new tweet with that hashtag is written, an event will be fired in your application that checks the details of the tweet, e.g. location, user and timestamp. Additionally, you will try to parse formatted text from the tweets to construct a new “seen” record that will then be added to the database. Some of the attributes of the record will be the Pokemon's name, the location and the timestamp. Additional data sources (here is one: https://pkmngowiki.com/wiki/Pok%C3%A9mon) will also need to be integrated to give us more information about Pokemon, e.g. what they are, how they are related, what they can transform into, which attacks they can perform, etc.
Apache License 2.0

How the twitter pokemon go related tweets are being parsed #68

Closed (samitsv closed 8 years ago)

samitsv commented 8 years ago
sacdallago commented 8 years ago

@juanmirocks

juanmirocks commented 8 years ago

I would collect first a relatively reliable sample of 100 to 1000 tweets containing any of your suggested hashtags or keywords and also containing geolocation data. Based on that, you can manually study how those tweets are written and therefore design a better algorithm for the extraction of information. For example, it may well be that many of those tweets do not contain keywords such as 'caught' and just write down the pokemon's name.
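The sampling step suggested above could be sketched roughly as follows. This assumes tweets arrive as dicts shaped like Twitter's v1.1 tweet JSON (`entities.hashtags` and a GeoJSON `coordinates` point). The hashtag set and the names `is_sample_candidate` and `collect_sample` are made up for illustration and are not from the repository.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical hashtag set (lowercase, without '#'); swap in the real tracked terms.
TRACKED_HASHTAGS = {"pokemongo", "foundpokemon"}


def is_sample_candidate(tweet: Dict[str, Any]) -> bool:
    """True if the tweet carries a tracked hashtag and a point geolocation."""
    entities = tweet.get("entities") or {}
    hashtags = {tag.get("text", "").lower() for tag in entities.get("hashtags", [])}
    coords = tweet.get("coordinates")  # GeoJSON point: [longitude, latitude]
    has_geo = bool(coords) and len(coords.get("coordinates", [])) == 2
    return bool(hashtags & TRACKED_HASHTAGS) and has_geo


def collect_sample(tweets: Iterable[Dict[str, Any]], limit: int = 1000) -> List[Dict[str, Any]]:
    """Collect up to `limit` qualifying tweets for manual inspection."""
    sample: List[Dict[str, Any]] = []
    for tweet in tweets:
        if is_sample_candidate(tweet):
            sample.append(tweet)
            if len(sample) >= limit:
                break
    return sample
```

Dumping the collected sample to a file and reading through it by hand would then show which phrasings actually occur before any keyword list is fixed.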

I would also advise storing the tweets' images, if any. I manually checked a small sample and found that most of the information is often contained in the image.

samitsv commented 8 years ago

@juanmirocks thank you for your feedback. The above strategy was designed after looking at many tweets about catching or sighting Pokemon, so these keywords capture the notion of a Pokemon sighting in Pokemon Go. Of course some other synonyms could be added, but I don't think a tweet containing just a Pokemon's name is relevant as a sighting. Rather, tweets such as "I saw pikachu", "caught pikachu" or "was attacked by pikachu" are relevant, and they are covered by the keywords above. Regarding the images: unless we plan to add image processing here, we can't be sure an image contains sighting details unless the tweet itself is relevant, because someone can post images completely unrelated to the tweet.

phdowling commented 8 years ago

I'm playing around with your Twitter stream at the moment, and I would actually suggest tracking all Pokemon names first, and then checking whether the tweet is Pokemon Go related. I think you are currently missing a lot of relevant tweets.

I'm locally changing some stuff in the twitter module since we are dependent on a twitter stream as well, so I will probably just open a PR on this.
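A rough sketch of that two-stage idea: match Pokemon names first, then keep only tweets that also look Pokemon Go related. The name list is a tiny illustrative subset, and the relevance cues in `GO_CUES` are assumptions rather than the filter used in the PR.

```python
import re
from typing import List

# Illustrative subset only; a real list would cover every Pokemon name.
POKEMON_NAMES = ["Pikachu", "Squirtle", "Gastly", "Omanyte", "Tangela", "Exeggcute", "Jigglypuff"]
CANONICAL = {name.lower(): name for name in POKEMON_NAMES}

# Stage 1: any known Pokemon name as a whole word, case-insensitive.
NAME_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in POKEMON_NAMES) + r")\b", re.IGNORECASE
)

# Stage 2: hypothetical cues that the tweet is about Pokemon Go (assumption).
GO_CUES = re.compile(
    r"pok[eé]mon\s*go|a wild \w+ (?:has )?appeared|has appeared near", re.IGNORECASE
)


def mentioned_pokemon(text: str) -> List[str]:
    """Return the sorted, de-duplicated Pokemon names mentioned in the text."""
    return sorted({CANONICAL[m.group(1).lower()] for m in NAME_PATTERN.finditer(text)})


def filter_tweet(text: str) -> List[str]:
    """Return mentioned names if the tweet also looks Pokemon Go related, else []."""
    names = mentioned_pokemon(text)
    return names if names and GO_CUES.search(text) else []
```

Tracking names first widens the net, and the second stage is where precision gets tuned, so the cue list is the part worth revisiting against real samples.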

phdowling commented 8 years ago

Here's an example of just a few seconds of tweets with the new filter:

got tweet:  A wild Exeggcute appeared! It will be near OFFICE area until 5:56 PM. https://t.co/ssDVp1z4yf #Exeggcute #OFFICE #PokemonGo #NMK
got tweet:  A wild Squirtle has appeared! Available until 06:58:01  (13m 54s). https://t.co/qFhgZU9uYy
got tweet:  Gastly: A wild Gastly has appeared! Available until 04:58:25 (14m 15s). https://t.co/fm7f8VUmyV
got tweet:  Dropped down a lure for Pokemon GO at the #PGCHelsinki venue. Need to catch me a Jigglypuff...
got tweet:  A wild Tangela appeared! It will be near Sangenjaya Station until 6:54 AM. https://t.co/a4uWRe9gsQ #PokemonGo
got tweet:  A wild Squirtle appeared! It will be near Blaine Hill BBQ until 6:52 AM. https://t.co/n0nFIDtWxm
got tweet:  Let's have some fun? !  I'm there-   https://t.co/VScPyKKJwG https://t.co/WPkudNuMh5
got tweet:  Omanyte has appeared near 3864 Wilson Ave, 48906! Available until 06:59:13 (15m 0s). https://t.co/ItUm9hn24x
got tweet:  This is the worst thing imaginable for #PokemonGO Players out there. https://t.co/0Y6hi7gxIm
got tweet:  A wild Pikachu has appeared! Available until 12:54:03  (9m 45s). https://t.co/tlXg37ZEae

As you can see, a lot of bots already tweet Pokemon sightings, but they only geo-tag them in encoded URLs. Maybe we can still find a good way to leverage this, though.
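Since the bot tweets above follow a few fixed templates, one way to leverage them is a regular expression that pulls out the name, the optional "near" text and the expiry time. This is only a sketch fitted to the sample tweets shown here; the `Sighting` record and `parse_sighting` are hypothetical names, and the time is kept as written because the tweets give no time zone.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sighting:
    name: str
    near: Optional[str]  # free-text place, if the bot included one
    until: str           # expiry time exactly as written in the tweet


# Fitted to the sample templates only; other bots may phrase things differently.
SIGHTING = re.compile(
    r"(?:A wild )?(?P<name>[A-Z][\w'.-]*) (?:has )?appeared"
    r"(?: near (?P<near_a>[^!]+))?!"
    r" (?:It will be near (?P<near_b>.+?) until|Available until)"
    r" (?P<until>\d{1,2}:\d{2}(?::\d{2})?(?: [AP]M)?)"
)


def parse_sighting(text: str) -> Optional[Sighting]:
    """Return a Sighting if the tweet matches a known bot template, else None."""
    match = SIGHTING.search(text)
    if match is None:
        return None
    near = match.group("near_a") or match.group("near_b")
    return Sighting(match.group("name"), near.strip() if near else None, match.group("until"))
```

Tweets that return `None` here (for example the Jigglypuff lure tweet) would still need the keyword-based handling discussed earlier in the thread.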

gyachdav commented 8 years ago

Seems like you're getting the lon/lat on the gmap. It should be straightforward to use the gmap API to pull out those coordinates.
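If the shortened links expand to map URLs that carry the coordinates as a plain `lat,lng` pair, the pair can be read straight out of the expanded URL without calling any API. The thread does not show what the expanded URLs look like, so the URL shape handled here is an assumption, and `coords_from_url` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple
from urllib.parse import unquote

# A decimal "lat,lng" pair that does not start in the middle of a longer number.
COORD_PAIR = re.compile(r"(?<![\d.])(-?\d{1,2}\.\d+),\s*(-?\d{1,3}\.\d+)")


def coords_from_url(expanded_url: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lng) from the first plausible coordinate pair in the URL, else None."""
    match = COORD_PAIR.search(unquote(expanded_url))  # unquote turns %2C back into ','
    if match is None:
        return None
    lat, lng = float(match.group(1)), float(match.group(2))
    if -90 <= lat <= 90 and -180 <= lng <= 180:
        return lat, lng
    return None
```

The `t.co` link itself never contains coordinates, so the expanded URL has to be obtained first, for example from the `expanded_url` field of the tweet's URL entities.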

sacdallago commented 8 years ago

@gyachdav not so straightforward but doable! If you are at 31 something ave, you might be listed as being at 37 something ave, but that's good enough. It's called reverse geocoding; someone implemented it yesterday somewhere... BAM: https://github.com/PokemonGoers/PredictPokemon-2/pull/19

sacdallago commented 8 years ago

Oh, and if on the other hand the problem is encoding an address in lat/lng, then it's geo-coding:

https://maps.googleapis.com/maps/api/geocode/json?address={query}

:D man, working for a company without a cent to spend on fancy APIs to query locations does pay off at some stage :D :D
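For reference, a small sketch of building both kinds of request URL against the geocoding endpoint quoted above: `address` for geocoding and `latlng` for reverse geocoding. It only constructs the URLs and makes no network call. The helper names are made up, and the optional `key` parameter is included on the assumption that an API key is needed.

```python
from typing import Optional
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address: str, api_key: Optional[str] = None) -> str:
    """URL that asks for the coordinates of a free-text address."""
    params = {"address": address}
    if api_key:
        params["key"] = api_key
    return f"{GEOCODE_ENDPOINT}?{urlencode(params)}"


def reverse_geocode_url(lat: float, lng: float, api_key: Optional[str] = None) -> str:
    """URL that asks for the address nearest to a lat/lng pair."""
    params = {"latlng": f"{lat},{lng}"}
    if api_key:
        params["key"] = api_key
    return f"{GEOCODE_ENDPOINT}?{urlencode(params)}"
```

`urlencode` takes care of escaping spaces and commas, so an address copied straight from a tweet can be passed in unchanged.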

samitsv commented 8 years ago

@sacdallago @gyachdav @phdowling I have implemented it this way here: https://github.com/PokemonGoers/PokeData/pull/127