PokemonGoers / PredictPokemon-2

In this project we will apply machine learning to establish TLN (Time, Location and Name) prediction in Pokemon Go - that is, where Pokemon will appear, at what date and time, and which Pokemon it will be.

big data dump #38

Closed jonas-he closed 8 years ago

jonas-he commented 8 years ago

may be of interest: http://pokemongohub.net/data-mining-500-000-pokemon-spawns-encounters/ original reddit thread: https://www.reddit.com/r/pokemongodev/comments/51pfvh/large_pokemon_spawn_dump/

MatthiasBaur commented 8 years ago

This sounds promising, thanks for pointing it out! We can use this to explore the learning algorithms before the data team provides more. The next step is to create a corresponding .arff file (see the sketch below).
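A minimal sketch of that conversion, assuming the dump is a CSV with `latitude`, `longitude`, `hourOfDay` and `pokemonId` columns (file and column names are hypothetical here):

```python
# Sketch: convert the spawn dump (assumed CSV layout, hypothetical column names)
# into a Weka-readable .arff file.
import csv

IN_CSV = "pokemon_spawns.csv"      # hypothetical file name
OUT_ARFF = "pokemon_spawns.arff"

with open(IN_CSV, newline="") as src, open(OUT_ARFF, "w") as dst:
    rows = list(csv.DictReader(src))
    # the class attribute must enumerate every Pokemon id that occurs in the dump
    pokemon_ids = sorted({r["pokemonId"] for r in rows}, key=int)

    dst.write("@RELATION pokemon_spawns\n\n")
    dst.write("@ATTRIBUTE latitude NUMERIC\n")
    dst.write("@ATTRIBUTE longitude NUMERIC\n")
    dst.write("@ATTRIBUTE hourOfDay NUMERIC\n")
    dst.write("@ATTRIBUTE pokemonId {%s}\n\n" % ",".join(pokemon_ids))

    dst.write("@DATA\n")
    for r in rows:
        dst.write("%s,%s,%s,%s\n" % (r["latitude"], r["longitude"],
                                     r["hourOfDay"], r["pokemonId"]))
```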

goldbergtatyana commented 8 years ago

@MatthiasBaur : Yes, the weather features are up, but the API is too slow to collect enough data for every instance (~1 query per second). @sacdallago has a suggestion for you to speed things up: make chunks of 250x250 km (or bigger) of the planet → take the weather info every 6h (or more) at the corners and the center of each square, and then either average the weather values (rain, sun, etc.) of the five points or select the most frequent one.

The S2 library can tell you what the corners and the center of a cell are; a rough sketch of this approach follows below.
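A sketch of that idea, assuming the Python `s2sphere` port of the S2 library (level-5 cells are roughly 250 km across) and a hypothetical `fetch_weather(lat, lng, timestamp)` stand-in for the actual weather API call:

```python
# Sketch: approximate the weather for a whole S2 cell by querying only its four
# corners and its center, then taking the most frequent condition.
# `fetch_weather` is a hypothetical stand-in for the real weather API.
from collections import Counter
import s2sphere

def cell_for(lat, lng, level=5):
    point = s2sphere.LatLng.from_degrees(lat, lng)
    return s2sphere.CellId.from_lat_lng(point).parent(level)

def cell_weather(cell_id, timestamp, fetch_weather):
    cell = s2sphere.Cell(cell_id)
    # center of the cell ...
    samples = [cell_id.to_lat_lng()]
    # ... plus its four corners
    samples += [s2sphere.LatLng.from_point(cell.get_vertex(k)) for k in range(4)]
    conditions = [
        fetch_weather(ll.lat().degrees, ll.lng().degrees, timestamp)
        for ll in samples
    ]
    # majority vote over the five samples ("rain", "sun", ...); numeric values
    # such as temperature could be averaged instead
    return Counter(conditions).most_common(1)[0][0]
```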

goldbergtatyana commented 8 years ago

btw, @MatthiasBaur @Aurel-Roci @bensLine @marwage @semioniy can you come to the Rostlab today (01.09.2016) or could we all get on a Skype call to go through the feature list and talk about the prediction strategy?

bensLine commented 8 years ago

@goldbergtatyana I can skype today anytime before 5pm

bensLine commented 8 years ago

@goldbergtatyana btw, here is a plot for the data dump (~600k entries) showing the pokemon distribution and the percentage of pokemons in the data set with more than n entries.

(attached plot: pokedump_witnessed_ntimes_20160921)
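For reference, a sketch of how such a coverage curve can be computed from the dump, assuming it has been loaded into a pandas DataFrame with a (hypothetical) `pokemonId` column:

```python
# Sketch: for each threshold n, compute the fraction of pokemon species in the
# dump with more than n recorded sightings.
import pandas as pd

def coverage_curve(df, max_n=100):
    counts = df["pokemonId"].value_counts()          # sightings per species
    thresholds = range(1, max_n + 1)
    return pd.Series(
        [(counts > n).mean() for n in thresholds],   # fraction of species kept
        index=thresholds,
        name="fraction_of_species_with_more_than_n_sightings",
    )
```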

goldbergtatyana commented 8 years ago

thanks @bensLine for the very nice plots! I see that if we ignore pokemons with less than 20 sightings, we lose only 5% of them (i.e. only 8 pokemons). We should go for it.
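A short sketch of that filtering step, again assuming a pandas DataFrame with a hypothetical `pokemonId` column and file name:

```python
# Sketch: drop all sightings of species seen fewer than 20 times,
# following the threshold discussed above. Names are hypothetical.
import pandas as pd

df = pd.read_csv("pokemon_spawns.csv")
counts = df["pokemonId"].value_counts()
keep = counts[counts >= 20].index
df_filtered = df[df["pokemonId"].isin(keep)]
```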

Yesterday your group suggested balancing the data set for training. I think it is a great idea! Check out Weka's SMOTE filter http://stackoverflow.com/questions/22632932/how-to-set-parameters-in-weka-to-balance-data-with-smote-filter which is designed for doing just that. Please use your subsample of 50K data points to see if SMOTE improves the performance. Let me know if you need help!
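The Weka filter from the link above is what we would use in the pipeline; purely to illustrate the oversampling idea, here is a minimal sketch with the `imbalanced-learn` package in Python (file and feature column names are hypothetical):

```python
# Sketch: oversample the minority pokemon classes with SMOTE so the training
# set is better balanced. imbalanced-learn is used here only to illustrate the
# idea; in the project the equivalent Weka SMOTE filter would be applied.
import pandas as pd
from imblearn.over_sampling import SMOTE

df = pd.read_csv("pokemon_spawns_50k.csv")           # hypothetical 50K subsample
X = df[["latitude", "longitude", "hourOfDay"]]       # hypothetical feature columns
y = df["pokemonId"]

# k_neighbors must be smaller than the size of the rarest class
X_balanced, y_balanced = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(pd.Series(y_balanced).value_counts())
```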

MatthiasBaur commented 8 years ago

The big data set was used for selecting the best possible classifier. It is also going to be uploaded (enriched by our features) to Kaggle (see #35). I will reference an issue with exhaustive information about the classifiers shortly.

MatthiasBaur commented 8 years ago

#62