PokemonGoers / PredictPokemon-2

In this project we will apply machine learning to establish TLN (Time, Location and Name: where Pokemon will appear, at what date and time, and which Pokemon it will be) prediction in Pokemon Go.
Apache License 2.0

Report Weka results from dummy data #9

Closed Aurel-Roci closed 8 years ago

bensLine commented 8 years ago

Hey @goldbergtatyana, the plain dummy data is not really helpful, as mentioned before. Features like id, trainerName, userId and deviceId are either unique or have almost always the same value, and in general I think they have no influence. Furthermore, upvote is always 1 and downvote is always 0 in the data, so they are also useless.

This leaves us with created, latitude, longitude and pokemonId. We use pokemonId as the class, since right now we are evaluating the classifier that predicts which Pokemon will appear for a given location and time. When I train a classifier with this data the results are very bad, and parameter tuning does not change that, as the features are not good in their raw format. Therefore we should first replace latitude and longitude with S2 cells (#5) and extract time features (#8) from the data. With S2 we can group nearby location points into one cell, and from the time we can extract features like day or night.
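A minimal sketch of the time-feature idea, covering only the time part (S2 cells would need a geometry library and are left out here). It assumes that `created` is an ISO 8601 timestamp in UTC and that "night" means before 06:00 or from 20:00 on; the function name, the feature names and the cutoffs are illustrative assumptions, not taken from the project.

```python
from datetime import datetime, timezone


def time_features(created_iso: str) -> dict:
    """Derive coarse time features from an ISO 8601 timestamp.

    Assumption: the timestamp is in UTC (a trailing 'Z' is accepted).
    A real day/night feature would need the local time of the sighting.
    """
    ts = datetime.fromisoformat(created_iso.replace("Z", "+00:00"))
    ts = ts.astimezone(timezone.utc)
    hour = ts.hour
    return {
        "hour": hour,
        "weekday": ts.weekday(),             # 0 = Monday ... 6 = Sunday
        "is_night": hour < 6 or hour >= 20,  # fixed cutoffs, an assumption
    }


features = time_features("2016-09-04T21:30:00Z")
print(features)
```

Fixed cutoffs keep the sketch simple; deriving day or night from sunrise and sunset at the sighting's location would probably be closer to what actually matters.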

However, another big issue, independent of dummy or real data, is the distribution of the Pokemon entries.

[Image: poke_distribution]

We have an unbalanced data set, since not all Pokemon appear equally often, e.g. Rattata might have 200 samples whereas Zapdos has only 1. As a result, we have to use weighted classes, which do not seem to be supported by the SMO classifier in Weka, but libSVM can be used instead. How would you calculate the class weights in this case?
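One common heuristic for the class-weight question is inverse class frequency, where each class gets the weight `n_samples / (n_classes * n_samples_in_class)`. The sketch below only illustrates that heuristic with the Rattata/Zapdos numbers from the example above; the function name is made up, and it is not necessarily what was used in this project.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get large weights, frequent classes get small ones.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Illustrative data only: 200 Rattata sightings and a single Zapdos.
labels = ["Rattata"] * 200 + ["Zapdos"]
weights = inverse_frequency_weights(labels)
print(weights)
```

With classes this unbalanced the weight of a single-sample class becomes very large, so capping the weights or dropping very rare classes may be worth considering.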

With the default libSVM setup I get these results:

TP Rate 0.358
FP Rate 0.074
Precision 0.420
Recall 0.358
F-Measure 0.333
MCC 0.308
ROC Area 0.642
PRC Area 0.193

Changing parameters or class weights has not yet led to any real improvement, but I still have to test more...

goldbergtatyana commented 8 years ago

nah, I suggest that you shout loud when you have an arff file from the dummy data and then we do the prediction model development together. Though super interesting, it is a tricky process, so it is very good to have several watchful minds involved in it in parallel!

Btw, what is the current bottleneck with the arff file? How can I help?

bensLine commented 8 years ago

@goldbergtatyana @goldbergtatyana @goldbergtatyana ! :P the dummy .arff file is already on dev https://github.com/PokemonGoers/PredictPokemon-2/blob/develop/arff/dummy.arff

goldbergtatyana commented 8 years ago

ahh!! this is loud, thanks 👍

Good job, @bensLine, for trying out Weka with the file and also for the analysis of the features. Really very impressive!!!

Now let's start from the beginning and very small:

Does it all make sense to you? Just please shout out loud again :D if anything is unclear or you have suggestions/new ideas and comments, which are always always welcome! 😄

bensLine commented 8 years ago

Alright, so we'll build one-vs-all classifiers and compare their results in the end for our predictions?

How we produce the plots is up to us, I guess, @goldbergtatyana? E.g. with Python, or do we have to use JavaScript?

goldbergtatyana commented 8 years ago

For now, yes, we start with a simple binary classifier, and yes, of course it is up to you how you create the plots. They are for our internal evaluations only.

bensLine commented 8 years ago

@goldbergtatyana here are the results for the first two points of your comment. The first diagram is for the dummy data set, which consists of about 600 entries.

[Image: dummy_witnessed_ntimes_20160904]

However, the results of the second diagram should be more interesting. They are for the API data, which was retrieved today and has about 2500 entries.

[Image: apidata_witnessed_ntimes_20160904]

It shows that a threshold of 5 would include ~60% of the Pokemon, 10 -> 45%, 15 -> 33%, 20 -> 30%. What do you think about that?
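The threshold percentages above could be computed along the following lines. This is a sketch under the assumption that "include" means the share of observed Pokemon species with at least `t` sightings; the function name and the sample IDs are made up for illustration and are not the project's data.

```python
from collections import Counter


def coverage_by_threshold(pokemon_ids, thresholds):
    """Share of observed species that were witnessed at least t times."""
    counts = Counter(pokemon_ids)
    n_species = len(counts)
    return {
        t: sum(1 for c in counts.values() if c >= t) / n_species
        for t in thresholds
    }


# Made-up sightings: four species seen 10, 5, 3 and 1 times.
sightings = [16] * 10 + [19] * 5 + [1] * 3 + [145]
print(coverage_by_threshold(sightings, [5, 10]))
```

If the percentages are meant relative to all Pokemon rather than only the observed ones, `n_species` would be replaced by the total number of species.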

@marwage just created the .arff file for Pidgey (#20), as it is the Pokemon which appears the most in the API and dummy data (ID 16). The file is available at dummy_pidgey.arff
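For readers unfamiliar with the format, a one-vs-rest ARFF file for a single Pokemon could be generated roughly as below. This is a hypothetical sketch: the attribute set, the relation name and the `yes`/`no` class labels are assumptions and may differ from the actual dummy_pidgey.arff.

```python
def to_binary_arff(rows, target_id, relation="pidgey_vs_rest"):
    """Render (latitude, longitude, pokemonId) rows as a binary-class ARFF string.

    Rows whose pokemonId equals target_id are labelled 'yes', all others 'no'.
    """
    lines = [
        f"@RELATION {relation}",
        "",
        "@ATTRIBUTE latitude NUMERIC",
        "@ATTRIBUTE longitude NUMERIC",
        "@ATTRIBUTE class {yes,no}",
        "",
        "@DATA",
    ]
    for lat, lon, pokemon_id in rows:
        label = "yes" if pokemon_id == target_id else "no"
        lines.append(f"{lat},{lon},{label}")
    return "\n".join(lines) + "\n"


# Made-up coordinates; 16 is Pidgey's ID as mentioned above.
rows = [(48.15, 11.58, 16), (48.16, 11.57, 19)]
print(to_binary_arff(rows, target_id=16))
```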

goldbergtatyana commented 8 years ago

Thanks @bensLine for the nice plots. The first thing that catches my eye is the super limited number of data points (2500). Is this the data that was collected for just one day? If so, then this is a very small time interval we are looking at. Best would be to go back one week or even one month (which is another aspect to optimize for best results). Let's start with an interval of one month (~75,000 data points) and from there we'll look at smaller intervals.

Next, the arff file. Looks great, good job! We can optimize it, however, even further. Let's get other features incorporated in it first and then we'll tackle the optimization! :)

bensLine commented 8 years ago

Thanks @goldbergtatyana, yes, that's true, we also noticed the limited amount of data. @semioniy has already contacted team A (https://github.com/PokemonGoers/PokeData/issues/71#issuecomment-244663781) to ask whether the GET request is limited to returning only ~2500 items.

Cool, then we'll start adding the features :)