Report Weka results from dummy data

bensLine commented 8 years ago

Hey @goldbergtatyana, as mentioned before, the plain dummy data is not really helpful. Features like id, trainerName, userId, deviceId are either unique or almost always have the same value, so I think they have no influence in general. Furthermore, upvote is always 1 and downvote is always 0 in the data, so they are useless as well.

This leaves us with created, latitude, longitude and pokemonId. We use pokemonId as the class, since we are currently evaluating the classifier that predicts which Pokemon will appear for a given location and time. When I train a classifier with this data the results are very bad, and parameter tuning does not change that, as the features are not useful in their raw format. Therefore we should first replace latitude and longitude with S2 cells (#5) and extract time features (#8) from the data. With S2 we can group nearby location points into one cell, and from the time we can extract features like day or night.
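To make the time-feature idea concrete, here is a minimal Python sketch. It assumes that created is a Unix timestamp in seconds (the thread does not state the format), and the function name and the 06:00 to 18:00 UTC "day" window are purely illustrative choices. The S2 part is left out, since it needs a third-party library.

```python
from datetime import datetime, timezone


def time_features(created_epoch_s):
    """Derive coarse time features from a sighting timestamp.

    Assumption: `created_epoch_s` is seconds since the Unix epoch (UTC).
    """
    ts = datetime.fromtimestamp(created_epoch_s, tz=timezone.utc)
    hour = ts.hour
    return {
        "hour": hour,
        "weekday": ts.weekday(),  # 0 = Monday, 6 = Sunday
        # Crude assumption: 06:00-17:59 UTC counts as day.
        # A real feature should use the local time of the sighting.
        "is_day": 6 <= hour < 18,
    }


example = time_features(12 * 3600)  # 1970-01-01 12:00 UTC
```

A nominal feature like is_day is easier for most classifiers to use than the raw timestamp, which is different for nearly every sighting.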

However, another big issue, independent of dummy or real data, is the distribution of the Pokemon entries. [Image: poke_distribution] We have an unbalanced data set since not all Pokemon appear equally often, e.g. Rattata might have 200 samples whereas Zapdos has only 1. As a result, we have to use weighted classes, which do not seem to be supported by the SMO classifier in Weka, but libSVM can be used instead. How would you calculate the class weights in this case?
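One common heuristic for the class-weight question is inverse class frequency: weight = n_samples / (n_classes * n_samples_in_class). The sketch below is only an illustration of that heuristic, not a method agreed on in this thread. The function name and the Pidgey count are made up; the Rattata and Zapdos counts are the ones from the example above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rare classes get large weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label column: 200 Rattata, 99 Pidgey (made up), 1 Zapdos.
labels = ["Rattata"] * 200 + ["Pidgey"] * 99 + ["Zapdos"] * 1
weights = balanced_class_weights(labels)
```

With these counts the single Zapdos sample is weighted 200 times more heavily than a Rattata sample. Weights that extreme can hurt as much as help, which is one reason to first drop very rare classes with a minimum-appearance threshold.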

With the default libSVM setup I get these results:

TP Rate 0.358
FP Rate 0.074
Precision 0.420 
Recall 0.358
F-Measure 0.333
MCC 0.308
ROC Area 0.642
PRC Area 0.193

Changing parameters or class weights has not yet led to notable improvements, but I still have to test more.

goldbergtatyana commented 8 years ago

nah, I suggest that you shout loud when you have an arff file from the dummy data and then we do the prediction model development together. Though super interesting, it is a tricky process and so it is very good having several watchful minds involved in it in parallel!

Btw, what is the current bottleneck with the arff file? How can I help?

bensLine commented 8 years ago

@goldbergtatyana @goldbergtatyana @goldbergtatyana ! :P the dummy .arff file is already on dev https://github.com/PokemonGoers/PredictPokemon-2/blob/develop/arff/dummy.arff

goldbergtatyana commented 8 years ago

ahh!! this is loud, thanks 👍

good job, @bensLine for trying out weka with the file and also for the analysis of the features. Really very impressive!!!

Now let's start from the beginning and very small:

- We need to identify where to put the threshold for the number of appearances a Pokemon needs in order to be considered for prediction. We know the min and max numbers of appearances from your calculation. However, what share of the Pokemon in our data set (in %) is witnessed more than 10 times, 20 times, 30 times, and so on? We need to take a look at that. Best is to visualize this info as a cumulative plot (x axis: number of times, y axis: % of the Pokemon). Let me know if you need help.
- Once we have identified the threshold, we need to look at how our data set is composed: how many data points describe Pokemon 1, how many describe Pokemon 2, and so on. We need this information to have an idea of what we are actually dealing with. We can summarize this info in a table.
- Then, I suggest creating a classifier first for the Pokemon with the most appearances, as this will be the simplest classification task. If we do so, our prediction classes become yes or no: every data point where this Pokemon is spotted gets the yes class assigned, and all other data points (i.e. where other Pokemon are spotted) get the no class.
- Afterwards, yep, we throw out meaningless features such as upvotes and downvotes (as they are always 1 and 0). Also, since we will be predicting Pokemon in new locations, we won't have information about user ids, so this feature is also meaningless.
- Then we run Weka (several classification approaches). Please report the performance of all approaches you try.
- Once we have a best-performing approach, we will tweak the threshold on the prediction score (which discriminates between the yes and no classes) to get more correct predictions. But let's do this later. First, let's get the previous points done.
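The first two points of the list can be sketched in a few lines of Python. This is only an illustration: the function name and the sample IDs are made up, and "more than N times" is read as a strict inequality.

```python
from collections import Counter


def share_above_threshold(pokemon_ids, thresholds):
    """Percentage of distinct Pokemon witnessed more than t times, per threshold t."""
    counts = Counter(pokemon_ids)  # also serves as the per-Pokemon composition table
    n_species = len(counts)
    return {
        t: 100.0 * sum(1 for c in counts.values() if c > t) / n_species
        for t in thresholds
    }


# Hypothetical sightings: four distinct Pokemon with 25, 12, 11 and 1 appearances.
sightings = [16] * 25 + [19] * 12 + [10] * 11 + [145] * 1
shares = share_above_threshold(sightings, thresholds=(10, 20))
```

Plotting the thresholds on the x axis against these percentages on the y axis gives the cumulative plot described in the first point.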

Does it all make sense to you? Just please shout out loud again :D if anything is unclear or you have suggestions/new ideas and comments, which are always always welcome! 😄

bensLine commented 8 years ago

Alright, so we'll build one-vs-all classifiers and compare their results in the end for our predictions?

How we produce the plots is up to us I guess, @goldbergtatyana? E.g. with Python, or do we have to use JavaScript?

goldbergtatyana commented 8 years ago

For now, yes, we start with a simple binary classifier, and yes, of course it is up to you how you create the plots. They are for our internal evaluations only.

bensLine commented 8 years ago

@goldbergtatyana here are the results for the first two points of your comment. The first diagram is for the dummy data set, which consists of about 600 entries:

[Image: dummy_witnessed_ntimes_20160904]

However, the results of the second diagram should be more interesting. They are for the API data, which was retrieved today and has about 2500 entries:

[Image: apidata_witnessed_ntimes_20160904]

It shows that a threshold of 5 would include ~60% of the Pokemon, 10 -> 45%, 15 -> 33%, 20 -> 30%. What do you think about that?

@marwage just created the .arff file for Pidgey (pokemonId 16), see #20, as it is the Pokemon that appears most often in both the API and dummy data. The file is available at dummy_pidgey.arff
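For readers unfamiliar with the format, a binary one-vs-rest ARFF file can be produced roughly as follows. This is a hedged sketch and not the actual layout of dummy_pidgey.arff: the function name, the relation name, the attribute selection and the coordinates are all assumptions.

```python
def to_binary_arff(rows, target_id, relation="pokemon_binary"):
    """Build ARFF text with a yes/no class for one target Pokemon.

    rows: iterable of (latitude, longitude, pokemon_id) tuples.
    """
    lines = [
        f"@RELATION {relation}",
        "",
        "@ATTRIBUTE latitude NUMERIC",
        "@ATTRIBUTE longitude NUMERIC",
        "@ATTRIBUTE class {yes,no}",
        "",
        "@DATA",
    ]
    for lat, lon, pid in rows:
        # yes if the sighting is the target Pokemon, no for every other Pokemon
        label = "yes" if pid == target_id else "no"
        lines.append(f"{lat},{lon},{label}")
    return "\n".join(lines) + "\n"


# Hypothetical sightings; 16 is Pidgey's pokemonId.
rows = [(48.15, 11.57, 16), (48.14, 11.58, 19)]
arff_text = to_binary_arff(rows, target_id=16)
```

Declaring the class as a nominal attribute with the values yes and no lets Weka treat the task as binary classification.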

goldbergtatyana commented 8 years ago

Thanks @bensLine for the nice plots. The first thing that catches my eye is the very limited number of data points (2500). Is this the data that was collected for just one day? If so, then this is a very small time interval we are looking at. Best would be to go back one week or even one month (which is another aspect to optimize for best results). Let's start with an interval of one month (~75,000 data points), and going from there we'll look at smaller intervals.

Next, the arff file. Looks great, good job! However, we can optimize it even further. Let's get the other features incorporated in it first and then we'll tackle the optimization! :)

bensLine commented 8 years ago

Thanks @goldbergtatyana, yes that's true, we also noticed the limited amount of data. @semioniy has already asked team A (https://github.com/PokemonGoers/PokeData/issues/71#issuecomment-244663781) whether the GET request is limited to returning only ~2500 items.

Cool, then we'll start adding the features :)
