nah, I suggest that you shout loud when you have an arff file from the dummy data and then we do the prediction model development together. Though super interesting, it is a tricky process and so it is very good having several watchful minds involved in it in parallel!
Btw, what is the current bottleneck with the arff file? How can I help?
@goldbergtatyana @goldbergtatyana @goldbergtatyana ! :P the dummy .arff file is already on dev https://github.com/PokemonGoers/PredictPokemon-2/blob/develop/arff/dummy.arff
ahh!! this is loud, thanks 👍
Good job, @bensLine, for trying out Weka with the file and also for the analysis of the features. Really very impressive!!!
Now let's start from the beginning and very small:
Does it all make sense to you? Just please shout out loud again :D if anything is unclear or you have suggestions/new ideas and comments, which are always always welcome! 😄
Alright, so we'll build one-vs-all classifiers and compare their results in the end for our predictions?
How we produce the plots is up to us, I guess, @goldbergtatyana? E.g. with Python, or do we have to use JavaScript?
For now yes, we start with a simple binary classifier, and yes, of course it's up to you how you create the plots. They are for our internal evaluations only.
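For a concrete starting point, a minimal Weka baseline could look like the sketch below. It assumes the dummy ARFF is checked out at `arff/dummy.arff` and that the class attribute is the last one; both the path and the attribute position are assumptions, not confirmed project layout.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BinaryBaseline {
    public static void main(String[] args) throws Exception {
        // Load the dummy ARFF (path is an assumption, adjust to your checkout)
        Instances data = new DataSource("arff/dummy.arff").getDataSet();
        // Assume the class attribute is the last one
        data.setClassIndex(data.numAttributes() - 1);

        // Plain SMO with default settings as a first baseline classifier
        SMO smo = new SMO();

        // 10-fold cross-validation for the internal evaluation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smo, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
    }
}
```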
@goldbergtatyana here are the results for the first two points of your comment. The first diagram is for the dummy data set, which consists of about 600 entries; however, the results of the second diagram should be more interesting. They are for the API data, which was retrieved today and has about 2500 entries. It shows that a threshold of 5 would include ~60% of the Pokemon, 10 -> 45%, 15 -> 33%, 20 -> 30%. What do you think about that?
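In case it helps to reproduce the numbers above, the threshold curve boils down to a few lines like the following sketch (the `pokemonIds` list, one entry per sighting, is a hypothetical input parsed from the API response):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ThresholdCoverage {
    // pokemonIds: one entry per sighting, e.g. parsed from the API response
    static void printCoverage(List<Integer> pokemonIds, int... thresholds) {
        // Count how many sightings each Pokemon species has
        Map<Integer, Integer> counts = new HashMap<>();
        for (int id : pokemonIds) {
            counts.merge(id, 1, Integer::sum);
        }
        // For each threshold, report the fraction of species that keep
        // at least that many samples
        for (int t : thresholds) {
            long kept = counts.values().stream().filter(c -> c >= t).count();
            System.out.printf("threshold %d -> %.0f%% of species kept%n",
                    t, 100.0 * kept / counts.size());
        }
    }
}
```

Called with thresholds 5, 10, 15, 20 it should reproduce the percentages quoted above.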
@marwage just created the .arff file for Pidgey (#20), as it is the Pokemon that appears most often in the API and dummy data (ID 16). The file is available at dummy_pidgey.arff
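For anyone who wants to rebuild such a one-vs-rest file, one way to do it is to collapse `pokemonId` into a binary Pidgey/other label. This is only a sketch: it assumes `pokemonId` is declared as a nominal attribute in the ARFF header and that Weka's `MakeIndicator` filter behaves as documented.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.MakeIndicator;

public class PidgeyVsRest {
    public static void main(String[] args) throws Exception {
        // Load without setting a class index so the filter may modify pokemonId
        Instances data = new DataSource("arff/dummy.arff").getDataSet();

        // Replace the nominal pokemonId with a binary indicator attribute
        MakeIndicator indicator = new MakeIndicator();
        indicator.setAttributeIndex("last"); // pokemonId assumed to be last
        indicator.setValueIndices("16");     // hypothetical: 1-based position of
                                             // the Pidgey label in the ARFF header,
                                             // not the Pokemon ID itself
        indicator.setNumeric(false);         // keep the result nominal
        indicator.setInputFormat(data);
        Instances binary = Filter.useFilter(data, indicator);

        // The indicator attribute replaces pokemonId in place, so it is still last
        binary.setClassIndex(binary.numAttributes() - 1);
        System.out.println(binary.classAttribute());
    }
}
```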
Thanks @bensLine for the nice plots. The first thing that jumps out at me is the very limited number of data points (2500). Is this the data that was collected for just one day? If so, then this is a very small time interval we are looking at. It would be best to go back one week or even one month (which is another aspect to optimize for best results). Let's start with the interval of one month (~75,000 data points), and going from there we'll look at smaller intervals.
Next, the arff file. Looks great, good job! We can optimize it, however, even further. Let's get other features incorporated in it first and then we'll tackle the optimization! :)
Thanks @goldbergtatyana, yes, that's true, we also noticed the limited amount of data. @semioniy has already contacted team A (https://github.com/PokemonGoers/PokeData/issues/71#issuecomment-244663781) to ask whether the GET request is limited to returning only ~2500 items.
Cool, then we'll start adding the features :)
Hey @goldbergtatyana, the plain dummy data is not really helpful, as mentioned before. Features like `id`, `trainerName`, `userId`, and `deviceId` are unique or almost always have the same value, and I think they have no influence in general. Furthermore, `upvote` is always 1 and `downvote` always 0 in the data, so they are also useless. This leaves us with `created`, `latitude`, `longitude`, and `pokemonId`, and we use `pokemonId` as the class, since right now we evaluate the classifier that predicts which Pokemon will appear for a given location and time. When I train a classifier with this data the results are very bad, and parameter tuning does not change that, as the features are not good in their raw format. Therefore we should first replace `latitude` and `longitude` with S2 cells (#5) and extract time features (#8) from the data. With S2 we can group similar location points in one cell, and from the timestamp we can extract features like day or night.

However, another big issue, independent of dummy or real data, is the distribution of the Pokemon entries. We have an unbalanced data set, since not all Pokemon appear equally often, e.g. Rattata might have 200 samples whereas Zapdos has only 1. As a result, we have to use weighted classes, which do not seem to be supported by the SMO classifier in Weka, but libSVM can be used instead. How would you calculate the class weights in this case?
With the default libSVM setup I get these results:
Changing parameters or class weights has not yet resulted in good improvements, but I still have to test more.
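For reference, here is how I'm computing and setting the weights so far. It's only a sketch under two assumptions on my side: inverse class frequency, `weight_i = N / (numClasses * n_i)`, is one common heuristic (not something we've decided), and Weka's LibSVM wrapper exposes a `setWeights` option that takes one blank-separated value per class.

```java
import weka.classifiers.functions.LibSVM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WeightedLibSVM {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("arff/dummy.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // pokemonId assumed last

        // Count samples per class to derive inverse-frequency weights:
        // weight_i = N / (numClasses * n_i), so rare classes get larger weights
        int[] counts = new int[data.numClasses()];
        for (int i = 0; i < data.numInstances(); i++) {
            counts[(int) data.instance(i).classValue()]++;
        }
        StringBuilder weights = new StringBuilder();
        for (int c : counts) {
            double w = c > 0
                    ? (double) data.numInstances() / (data.numClasses() * c)
                    : 1.0; // classes absent from the sample keep the default weight
            weights.append(w).append(" ");
        }

        LibSVM svm = new LibSVM();
        // setWeights multiplies C per class (blank-separated, one value per class)
        svm.setWeights(weights.toString().trim());
        svm.buildClassifier(data);
    }
}
```

To put numbers on it: with 200 Rattata samples out of, say, 2500 total over 150 classes, Rattata would get weight 2500 / (150 * 200) ≈ 0.08, while a single Zapdos sample would get 2500 / (150 * 1) ≈ 16.7.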