PokemonGoers / PredictPokemon-2

In this project we apply machine learning to establish TLN (Time, Location and Name) prediction in Pokemon Go - that is, where Pokemon will appear, at what date and time, and which Pokemon it will be.
Apache License 2.0

Description of the ML method #70

Closed goldbergtatyana closed 7 years ago

goldbergtatyana commented 7 years ago

Hi Pokemon Predictors,

Thanks for training a new classifier. Please summarize the features you used for the final model and the configuration of the classifier (providing the command-line call is enough) in this issue, as all this information is currently scattered across the whole repository.

Please also provide your data sets (arff files) as well as the cross-validated results of your classifier here.

Thank you!

bensLine commented 7 years ago

True, there are bits and pieces everywhere x) However, the information about the classifier is in the wiki; I copied it from there:

Attributes

Classifier:

weka.classifiers.meta.Vote -S 1 -B "weka.classifiers.lazy.IBk -K 100 -W 0 -A \"weka.core.neighboursearch.LinearNNSearch -A \\\"weka.core.EuclideanDistance -R first-last\\\"\"" -B "weka.classifiers.bayes.BayesNet -D -Q weka.classifiers.bayes.net.search.local.K2 -- -P 1 -S BAYES -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5" -R PROD
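For reference, a minimal sketch (not the project's actual training code) of how this configuration could be reproduced through Weka's Java API and evaluated with 10-fold cross-validation; the file name training.arff is a placeholder, and the class attribute is assumed to be the last one:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Vote;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainVoteSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ARFF path; assumes the class attribute is the last one.
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // IBk with k = 100 and no training-window limit (-K 100 -W 0);
        // LinearNNSearch with EuclideanDistance is already IBk's default search.
        IBk ibk = new IBk();
        ibk.setKNN(100);
        ibk.setWindowSize(0);

        // BayesNet with its default K2 search and SimpleEstimator (alpha 0.5),
        // which corresponds to the quoted -Q/-E options.
        BayesNet bayes = new BayesNet();

        // Vote ensemble of the two base classifiers, combined with the product rule (-R PROD).
        Vote vote = new Vote();
        vote.setClassifiers(new Classifier[] { ibk, bayes });
        vote.setCombinationRule(new SelectedTag(Vote.PRODUCT_RULE, Vote.TAGS_RULES));

        // 10-fold cross-validation with a fixed seed, printing summary and per-class details.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(vote, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
    }
}
```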

@MatthiasBaur can you upload the training set and the corresponding result of the cv?

goldbergtatyana commented 7 years ago

Thanks @bensLine. I'm not sure the information in the wiki is up to date, though. According to @semioniy in #35, there should only be 7 features and the rare classes should not be combined.

Can you guys double check?

semioniy commented 7 years ago

@goldbergtatyana I used 7 features myself. I don't think that decision made it into the release, though - I probably had it only on my PC.

goldbergtatyana commented 7 years ago

You completely lost me. I thought that the configuration @MatthiasBaur came up with was based on the data set with mistakes. Once the mistakes were spotted and corrected, @semioniy found a different set of features and parameters for the ML algorithm to get the best results. Now, why are we using the old setting then?

Also, I'd like to know (and the map team would as well) what prediction threshold you decided on. That is, below what prediction probability do we discard Pokemon predictions?

We should have done it all at least two weeks ago 😑

semioniy commented 7 years ago

@semioniy found a different set of features and parameters for the ML algorithm to get

the same results. So we didn't change anything, as it made no sense.

I don't know anything about a threshold. @bensLine, @MatthiasBaur, @Aurel-Roci, @marwage?

goldbergtatyana commented 7 years ago

@PokemonGoers/predictpokemon-2 the prediction part is the core of the entire app, so it has to be done properly. The previous performance was reported on a data set that had mistakes!!! We cannot go with results based on that data. @semioniy did another, proper benchmarking. It won't take much effort to change a few lines of your code to adapt it to the parameters and features Semion found to be important. I'm sure you can do it!

@MatthiasBaur we spoke about setting a threshold a few times, and I had the impression that you understood it. Please either set the threshold yourself (with an explanation, of course) or make sure your group does it, so that we do not predict a Pokemon at every point of the 9x9 matrix.

Guys, we are really short on time. I urge you to take care of these last two items ASAP!

bensLine commented 7 years ago

Hey @goldbergtatyana, the way I understood it, the results from @MatthiasBaur were still correct, as the mistakes had not affected the data set he used. Anyway, we can change the parameters - do we still have them, @semioniy?

There is now a threshold parameter which can be set for the prediction package, but I'm not sure about the value, as we cannot test on real data. https://github.com/PokemonGoers/PokeData/issues/184#issuecomment-253855885:

the latest data is from around the 24th of September. Please note that the data there is not really meant to be something to work with. If you want nice and consistent data, ask @sacdallago (I'm thinking of the 6GB thing).

@sacdallago is the data now available? The last info was about the AWS fax verification ^^

goldbergtatyana commented 7 years ago

Thanks for the super quick reply @bensLine !

@MatthiasBaur's results must have been affected by the mistakes, as the classes were swapped across instances. Yes, it would be great if you guys could adapt the code to @semioniy's results soon!

Now to the threshold. @semioniy will provide you with his results from the cross-validation. Just decide on a threshold (e.g. a prediction probability above 80) and see how many Pokemon are still predicted correctly and how many of the correctly predicted ones are dismissed (accuracy vs. coverage, also known as precision vs. recall; I provided the formulas in earlier comments). We expect that the higher we set the threshold, the more accurate the predictions become, but at the same time the more correct predictions we dismiss. So the trick is to find a good balance between accuracy and coverage. Let me know if you have questions. This is really simple!
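To make the procedure concrete, here is a rough sketch (assuming the cross-validated predictions are available as Weka NominalPrediction objects, e.g. via Evaluation.predictions()) that sweeps over candidate thresholds and reports accuracy and coverage at each one:

```java
import java.util.ArrayList;

import weka.classifiers.evaluation.NominalPrediction;
import weka.classifiers.evaluation.Prediction;

public class ThresholdSweepSketch {

    // predictions: the per-instance cross-validated predictions,
    // e.g. obtained from weka.classifiers.Evaluation#predictions().
    public static void sweep(ArrayList<Prediction> predictions) {
        for (int t = 0; t <= 95; t += 5) {
            double threshold = t / 100.0;
            int kept = 0, correct = 0;
            for (Prediction p : predictions) {
                NominalPrediction np = (NominalPrediction) p;
                // Probability the classifier assigned to its predicted class.
                double confidence = np.distribution()[(int) np.predicted()];
                if (confidence < threshold) {
                    continue; // below the threshold: the prediction is dismissed
                }
                kept++;
                if (np.predicted() == np.actual()) {
                    correct++;
                }
            }
            // accuracy: share of kept predictions that are correct;
            // coverage: share of all predictions that are kept.
            double accuracy = kept == 0 ? 0.0 : 100.0 * correct / kept;
            double coverage = 100.0 * kept / predictions.size();
            System.out.printf("threshold %3d%%: accuracy %5.1f%%, coverage %5.1f%%%n",
                    t, accuracy, coverage);
        }
    }
}
```

Plotting these two curves against the threshold makes the accuracy/coverage trade-off described above directly visible.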

bensLine commented 7 years ago

@goldbergtatyana thanks for the explanation. 40 seems to be the maximum value, judging from the data dump set. However, we'll look into that after the features are adapted.

goldbergtatyana commented 7 years ago

What is 40? The threshold needs to be derived from the prediction probability scores you got during cross-validation.

semioniy commented 7 years ago

That's a quote of what I wrote in #35: Voting (BayesNet + IBk) with closeToWater, city, urban, cooc13, cooc16, cooc19, cooc129 got 21.8% correct predictions. I got the same result afterwards with 300k instances, so 10k is enough.
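For illustration, a minimal sketch (not the project's actual preprocessing; training.arff is a placeholder path) of how an ARFF data set could be reduced to these seven features plus the class attribute using Weka's Remove filter:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KeepSevenFeaturesSketch {

    public static Instances keepFeatures(Instances data) throws Exception {
        Set<String> keep = new HashSet<>(Arrays.asList(
                "closeToWater", "city", "urban", "cooc13", "cooc16", "cooc19", "cooc129"));

        // Collect the 1-based indices of the attributes to keep, plus the class attribute.
        ArrayList<String> indices = new ArrayList<>();
        for (int i = 0; i < data.numAttributes(); i++) {
            if (keep.contains(data.attribute(i).name()) || i == data.classIndex()) {
                indices.add(String.valueOf(i + 1));
            }
        }

        Remove remove = new Remove();
        remove.setAttributeIndices(String.join(",", indices));
        remove.setInvertSelection(true); // keep the listed attributes, drop the rest
        remove.setInputFormat(data);
        return Filter.useFilter(data, remove);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder ARFF path; assumes the class attribute is the last one.
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(keepFeatures(data).toSummaryString());
    }
}
```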

sacdallago commented 7 years ago

@bensLine things have changed a lot :) We have a database up and running, but it's mining only Twitter, as the other services are down at the moment :( And, and: no, AWS isn't solved, so I'm using GCP, but I am exchanging emails with AWS to see if we can get that up and running.

bensLine commented 7 years ago

Pity! :/ Where should we retrieve the data from, @sacdallago? Still http://pokedata.c4e3f8c7.svc.dockerapp.io:65014/api/pokemon/sighting/ts/2016-09-14T08:00:00.000Z/range/1d ?

@goldbergtatyana indeed, the 40% is the confidence threshold for predictions, derived from looking at the results of the data dump.

goldbergtatyana commented 7 years ago

Can you describe how you decided on the 40% threshold, @bensLine?

goldbergtatyana commented 7 years ago

Hey @PokemonGoers/predictpokemon-2, it has already been five days and there is still no description of the answer you figured out within one day, i.e. what made you conclude that 40% should be the threshold for the ML predictions of Pokemon?

I don't want to repeat myself again about the math behind the threshold. If you are unsure about it, however, I will be happy to go over it with you, also in person.

Let's get it finally done!

bensLine commented 7 years ago

@goldbergtatyana yes, sorry. The 40% is not final; it was just a guess and I only used some test data for it. I think we should define the threshold on real data from our API. However, the API https://predictemall.online/api/pokemon/sighting/ts/2016-11-01T08:00:00.000Z/range/1h seems to be down right now. @sacdallago, do you know anything?

goldbergtatyana commented 7 years ago

Thanks @bensLine. You decide on the threshold based on the cross-validated predictions of @semioniy. You take his predictions, where for every instance we have information on the observed and the predicted class. Then you treat everything below a prediction probability X (i.e. your threshold) as false predictions and compute accuracy and coverage. The goal is to have accuracy and coverage at their maximum, while keeping both in balance.

Let me know if you need more help.

bensLine commented 7 years ago

Ok, here we go. What do you think about a threshold of 65%, @goldbergtatyana? According to the plot, which is based on the cross-validated predictions, this would result in an accuracy of 40% and a coverage of 15%, so we would have about 12 predictions for the 9x9 grid.

[plot: accuracy and coverage vs. prediction threshold]

accuracy = #correct / #predictions, considering only predictions above the threshold
coverage = #aboveThreshold / #predictions, considering all predictions

goldbergtatyana commented 7 years ago

Good job on the graph, @bensLine! We would suggest rather going for the 50% threshold, at which the accuracy is still at a mind-blowing 30% and the coverage is not as low (it is then at 40%).