PokemonGoers / PredictPokemon-2

In this project we apply machine learning to establish TLN (Time, Location and Name) prediction in Pokemon Go - that is, where Pokemon will appear, at what date and time, and which Pokemon it will be.
Apache License 2.0

Upload data set into Kaggle #35

Closed gyachdav closed 7 years ago

gyachdav commented 8 years ago

Please generate a Kaggle page for our Pokemon Go prediction and turn it into a challenge. For the time being, keep the challenge invite-only. Please upload your datasets to the Kaggle page.

Here is an example of a Kaggle page that was inspired by our Game of Thrones project and uses the datasets we generated.

And now to this semester's surprise challenge:

Once the page is set up, a group from Microsoft Bing's Core Relevance and Ranking team will be invited to the challenge and will try to offer their own predictions.

We have every confidence that the TUM team will come out on top! 🙏

goldbergtatyana commented 7 years ago

Oh oh, what does this mean for the result of our predictions??

bensLine commented 7 years ago

Ok, it seems like that was the problem. I looked over several entries in the data set and most had this mismatch; sometimes they are correct, but only by chance.

I'm not sure - if the data set that was used for our results included weather features, then the results are most likely invalid :( And as far as I know it did... @MatthiasBaur, did you run the latest test completely without weather features, so that they were not in the .arff file?

semioniy commented 7 years ago

@goldbergtatyana so far we have been predicting the wrong thing :)

goldbergtatyana commented 7 years ago

That's super strange to me, because on the new data set @MatthiasBaur got exactly the same result as on this one.

semioniy commented 7 years ago

If we assume that with our previous dataset (with the wrong classes) our prediction was comparable to random, then our current prediction is not much better.

goldbergtatyana commented 7 years ago

Can you give me the numbers for random, @semioniy?

semioniy commented 7 years ago

I don't have any numbers. I just thought that if we have been predicting the wrong classes until now, it is the same as random.

goldbergtatyana commented 7 years ago

Ok, one way to check the random performance is to write a small program that randomly assigns one of the 151 classes to each of the 10,000 instances in @MatthiasBaur's set. The overall accuracy would then be number_of_correct_predictions / 10,000.

Another way is to calculate the random performance mathematically, as follows: ((1/151 * frequency_of_pokemon1) + (1/151 * frequency_of_pokemon2) + ... + (1/151 * frequency_of_pokemon151)) / 10,000

We assume that the predictor assigns each of the 151 classes with the same probability, so each class has a 1/151 chance of being predicted. frequency_of_pokemon1 is the number of data points of pokemon1 in @MatthiasBaur's 10K data set.

So, what is the random performance?
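
A minimal sketch of that simulation in Python, assuming the 10k instances are exported to a CSV file; the file name sightings_10k.csv and the column name pokemonId are placeholders, not the actual export:

```python
import random

import pandas as pd

# Hypothetical export of the 10k instance set; file and column names are assumptions.
data = pd.read_csv("sightings_10k.csv")
true_classes = data["pokemonId"].tolist()
class_ids = list(range(1, 152))  # the 151 possible Pokemon classes

random.seed(42)  # fixed seed so the check is reproducible
predictions = [random.choice(class_ids) for _ in true_classes]

correct = sum(p == t for p, t in zip(predictions, true_classes))
print("simulated random accuracy:", correct / len(true_classes))
```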

goldbergtatyana commented 7 years ago

Guys, please update the file and upload a new version of it on Kaggle ASAP ☝️

semioniy commented 7 years ago

@goldbergtatyana I am still gathering the weather data for it, as we had it in the first dataset. It takes time.

goldbergtatyana commented 7 years ago

I thought the mistake happened when the data sets were merged (weather and the remaining set)? If that is the case, then there is no need to recalculate the weather features.

semioniy commented 7 years ago

The mistake happened because of the bug. The bug is resolved now, so I just need to re-gather the dataset.

semioniy commented 7 years ago

A new version is there, ready for testing.

goldbergtatyana commented 7 years ago

Yep, the classes and Pokemon IDs match now, thank you @semioniy. Another unexpected observation I made is that for most sightings the co-occurrences with other Pokemon are different between version 2 and version 3.

Can you explain where this difference comes from?

bensLine commented 7 years ago

@goldbergtatyana the co-occurrences calculation was also affected by the same bug. Thanks for pointing it out!

sacdallago commented 7 years ago

New statistics? 🏄

semioniy commented 7 years ago

My best result so far (with both 10k and 300k) was 21.8% correctly predicted Pokemon, with only 8 features.

goldbergtatyana commented 7 years ago

That is great @semioniy! You need to tell us what you did, what the features were, and what random is.

semioniy commented 7 years ago

I took the 10k dataset after the bug was corrected, classified with BayesNet while deleting features, and ended up with Voting (BayesNet + IBk) on closeToWater, city, urban, cooc13, cooc16, cooc19 and cooc129, and got 21.8% correct predictions. I got the same result afterwards with 300k, so 10k is enough. As I already wrote about random, it's purely my perception, I did no math. Just by gut feeling I think real random would be around 6-7%. I just thought that if we predict the wrong thing, it's the same as not predicting and just picking a class by chance.
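
For reference, a rough scikit-learn analogue of that setup. The actual experiments used Weka's Vote meta-classifier with BayesNet and IBk; this sketch swaps in Naive Bayes and k-NN (scikit-learn has no Bayesian-network classifier), and the file and column names are the same hypothetical placeholders as before:

```python
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical CSV export of the 10k set; feature names follow the comment above.
data = pd.read_csv("sightings_10k.csv")
features = ["closeToWater", "city", "urban", "cooc13", "cooc16", "cooc19", "cooc129"]
X = pd.get_dummies(data[features])   # one-hot encode categorical features such as city
y = data["pokemonId"]

# Majority vote over a Bayes-style model and a nearest-neighbour model (IBk ~ k-NN with k=1).
vote = VotingClassifier(
    estimators=[("nb", GaussianNB()), ("knn", KNeighborsClassifier(n_neighbors=1))],
    voting="hard",
)
scores = cross_val_score(vote, X, y, cv=10, scoring="accuracy")
print("10-fold CV accuracy:", scores.mean())
```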

goldbergtatyana commented 7 years ago

I took the 10k dataset after the bug was corrected, classified with BayesNet while deleting features, and ended up with Voting (BayesNet + IBk) on closeToWater, city, urban, cooc13, cooc16, cooc19 and cooc129, and got 21.8% correct predictions. I got the same result afterwards with 300k, so 10k is enough.

Cool!

As I already wrote about random, it's purely my perception, I did no math. Just by gut feeling I think real random would be around 6-7%.

That's cute to read, but the gut-feeling approach is not scientific at all 😈 !!! A naive random baseline would be to divide 100% by 151 classes, but this calculation assumes that each class is represented by an equal number of Pokemon. A correct way to calculate random is described a few comments above.

So, what is random?

semioniy commented 7 years ago

((1/151 * frequency_of_pokemon1) + (1/151 * frequency_of_pokemon2) + ... + (1/151 * frequency_of_pokemon151)) / 10,000

= 1/151 * (frequency_of_pokemon1 + frequency_of_pokemon2 + ... + frequency_of_pokemon151) / 10,000; the sum of the frequencies is 1, so it is 1/151 * 1/10,000

I'll think of how to figure this number out, but why do we need it in the first place?

goldbergtatyana commented 7 years ago

I'll think of how to figure this number out

It is simpler than you think. All you need to do is count the number of pokemon 1 in our 10K dataset, of pokemon 2, and so on (see the sketch below).

why do we need it in the first place?

how else do we know that we are not random?
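
A quick sketch of that counting step and the baseline it yields, using the same hypothetical file and column names as in the earlier snippet:

```python
from collections import Counter

import pandas as pd

# Count how often each of the 151 classes appears in the (hypothetical) 10K export.
data = pd.read_csv("sightings_10k.csv")
counts = Counter(data["pokemonId"])
n = len(data)

# Expected accuracy of a predictor that picks each of the 151 classes uniformly at random.
random_accuracy = sum((1 / 151) * freq for freq in counts.values()) / n
print("analytic random baseline:", random_accuracy)  # works out to 1/151 regardless of the counts
```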

semioniy commented 7 years ago

Well, after a couple of hours of attempts I can conclude: if this is right: expectedAccuracy = SUM((1/numOfClasses) * numOfInstancesInEachClass / numOfAllInstances), then in both cases (even and uneven distribution) the fraction of correctly predicted instances is equal to 1/numOfClasses. In that case our random baseline would be 1/151 ≈ 0.66%. We have 21-22%, so it seems not to be random. I still have no idea how we could get 21% with the wrong classes.
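
For completeness, here is why the class distribution drops out of that formula: with $K = 151$ classes whose counts $n_1, \dots, n_K$ sum to $N$,

$$\frac{1}{N}\sum_{k=1}^{K}\frac{1}{K}\,n_k \;=\; \frac{1}{K}\cdot\frac{\sum_{k=1}^{K}n_k}{N} \;=\; \frac{1}{K} \;=\; \frac{1}{151} \approx 0.66\%,$$

so a uniformly random predictor lands at 1/151 no matter how uneven the class distribution is.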

goldbergtatyana commented 7 years ago

Which reminds me @semioniy of the kaggle thing again 😼

I highlighted a few issues and left comments on them in the Google doc; I see they are missing on the Kaggle site. Please add those to the description of our data set: https://docs.google.com/document/d/1dIKvxOshOCnu2by5gIQR3rceUAs_4cfhg3TDxS9bPnM/edit

And one more 'something is missing': what happened to our awesome 2G sightings data set? Is help needed for splitting it? Let us know.

goldbergtatyana commented 7 years ago

@semioniy: NOW we need the 2G dataset on Kaggle, this really cannot wait any longer.

Too many people are looking at the data and one of our jewels is not up yet 😿 What problems are you running into with uploading the set?

semioniy commented 7 years ago

It is already there, in the Datadump discussion.

goldbergtatyana commented 7 years ago

Oh ok, I have only been on mobile since yesterday, so I couldn't see it on Kaggle myself yet. Thanks for putting it up and thanks for working on a more comprehensive description for the first file, @semioniy!

semioniy commented 7 years ago

Descriptions on Kaggle have been updated.

sacdallago commented 7 years ago

I think we can close this for now