PokemonGoers / PredictPokemon-2

In this project we will apply machine learning to establish TLN (Time, Location and Name) prediction in Pokemon Go: where Pokemon will appear, at what date and time, and which Pokemon it will be.

Upload data set into Kaggle #35

Closed gyachdav closed 8 years ago

gyachdav commented 8 years ago

Please generate a Kaggle page for our PokemonGo prediction and turn it into a challenge. For the time being, keep the challenge invite-only. Please upload your datasets to the Kaggle page.

Here is an example of a Kaggle page that was inspired by our Game of Thrones project and uses the datasets we generated.

And now to this semester's surprise challenge:

Once the page is set up, a group from Microsoft Bing's Core Relevance and Ranking team will be invited to the challenge and will try to offer their own predictions.

We have every confidence that the TUM team will come out on top! 🙏

gyachdav commented 8 years ago

Status?

semioniy commented 8 years ago

Status: still busy with the weather data :)

sacdallago commented 8 years ago

For the sightings data: I am creating a dump of the production DB, which has been collecting over the last 3 days (almost 6 GB of data).

How is it going on your end, @semioniy?

semioniy commented 8 years ago

@sacdallago we now have the 50k dataset with all current features.

sacdallago commented 8 years ago

Oh nice! :)

semioniy commented 8 years ago

Now the only question is how Kaggle works, and how fast they answer me.

semioniy commented 8 years ago

BTW, @gyachdav @sacdallago, when Kaggle answers me about setting up a challenge, should I promise some kind of prize? As I understand it, that is the point of Kaggle: finding a better method and winning a prize.

gyachdav commented 8 years ago

No prize. There are many challenges that exist just for fame and glory.


goldbergtatyana commented 8 years ago

You'll update us on the news from Kaggle, right @semioniy? 😈

semioniy commented 8 years ago

Yeah, sure. But there's not much to say, though. They contacted me asking what the project is about and what we want. I answered them and am now waiting for their reaction. The bad thing is that we only contact each other via email, so the conversation is pretty slow.

gyachdav commented 8 years ago

Please email Kaggle again and ask for a status update. Please put us (mentors) in CC. Thanks.


semioniy commented 8 years ago

@gyachdav we emailed just today, so there is probably no status update yet. When they write to me next time, I'll put you in CC.

semioniy commented 8 years ago

https://www.kaggle.com/semioniy/predictemall

semioniy commented 8 years ago

I uploaded it and made it ready; maybe they just need some time.

goldbergtatyana commented 8 years ago

Thanks @semioniy. The Kaggle page looks good, and I especially like your description of the challenge. What I am missing, though, is a description of the data points and their features, i.e. which sources were used to get the data and what the possible values of our features are.

sacdallago commented 8 years ago

Also, please note that a new dataset should be uploaded: the one mined by our own team A, @PokemonGoers/pokedata. The dataset is on its way through something similar to UPS, in the form of our magnificent @goldbergtatyana :laughing:

goldbergtatyana commented 8 years ago

Thanks @sacdallago for creating the file!

The JSON file with the copy of our MongoDB can now be downloaded from https://rostlab.org/~goldberg/catchemall.json

@semioniy, please make a new CSV file with features for the data from this file and upload it to Kaggle. @MatthiasBaur, please go ahead and use this data to verify the prediction performance of the algorithm you found to provide the best results.

goldbergtatyana commented 8 years ago

The file is ~2 GB in size and contains ~9M entries.

It can be downloaded directly to your Linux machine using

wget https://rostlab.org/~goldberg/catchemall.json

semioniy commented 8 years ago

Hey @goldbergtatyana, @sacdallago, first you told me to upload a dataset with as many features as possible. But now it's either all features or the 9M dataset. If I upload the 9M dataset, it's only possible with fewer features, which basically makes it another dataset, not a new version.

P.S. I mean that I can't gather weather data for 9M entries, and with different features it's a different dataset.

sacdallago commented 8 years ago

@semioniy thanks for the update. Can you not upload this as an additional dataset, as is? It's not necessary to calculate the features for these for now, at least not for Kaggle. You are totally correct: this should not replace the existing dataset, and it is not meant as a new version, simply because most of these points are not verifiable, and we actually know that some are purely statistical with no relevance (though we don't know which ones).

If needed, we can sample some of the points in this data and calculate features for those internally. Additionally, this data includes a source label, which might help figure out which services are reliable and which are not. And I guess that's what @MatthiasBaur is having fun with ATM.

MatthiasBaur commented 8 years ago

Lots of fun :)

goldbergtatyana commented 8 years ago

OK, just to repeat: @semioniy, please upload the new file as an additional file for sightings. Just please add descriptions for both files.

@MatthiasBaur, we need to sample features for 10K randomly drawn data points from this file (preferably from the last day; that's what you did on the previous data set, right?), but only those features that you found to be useful for the reddit data set. Then we need to apply 10-fold CV to confirm the prediction performance of ~21% on this data set as well.

goldbergtatyana commented 8 years ago

BTW, @MatthiasBaur and @semioniy: since we cannot predict rare Pokemon anyway (too little data, and they seem not to follow the appearance patterns of other Pokemon), we don't need to combine them into one class. Why the extra work?

goldbergtatyana commented 8 years ago

@semioniy please upload the new dataset to Kaggle, including descriptions of the data sets. Let us know if you need help. Thank you!

semioniy commented 8 years ago

@goldbergtatyana sorry, I had problems with the internet in my dorm; I'm online again now. I can't download the dataset; can you upload it somewhere else? Dropbox would be perfect.

P.S. Do I understand right that I'm now responsible for the Kaggle page? People keep asking questions there :)

goldbergtatyana commented 8 years ago

That's cool @semioniy, here is the link https://www.dropbox.com/s/5js9vvsgerph0pi/catchemall.json?dl=0

semioniy commented 8 years ago

Hi @goldbergtatyana, this file is not much use, because neither WebStorm nor any other text editor can handle it. 9M entries seem to be too much. Maybe if it were 9 files with a million entries each I could convert them, but I still doubt there would be a way to merge them back together without some server tricks (which I don't know). As is, I don't think there is much sense in uploading the file, because it's just coordinates, date and class, isn't it?

goldbergtatyana commented 8 years ago

Hey @semioniy, there is no need to open the file with a text editor, as long as we provide a description for the file and maybe the first ten lines of the file as a sample 😁

Please go ahead and upload the whole file as is. The file will then be read programmatically or with Linux tools.

sacdallago commented 8 years ago

@semioniy

a) If you're using Windows, the best new feature of Windows 10: LINUX

  1. https://msdn.microsoft.com/en-us/commandline/wsl/install_guide

b) How to check out file contents

  1. http://ss64.com/bash/more.html
  2. http://ss64.com/bash/less.html
  3. http://ss64.com/bash/tail.html

goldbergtatyana commented 8 years ago

Nice @sacdallago, thanks a lot for the hints 👍

@MatthiasBaur can you confirm the same prediction performance of ~21% on 10K randomly chosen data points from the big set?

sacdallago commented 8 years ago

Ah, ah. Since this is a JS seminar, you can do the following too:

  1. Start node from within the folder with that file (let's suppose it's called file.json).
  2. Once the node console is open, write var t = require('file');
  3. Now you can do whatever you want with those entries (it's an array, if I'm not mistaken). E.g., from the console:
> console.log(t[0]); // First element
> var last_element = t[t.length - 1]; // Last element
> console.log(last_element);
> for (var i = 0; i < 10; i++) { console.log(t[i]); } // First 10 elements
> var item = t[Math.floor(Math.random()*t.length)]; // Random element

If you write a JS script that randomly draws x elements from the array (with or without replacement), you can even build your own 10K dataset; see the sketch below.

No need to ever use a text editor :)
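
A minimal sketch of such a sampling script, assuming the dump parses as a single JSON array and fits in memory (the file names and the sample size of 10000 are illustrative):

// sample.js - draw sampleSize entries uniformly at random, without replacement
var fs = require('fs');

var entries = JSON.parse(fs.readFileSync('./file.json', 'utf8')); // assumes a valid JSON array
var sampleSize = 10000;

// Partial Fisher-Yates shuffle: after the loop, the first sampleSize slots
// hold a uniform random sample of the whole array.
for (var i = 0; i < sampleSize; i++) {
  var j = i + Math.floor(Math.random() * (entries.length - i));
  var tmp = entries[i];
  entries[i] = entries[j];
  entries[j] = tmp;
}

fs.writeFileSync('./sample10k.json', JSON.stringify(entries.slice(0, sampleSize)));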

semioniy commented 8 years ago

@sacdallago thanks for the advice about node. After minor changes ('./file' instead of 'file') it worked, but only with smaller files. Even located in the same folder, catchemall.json doesn't get imported with require; I get an error: Error: toString failed. I assume it's because of the file size, but maybe the file is corrupted. Did you or @goldbergtatyana try to open it yourself? I'll try Sublime now and report how it went.

Update: It worked with Sublime; now I'll upload the file. BTW, yes, this could not be read as an array because the [ and ] were missing.
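
For a file this size, a streaming read sidesteps the string-size limit behind that toString error. A minimal sketch using Node's built-in readline, assuming the dump holds one JSON object per line (adjust the parsing if the real layout differs):

// scan.js - scan a huge line-delimited JSON file without loading it into memory
var fs = require('fs');
var readline = require('readline');

var rl = readline.createInterface({ input: fs.createReadStream('./catchemall.json') });
var count = 0;

rl.on('line', function (line) {
  line = line.trim().replace(/,$/, ''); // tolerate trailing commas between objects
  if (!line) return;
  var entry = JSON.parse(line);         // throws if a line is not one complete object
  if (count < 10) console.log(entry);   // print the first ten entries as a sample
  count++;
});

rl.on('close', function () {
  console.log('total entries: ' + count);
});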

gyachdav commented 8 years ago

@semioniy check the upload restrictions on Kaggle

semioniy commented 8 years ago

Aaand, this is what Kaggle says when uploading the file:

The zip file 'catchemall.zip' contains a file 'catchemall.json' (1.94 GB) that exceeds the max size allowed of 500.00 MB.

gyachdav commented 8 years ago

Yeah, thought so... just split it up into several 500 MB files then.

On Debian/Ubuntu you can:

split --bytes=500M /path/to/catchemall.json

sacdallago commented 8 years ago

@gyachdav won't this break the data formatting? @semioniy you might need to look at the ends and beginnings of the files, making sure objects are closed rather than truncated and that the array form is kept consistent, i.e. turn:

END OF FILE 1

...}, {"timestamp": "UT

BEGINNING OF FILE 2

C-xxx", "some attribute": "some value",..},...

TO

...}]
[{"timestamp": "UTC-xxx", "some attribute": "some value",..},...
gyachdav commented 8 years ago

Or you can just put instructions in the readme for how to rebuild the files (using cat). Either way works. Let's get it done today.

goldbergtatyana commented 8 years ago

Yep, I was opening the file with less (http://ss64.com/bash/less.html) and more (http://ss64.com/bash/more.html). It looked alright to me. Thanks @semioniy for uploading 👍

MatthiasBaur commented 8 years ago

I ran our classifier on 10k data points and got ~20%. The small drop is probably due to an unequal distribution of the data across the day (the last 10k are spread across only 5 hours). I will recheck this with a better distribution; however, I cannot give you a deadline for it because of an exam on Wednesday morning.

Some other relevant commands when working with this file:

# sort in descending order
# (relevant after preprocessing, e.g. arranging the date in the first position, extracting lat/lon)
sort -r file.txt
# get the first 10000 entries
head -10000 file.txt > output.txt

And if you want to work with the files in JS, either use streaming ops or split the file (split -b 100M file.txt split_file.txt) and then cat the pieces back together later.

goldbergtatyana commented 8 years ago

That's good news @MatthiasBaur, good luck on Wednesday!

semioniy commented 8 years ago

People are asking serious questions on Kaggle, questions I have no answers to. I need help. @bensLine, maybe you know why it is the way it is?

P.S. The data dump has been uploaded.

sacdallago commented 8 years ago

@MatthiasBaur maybe also try leaving one of the data sources out? We don't know how accurate each data source is... So do the 10K sampling, and then filter by data source?

Just proposing :)

sacdallago commented 8 years ago

@goldbergtatyana we are getting noticed on Kaggle :) Reddit all over again :D

goldbergtatyana commented 8 years ago

Oh wow, what activity on Kaggle: 1,856 views · 339 downloads · 5 kernels · 3 topics

@PokemonGoers/predictpokemon-2 how come there is no Monday as a possible value of the feature appearedDayOfWeek, and instead there is a day called "dummy_day"?

goldbergtatyana commented 8 years ago

@semioniy I expanded the description of the dataset in https://docs.google.com/document/d/1dIKvxOshOCnu2by5gIQR3rceUAs_4cfhg3TDxS9bPnM/edit

Please have a look, replace the red text (this is the info I don't know or am unsure about), and copy/paste the description to the Kaggle page. Having the data documented as precisely as possible will help avoid many questions upfront :)

@MatthiasBaur @gyachdav @sacdallago please feel free to also review the file

goldbergtatyana commented 8 years ago

One of the questions from Kaggle is indeed very interesting, and I also cannot answer it. It goes:

Hmm. I see that the 9th line in the arff file has pokemonid 35, but class is 19. Next line is just the other way around. And these are just the first of the discrepancies... ?

@PokemonGoers/predictpokemon-2 this should not be the case, of course! Can one of you guys check what's going on in lines 9 and 10 of the arff file?

semioniy commented 8 years ago

@goldbergtatyana it's not only these lines. This mistake repeats multiple times throughout the dataset.

goldbergtatyana commented 8 years ago

Hopefully the question of Pokemon IDs and class IDs not matching in some lines of the arff file will be answered very soon, @semioniy 🙏

bensLine commented 8 years ago

I'll check it tomorrow. Concerning the dummy day, this is an unresolved bug.

bensLine commented 8 years ago

I added PR #65 for the bug. However, I'm not yet sure whether it was the only cause, but it is definitely one of them. If so, all Pokemon IDs after the first wrong label should be wrong in the data set. I'll check the Kaggle file to hopefully confirm that :p