Status?
Status: still busy with the weather data :)
For the sightings data: I am creating a dump of the production DB, which has collected data over the last 3 days (almost 6 GB of data).
How is it on your end @semioniy ??
@sacdallago we now have the 50k dataset with all current features.
Oh nice! :)
Now the only question is how Kaggle works and how fast they answer me.
BTW, @gyachdav @sacdallago, when Kaggle answers me about making a challenge, should I promise some kind of prize? As I understand it, that is the point of Kaggle: finding a better method and winning a prize.
No prize. There are many challenges that are there for fame and glory.
You'll update us on the news from Kaggle, right @semioniy?
Yeah, sure. But there's not much to say though. They contacted me, asking what the project is about and what we want. I answered them and am now waiting for their reaction. The bad thing is that we only contact each other via email, so the conversation is pretty slow.
Please email Kaggle again and ask for a status update. Please put us (mentors) in cc. Thanks.
@gyachdav we emailed today, so there is probably no status update yet. When they write me next time, I'll put you in cc.
I uploaded the dataset and made the page ready, maybe they just need some time.
Thanks @semioniy. The kaggle page looks good and I especially like your description of the challenge. What I am missing though is the description of the data points and their features, i.e. what were the sources used to get the data and what are the possible values of our features?
Also please note that a new dataset should be uploaded, which is the one mined by our own team A @PokemonGoers/pokedata. The dataset is on its way through something similar to UPS in the form of our magnificent @goldbergtatyana :laughing:
Thanks @sacdallago for creating the file!
The json file with the copy of our MongoDB can now be downloaded from https://rostlab.org/~goldberg/catchemall.json
@semioniy please make a new csv file with features for the data from this file and upload it on kaggle.
@MatthiasBaur please go ahead and use this data to verify the prediction performance of the algorithm you found to provide the best results.
The file is ~2G in size and contains ~9M entries.
It can be downloaded to your linux machine directly using
wget https://rostlab.org/~goldberg/catchemall.json
Hey, @goldbergtatyana, @sacdallago, first you told me to upload a dataset with as many features as possible. But now it's either all features or the 9M dataset. If I upload the 9M dataset, it's only possible with fewer features, which basically makes it another dataset, not a new version.
P.S. I mean that I can't gather weather data for 9 million entries, and with different features it's a different dataset.
@semioniy thanks for the update. Could you upload this as an additional dataset, as is? It's not necessary to calculate the features for these, for now - at least not for Kaggle. You are totally correct: this should not replace the existing dataset, and it is not meant as a new version, simply because most of these points are not verifiable and we actually know that some are purely statistical with no relevance (though we don't know which ones).
We can sample some of the points in this data and calculate features for those internally, if needed. Additionally, this data includes a source label, which might help figure out which services are reliable and which ones are not. And I guess that's what @MatthiasBaur is having fun with ATM.
Lots of fun :)
OK, just to repeat: @semioniy please upload the new file as an additional file for sightings. Just please add descriptions for both files.
@MatthiasBaur we need to sample features for 10K randomly drawn datapoints from this file (preferably of the last day, that's what you did on the previous data set, right?), but only those features that you found to be useful for the reddit data set. Then we need to apply 10-fold CV to confirm the prediction performance of ~21% on this data set as well.
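Just to illustrate the 10-fold CV step, here is a minimal sketch in plain JS (trainModel, predict and the label field are hypothetical placeholders for whatever classifier setup is actually used):

```js
// Minimal 10-fold cross-validation sketch (illustration only).
// Assumes `data` is already in random order (the 10K points are randomly drawn)
// and that `trainModel`, `predict` and `point.label` are hypothetical placeholders.
function crossValidate(data, k) {
  var foldSize = Math.floor(data.length / k);
  var accuracies = [];

  for (var i = 0; i < k; i++) {
    var testSet = data.slice(i * foldSize, (i + 1) * foldSize);
    var trainSet = data.slice(0, i * foldSize).concat(data.slice((i + 1) * foldSize));

    var model = trainModel(trainSet);                // hypothetical training step
    var correct = testSet.filter(function (point) {
      return predict(model, point) === point.label;  // hypothetical prediction step
    }).length;

    accuracies.push(correct / testSet.length);
  }

  // the average accuracy over the folds is the number to compare with the ~21%
  return accuracies.reduce(function (a, b) { return a + b; }, 0) / k;
}
```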
btw, @MatthiasBaur and @semioniy since we cannot predict rare pokemons anyway (too little data, and they seem not to follow the appearance patterns of other pokemons), we don't need to combine them into one class, so why the extra work?
@semioniy please upload the new dataset to kaggle, including descriptions of the data sets. Let us know if you need help. Thank you!
@goldbergtatyana sorry, I had problems with the internet in my dorm, now online again. I can't download the dataset, can you upload it somewhere else? Dropbox would be perfect.
P.S. Do I understand right that I'm now responsible for the Kaggle page? People keep asking questions there :)
That's cool @semioniy, here is the link https://www.dropbox.com/s/5js9vvsgerph0pi/catchemall.json?dl=0
Hi, @goldbergtatyana, this file is not of much use, because neither WebStorm nor any other text editor can handle it. 9M entries seems to be too much. Maybe if it were 9 files with a million entries each I could convert them, but I still doubt there would be a way to merge them together without some server tricks (which I don't know). As is, I don't see much sense in uploading the file, because it's just coordinates, date and class, isn't it?
Hey @semioniy, there is no need to open the file with a text editor as long as we provide a description for the file and maybe the first ten lines of the file as a sample.
Please go ahead and upload the whole file as is. The file will then be read programmatically or with linux tools.
@semioniy
a) If using Windows: the best new feature of Windows 10 is Linux (the Windows Subsystem for Linux)
b) How to check out file contents
Nice @sacdallago, thanks a lot for the hints!
@MatthiasBaur can you confirm the same prediction performance of ~21% on randomly chosen 10K data points from the big set?
Ah, since this is a JS seminar, you can do the following too: run node from within the folder with that file (let's suppose it's called file.json):

> var t = require('file');
> console.log(t[0]); // First element
> var last_element = t[t.length - 1]; // Last element
> console.log(last_element);
> for (var i = 0; i < 10; i++) { console.log(t[i]); } // First 10 elements
> var item = t[Math.floor(Math.random()*t.length)]; // Random element
If you write a JS script that randomly draws x elements from the array (with or without replacement), you can even build your own 10K dataset.
No need to ever use a text editor :)
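Something along these lines, for example (just a sketch, assuming the whole array fits in memory; the file names are made up):

```js
// Sketch: draw 10K random elements without replacement (partial Fisher-Yates).
// Assumes the whole array fits in memory; file names are made up.
var fs = require('fs');

var t = require('./file.json');   // the full array of sightings
var n = Math.min(10000, t.length);
var pool = t.slice();             // work on a copy, keep the original order intact

for (var i = 0; i < n; i++) {
  // pick a random index among the not-yet-drawn elements and swap it to the front
  var j = i + Math.floor(Math.random() * (pool.length - i));
  var tmp = pool[i];
  pool[i] = pool[j];
  pool[j] = tmp;
}

fs.writeFileSync('sample10k.json', JSON.stringify(pool.slice(0, n)));
```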
@sacdallago thanks for the advice about node. After minor changes ('./file' instead of 'file') it worked, but only with smaller files. catchemall.json, located in the same folder, doesn't even get imported after require. I get an error:
Error: toString failed
I assume it's because of the file size, but maybe the file is corrupted. Did you or @goldbergtatyana try to open it yourself?
I'll try Sublime now and write how it went.
Update
It worked with Sublime, now I'll upload the file. BTW yes, this could not be read as an array because [ and ] were missing.
@semioniy check the upload restrictions on Kaggle
Aaand, this is what Kaggle says when uploading the file:
The zip file 'catchemall.zip' contains a file 'catchemall.json' (1.94 GB) that exceeds the max size allowed of 500.00 MB.
Yeah, thought so... just split it up into several 500 MB files then.
on debian/ubuntu you can:
split --bytes=500M /path/to/catchemall.json
@gyachdav won't this deteriorate the data formatting? @semioniy you might need to look at the ends and beginnings of the files, making sure objects are closed and not truncated and that the array form is kept consistent, i.e. turn

END OF FILE 1
...}, {"timestamp": "UT
BEGINNING OF FILE 2
C-xxx", "some attribute": "some value",..},...

into

END OF FILE 1
...}]
BEGINNING OF FILE 2
[{"timestamp": "UTC-xxx", "some attribute": "some value",..},...
Or you can just put instructions in the readme on how to rebuild the file (using cat on the parts). Either way works. Let's get it done today.
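A third option, just as a sketch: since the dump has no surrounding [ ] and seems to be one JSON object per line, a small node script could write out self-contained, valid JSON array chunks directly instead of splitting by bytes (file names and chunk size below are made up):

```js
// Sketch: stream the dump and write self-contained JSON array chunks.
// Assumes one JSON object per line; file names and chunk size are made up.
var fs = require('fs');
var readline = require('readline');

var ENTRIES_PER_CHUNK = 2000000;   // tune so each output file stays under 500 MB
var out = null;
var written = 0;
var part = 0;

function openChunk() {
  out = fs.createWriteStream('catchemall_part' + part + '.json');
  out.write('[');                  // every chunk is a valid JSON array on its own
  written = 0;
  part++;
}

function closeChunk() {
  if (out) out.end(']');
  out = null;
}

openChunk();

var rl = readline.createInterface({ input: fs.createReadStream('catchemall.json') });

rl.on('line', function (line) {
  if (!line.trim()) return;                    // skip empty lines
  if (written >= ENTRIES_PER_CHUNK) {          // current chunk is full, start a new one
    closeChunk();
    openChunk();
  }
  out.write((written > 0 ? ',' : '') + line);  // each input line is already one object
  written++;
});

rl.on('close', closeChunk);
```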
Yep, I was opening the file with less (http://ss64.com/bash/less.html) and more (http://ss64.com/bash/more.html). It looked alright to me. Thanks @semioniy for uploading!
I ran our classifier on 10k datapoints and got ~20%. The small drop is probably due to an unequal distribution of the data across the day (the last 10k are spread across 5 hours). I will recheck this with a better distribution; however, I cannot give you a deadline for this because of an exam Wednesday morning.
Some other relevant commands when working with this file:

# sort in descending order
# relevant after preprocessing (e.g. arranging the date in the first position, extracting lat/lon)
sort -r file.txt

# get the first 10000 entries
head -10000 file.txt > output.txt

And if you want to work with the files in JS, either use streaming ops or split the file:

split -b 100M file.txt split_file.txt.

And then cat them later.
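As an example of the streaming route, here is a sketch that reservoir-samples 10K random entries straight from the dump without ever loading it fully (again assuming one JSON object per line; file names are made up):

```js
// Sketch: reservoir-sample 10K random entries straight from the dump,
// so the ~2 GB file never has to fit in memory.
// Assumes one JSON object per line; file names are made up.
var fs = require('fs');
var readline = require('readline');

var SAMPLE_SIZE = 10000;
var reservoir = [];
var seen = 0;

var rl = readline.createInterface({ input: fs.createReadStream('catchemall.json') });

rl.on('line', function (line) {
  if (!line.trim()) return;
  seen++;
  if (reservoir.length < SAMPLE_SIZE) {
    reservoir.push(line);                      // fill the reservoir first
  } else {
    var j = Math.floor(Math.random() * seen);  // classic "algorithm R" replacement step
    if (j < SAMPLE_SIZE) reservoir[j] = line;
  }
});

rl.on('close', function () {
  fs.writeFileSync('random10k.json', '[' + reservoir.join(',') + ']');
  console.log('sampled ' + reservoir.length + ' of ' + seen + ' entries');
});
```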
That's good news @MatthiasBaur, good luck on Wednesday!
People are asking serious questions here on Kaggle. Questions I have no answers to. Need help. @bensLine maybe you know why it is as it is?
P.S. The data dump has been uploaded.
@MatthiasBaur maybe also try leaving one of the data sources out? We don't know how accurate each data source is... So do the 10K sampling, and then filter by data source?
Just proposing :)
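For instance, something like this on the sampled points (a sketch; the source field name is a guess, adjust it to whatever the dump actually uses):

```js
// Sketch: group the sampled points by data source so each source can be
// evaluated (or filtered out) separately. The field name `source` is a guess.
var sample = require('./random10k.json');   // hypothetical 10K sample file

var bySource = {};
sample.forEach(function (point) {
  var src = point.source || 'unknown';
  (bySource[src] = bySource[src] || []).push(point);
});

Object.keys(bySource).forEach(function (src) {
  console.log(src + ': ' + bySource[src].length + ' points');
});
```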
@goldbergtatyana we are getting noticed on kaggle :) Reddit all over again :D
Oh wow, what an activity on kaggle: 1,856 views · 339 downloads · 5 kernels · 3 topics
@PokemonGoers/predictpokemon-2 how come there is no Monday as a possible instance for the feature appearedDayOfWeek, and instead there is a day called "dummy_day"?
@semioniy I expanded the description of the dataset in https://docs.google.com/document/d/1dIKvxOshOCnu2by5gIQR3rceUAs_4cfhg3TDxS9bPnM/edit
Please have a look, replace the red text (this is the info I don't know or am unsure about) and copy/paste the description to the kaggle page. Having the data documented as precisely as possible will help avoid many questions upfront :)
@MatthiasBaur @gyachdav @sacdallago please feel free to also review the file
One of the questions from kaggle is very interesting indeed, and I also cannot answer it. It goes:
Hmm. I see that the 9th line in the arff file has pokemonid 35, but class is 19. Next line is just the other way around. And these are just the first of the discrepancies... ?
@PokemonGoers/predictpokemon-2 this should not be the case, of course! Can one of you guys check what's going on with lines 9 and 10 of the arff file?
@goldbergtatyana it's not only in these lines. This mistake repeats multiple times over the dataset.
Hopefully the question of pokemon ids and class ids not matching in some lines of the arff file will be answered very soon, @semioniy.
I'll check it tomorrow. Concerning the dummy day, this is an unresolved bug.
I added PR #65 for the bug. However, I'm not yet sure if this was the only reason, but it was definitely one. If so, all poke ids should be wrong in the data set after the first wrong label. I'll check the kaggle file to hopefully confirm that :p
Please generate a Kaggle page for our PokemonGo prediction and turn it into a challenge. For the time being, keep the challenge invite-only. Please upload your datasets onto the kaggle page.
Here is an example of a Kaggle page that was inspired by our Game of Thrones project and uses the datasets we generated.
And now to this semester's surprise challenge:
Once the page is set up, a group from Microsoft Bing's Core Relevance and Ranking team will be invited to the challenge and will try to offer their own predictions.
We have every confidence that the TUM team will come out on top!