campioneio / data.campione.io


Interesting datasets #1

Open frantzmiccoli opened 9 years ago

frantzmiccoli commented 9 years ago

Let's list here all the datasets that might be interesting to add before the first release of the project. If you can, please elaborate beyond simply giving a URL. Why do you think this dataset is interesting? E.g.: interesting topic, sparse data for a subject where it is uncommon, large dataset for scaling, easily understandable problem...

Please ensure that the dataset is compliant with the principles mentioned in CONTRIBUTE.md:

frantzmiccoli commented 9 years ago

The Iris dataset, mostly for its historical value. http://archive.ics.uci.edu/ml/datasets/Iris

frantzmiccoli commented 9 years ago

The German credit scoring dataset, again a classic https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

frantzmiccoli commented 9 years ago

The Yelp dataset, which has been trending and provides recent and complete data: http://www.yelp.com/dataset_challenge. There is no need for a login / password, but you have to fill in a form... some discussion may be required about that.

frantzmiccoli commented 9 years ago

The Million Song Dataset, which clearly looks awesome (a lot of content, with varied and clearly described fields): http://labrosa.ee.columbia.edu/millionsong/

albahnsen commented 9 years ago

Why not save the datasets directly in a compatible format?

Look in here: I follow the same logic as sklearn to import datasets, which are stored as compressed files for example, and of course there is a description of each set.
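To make the idea concrete, here is a minimal sketch of a loader in that spirit, using only the Python standard library. It is not the code referred to above and not sklearn's own implementation: the `Dataset` container, the `load_dataset` function, the file layout (a gzip-compressed CSV with a header row and the target in the last column) and the separate description file are all assumptions made for illustration.

```python
import csv
import gzip
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Dataset:
    """Hypothetical container, loosely modelled on what sklearn loaders return."""
    data: list           # feature rows, each a list of floats
    target: list         # one label (string) per row
    feature_names: list  # column names taken from the CSV header
    DESCR: str           # free-text description of the dataset


def load_dataset(csv_gz_path, descr_path):
    """Load a gzip-compressed CSV whose last column is assumed to be the target."""
    with gzip.open(csv_gz_path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)                 # first row holds the column names
        rows = [row for row in reader if row]  # skip empty lines
    data = [[float(value) for value in row[:-1]] for row in rows]
    target = [row[-1] for row in rows]
    descr = Path(descr_path).read_text(encoding="utf-8")
    return Dataset(data=data, target=target,
                   feature_names=header[:-1], DESCR=descr)
```

With files laid out this way, `load_dataset("iris.csv.gz", "iris.rst")` (hypothetical file names) would return the features, labels, column names and description in one object, so every dataset can be consumed the same way.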

albahnsen commented 9 years ago

Similar to the German credit scoring data:

http://sede.neurotech.com.br:443/PAKDD2009/

The data is available here:

http://cse652fall2011.wikispaces.com/file/view/Training%20Data.txt/264287924/Training%20Data.txt
http://cse652fall2011.wikispaces.com/file/view/Testing%20Data.txt/264287998/Testing%20Data.txt

albahnsen commented 9 years ago

What about Kaggle sets? They require registration but are freely available.

albahnsen commented 9 years ago

I used this one for one of my papers http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Nevertheless, I had to do quite a lot of preprocessing.

frantzmiccoli commented 9 years ago

@albahnsen I have created #3 to discuss the compatibility question. I think this will be an interesting topic for later.

About Kaggle: I have found some datasets that do not require registration. My question is rather whether they are removed after the challenge. I have found data from a few past challenges, but I don't know the rule, if there is one.

Thanks for the datasets you provided; I will start creating the related issues.

frantzmiccoli commented 9 years ago

@albahnsen I tried to take a look at http://sede.neurotech.com.br:443/PAKDD2009/ but unfortunately it is down.