datahubio / datahub-v2-pm

Project management (issues only)

Machine Learning Datasets - Part I #88

Closed rufuspollock closed 6 years ago

rufuspollock commented 6 years ago

One large potential user group for DataHub is people working in data science and machine learning.

Question: is there a difference between machine learning and data science? Is ML only about neural net stuff, or does it include classic predictive analytics ranging from regression to random forests? My sense is that we can go with ML even though what we are talking about is a bit broader.

As someone starting to learn data science (and machine learning) I want good ready-to-use sample datasets I can use for practice so that I can focus on practising analytics rather than data wrangling.

As a more advanced student of machine learning I want to get a wide range of well-prepared datasets (including well-known ones) that I can practise on so that I can improve and focus my efforts on learning, not data acquisition.

As a machine learning practitioner I want to find up-to-date datasets which I can use for implementing the newest classifiers so that I can contribute to the machine learning community or create projects for the company I work in.

Please add to these

Acceptance criteria

Tasks

Analysis

svetozarstojkovic commented 6 years ago

If an OpenML dataset has its source on UCI, I am using the UCI dataset.

The datasets that will be put into DataHub are:

  1. https://archive.ics.uci.edu/ml/datasets/seismic-bumps - Seismic bumps
     I paid most attention to medical datasets...
  2. https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29 - Cervical cancer
  3. https://archive.ics.uci.edu/ml/datasets/Fertility - Fertility
  4. https://www.openml.org/d/13 - breast cancer data
  5. https://www.openml.org/d/171 - primary tumor
  6. https://www.openml.org/d/10 - lymph
  7. https://www.openml.org/d/35 - dermatology
  8. https://www.openml.org/d/55 - hepatitis

The next datasets are ordered by "most runs" on OpenML...

  1. https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data) - German Credit Data Set
  2. https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center - Blood Transfusion
  3. https://archive.ics.uci.edu/ml/datasets/MONK's+Problems - MONK's Problems Data Set
  4. https://archive.ics.uci.edu/ml/datasets/Tic-Tac-Toe+Endgame - Tic-Tac-Toe Endgame Data Set
  5. https://www.openml.org/d/1471 - eeg-eye-state
  6. https://www.openml.org/d/40536 - speed dating

GitHub repos:

  1. https://github.com/datasets/seismic-bumps
  2. https://github.com/datasets/cervical-cancer
  3. https://github.com/datasets/fertility
  4. https://github.com/datasets/breast-cancer
  5. https://github.com/datasets/primary-tumor
  6. https://github.com/datasets/lymph
  7. https://github.com/datasets/dermatology
  8. https://github.com/datasets/hepatitis
  9. https://github.com/datasets/eeg-eye-state
  10. https://github.com/datasets/speed-dating

Datahub user:

rufuspollock commented 6 years ago

@svetozarstojkovic could you give a brief reason so that people know why (esp. as we suggested going with OpenML by default :wink:)? I'm very happy you went with this, but saying why helps others who might work on this.

svetozarstojkovic commented 6 years ago

Most of the datasets I found on OpenML had their source on UCI, so I just went to UCI and used their datasets. For those which didn't have a UCI source, I am using OpenML.
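
A rough sketch of that selection rule, in case it helps others who work on this. The `Candidate` record, its field names, and `pick_source` are hypothetical and only illustrate the rule; they are not part of any DataHub or OpenML tooling. The OpenML URL for Fertility is a placeholder because its id is not given in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical record for a dataset found while browsing OpenML."""
    name: str
    openml_url: str
    uci_url: Optional[str] = None  # set only when OpenML lists a UCI source


def pick_source(candidate: Candidate) -> str:
    """Prefer the UCI original when one exists, otherwise fall back to OpenML."""
    return candidate.uci_url or candidate.openml_url


candidates = [
    # Placeholder OpenML URL: the real id is not stated in this thread.
    Candidate(
        name="fertility",
        openml_url="https://www.openml.org/d/<id>",
        uci_url="https://archive.ics.uci.edu/ml/datasets/Fertility",
    ),
    # No UCI source recorded, so the OpenML entry is used.
    Candidate(name="lymph", openml_url="https://www.openml.org/d/10"),
]

for c in candidates:
    print(f"{c.name}: {pick_source(c)}")
```

The fallback is a single `or`, so a dataset with no UCI entry needs no special handling.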

zelima commented 6 years ago

@rufuspollock anything remaining on this except blog post, can we close? @Mikanebu can you take on blog post?

Mikanebu commented 6 years ago

@zelima ok

zelima commented 6 years ago

@Mikanebu any progress here?

Mikanebu commented 6 years ago

@zelima I have not started writing the blog post yet. I will add it to my next24

rufuspollock commented 6 years ago

Is this now a DUPLICATE of https://github.com/datahq/datahub-qa/issues/33?

zelima commented 6 years ago

FIXED/DUPLICATE. I think that, as Part I, this is done. The blog post will come with Part II if one is needed. As part of this issue, we've got the post about ARFF here: https://datahub.io/blog/attribute-relation-file-format-arff