fredhutchio / practice-machine-learning

suggestions to practice machine learning for biomedical applications

Possible datasets #1

Closed: k8hertweck closed this issue 3 years ago

k8hertweck commented 4 years ago

Data specifications:

k8hertweck commented 3 years ago

Other sources for data:

k8hertweck commented 3 years ago

Note: all datasets in the comment above are from Data Dryad, a repository for data published with scientific manuscripts. Dryad doesn't set expectations for how archived data are structured, so some datasets are more accessible (e.g., easily downloadable and ready to analyze) than others. If we decide to use a dataset from there (or elsewhere) that isn't quite ready to analyze (e.g., it is in an Excel spreadsheet), we can do some basic data conversion/cleaning and add those data to our repo (with a README, of course).
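As a rough illustration of what "basic data conversion/cleaning" could look like, here is a minimal sketch using only the Python standard library. It assumes the spreadsheet has already been exported as tab-delimited text (the standard library cannot read Excel files directly); the function names, column names, and sample values are all hypothetical and not taken from any of the datasets discussed here.

```python
import csv
import io
import re


def clean_header(name: str) -> str:
    """Lowercase a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_table(raw_text: str, delimiter: str = "\t") -> str:
    """Turn delimited text exported from a spreadsheet into a tidy CSV string."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    # Strip stray whitespace around every cell.
    rows = [[cell.strip() for cell in row] for row in reader]
    # Drop rows in which every cell is empty.
    rows = [row for row in rows if any(row)]
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([clean_header(h) for h in rows[0]])
    writer.writerows(rows[1:])
    return out.getvalue()


# Hypothetical export with messy headers, padding, and a blank row.
raw = "Patient ID\tAge (years)\t Diagnosis \n\t\t\n001\t 7\tALL\n"
print(clean_table(raw))
```

The cleaned text could then be saved as a csv file in the repo next to a README describing where the original data came from and what was changed.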

k8hertweck commented 3 years ago

For context, here are the datasets used for the machine learning courses:

Potential datasets for this project:

  1. classification of pediatric lymphoblastic leukemia (gene expression and limited clinical data) https://www.stjuderesearch.org/site/data/ALL1/
  2. The Child Health and Development Studies Original Cohort (CHDS OC, demographic and clinical data) https://dash.nichd.nih.gov/study/8
  3. ISIC skin lesions (images, clinical, demographic data) https://www.isic-archive.com/#!/onlyHeaderTop/gallery?filter=%5B%5D (also see this repo, which helps download the archive with Python: https://github.com/GalAvineri/ISIC-Archive-Downloader)
  4. bone marrow transplant in children (clinical) https://archive.ics.uci.edu/ml/datasets/Bone+marrow+transplant%3A+children
  5. pan-cancer gene expression (genomic) https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq#

Assessment criteria

  1. allows variety of data analysis questions to be addressed (EDA, data cleaning, ML algorithms)
  2. research questions are interesting and motivating to biomedical researchers (including assessment of health disparities or other types of social good questions)
  3. associated with previously published research (data analysis documentation allows assessing reproducibility)
  4. data provenance and documentation (e.g., data dictionary, metadata) are robust and complete
  5. downloadable in a readily usable format (e.g., preferably CSV, but could also be XLS)
  6. data are of a type still relevant to current experimental methods

Rate each criterion on the following scale: 1 - slightly or not accurate; 2 - moderately accurate; 3 - very accurate

Feel free to include any additional notes about the dataset you think would be relevant.

MatthewCodes commented 3 years ago

Here is my ranking after viewing all of the datasets.

Dataset 1) 4th Choice 2, 2, 1, 3, 3, 3

Dataset 2) 5th Choice 1, 1, 1, 1, 1, 1

Dataset 3) 1st Choice 3, 3, 1, 3, 3, 3

Dataset 4) 3rd Choice 3, 3, 3, 1, 1, 3

Dataset 5) 2nd Choice 3, 1, 2, 3, 3, 3

Anmol-Srivastava commented 3 years ago

D1] 2, 2, 3, 2, 2, 2

D2] 2, 2, 3, 2, 1, 2

D3] 3, 3, 2, 3, 3, 3

D4] 1, 3, 1, 3, 3, 2

D5] 1, 3, 1, 3, 3, 2

I'm particularly excited about the skin lesions data (both because it is an image set and because it could lead to interesting analyses), and also the child health and development data (notably because it is very large and could lead in diverse directions). Downloading many of these options may prove a little cumbersome or tricky, but not enough to disqualify any of them immediately. After accounting for their smaller sample sizes and less associated research, the remaining sets are lower priority for me.
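For a quick side-by-side view, the two sets of ratings posted so far can be tallied with a short script. This is only a sketch: weighting all six criteria and both raters equally is an assumption, not something agreed on in this thread, and the variable and function names are made up for illustration.

```python
# Ratings transcribed from the two comments above, one score per criterion (1-6).
ratings = {
    "MatthewCodes": {
        1: [2, 2, 1, 3, 3, 3],
        2: [1, 1, 1, 1, 1, 1],
        3: [3, 3, 1, 3, 3, 3],
        4: [3, 3, 3, 1, 1, 3],
        5: [3, 1, 2, 3, 3, 3],
    },
    "Anmol-Srivastava": {
        1: [2, 2, 3, 2, 2, 2],
        2: [2, 2, 3, 2, 1, 2],
        3: [3, 3, 2, 3, 3, 3],
        4: [1, 3, 1, 3, 3, 2],
        5: [1, 3, 1, 3, 3, 2],
    },
}


def total_scores(all_ratings):
    """Sum every rater's scores per dataset (assumes equal weights)."""
    totals = {}
    for per_dataset in all_ratings.values():
        for dataset, scores in per_dataset.items():
            totals[dataset] = totals.get(dataset, 0) + sum(scores)
    return totals


totals = total_scores(ratings)
# Highest possible total: raters x criteria x top score of 3.
max_possible = len(ratings) * 6 * 3
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for dataset, score in ranked:
    print(f"Dataset {dataset}: {score} / {max_possible}")
```

Under this equal weighting, dataset 3 (ISIC skin lesions) comes out highest at 33 of 36, followed by dataset 5 at 28, datasets 1 and 4 tied at 27, and dataset 2 at 18.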

k8hertweck commented 3 years ago

Selected datasets:

k8hertweck commented 3 years ago

Adding this for future reference:

Clinical data and metadata: https://datadryad.org/stash/dataset/doi:10.5061/dryad.r36cn90