Possible datasets

k8hertweck commented 4 years ago

Data specifications:

previously published, and available online in a format that can be downloaded (modeling data provenance)
associated published paper, useful for framing research questions/conclusions, and allowing reproducibility practices
allows application of multiple types of ML algorithms

k8hertweck commented 3 years ago

Glaucoma diagnosis https://datadryad.org/stash/dataset/doi:10.5061/dryad.q6ft5
Triple negative breast cancer imaging https://datadryad.org/stash/dataset/doi:10.5061/dryad.32765
Immunohistochemical typing of adenocarcinomas https://datadryad.org/stash/dataset/doi:10.5061/dryad.g8h71
Computer-aided diagnosis in breast and prostate cancer https://datadryad.org/stash/dataset/doi:10.5061/dryad.m5n98

Other sources for data:

k8hertweck commented 3 years ago

Note: all datasets in the comment above are from Data Dryad, which is a repository for data published with scientific manuscripts. Dryad doesn't set expectations about how the data are structured for archival, so some datasets are more accessible (e.g., easily downloadable and ready to analyze) than others. If we decide to use a dataset from there (or elsewhere) that isn't quite ready to analyze (e.g., is in an Excel spreadsheet), we can do some basic data conversion/cleaning and add those data to our repo (with a README, of course).
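As a rough illustration of the "basic data conversion/cleaning" step, here is a minimal sketch using only the Python standard library. It assumes the spreadsheet has already been exported to CSV (reading .xls/.xlsx files directly would need a third-party library such as pandas or openpyxl). The helper names `clean_header` and `clean_csv`, and the example columns and values, are hypothetical and not taken from any of the datasets listed here.

```python
import csv
import io
import re


def clean_header(name: str) -> str:
    """Lower-case a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_csv(raw_text: str) -> str:
    """Return tidied CSV text: normalized headers, trimmed cells, no blank rows."""
    reader = csv.reader(io.StringIO(raw_text))
    rows = [[cell.strip() for cell in row] for row in reader]
    rows = [row for row in rows if any(row)]  # drop rows that are entirely empty
    if not rows:
        return ""
    rows[0] = [clean_header(name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical export; these column names and values are made up for illustration.
example = (
    "Patient ID , Age (years),Diagnosis\n"
    " 001 , 54 ,glaucoma\n"
    ",,\n"
    " 002 , 61 , healthy \n"
)
cleaned = clean_csv(example)
print(cleaned)
```

A cleaned file produced this way could then be committed to the repo next to a README describing where the original data came from and what was changed.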

k8hertweck commented 3 years ago

For context, here are the datasets used for the machine learning courses:

Concepts in Machine Learning: synthetic cardiovascular risk dataset (clinical and genomic data) https://github.com/laderast/cvdRiskData
Intermediate Python and R Machine Learning: Glaucoma diagnosis (clinical and demographic data) https://datadryad.org/stash/dataset/doi:10.5061/dryad.q6ft5, TBD assay dataset (laboratory experimental results)

Potential datasets for this project:

classification of pediatric lymphoblastic leukemia (gene expression and limited clinical data) https://www.stjuderesearch.org/site/data/ALL1/
The Child Health and Development Studies Original Cohort (CHDS OC, demographic and clinical data) https://dash.nichd.nih.gov/study/8
ISIC skin lesions (images, clinical, demographic data) https://www.isic-archive.com/#!/onlyHeaderTop/gallery?filter=%5B%5D (see also this repo, which assists with downloading via Python: https://github.com/GalAvineri/ISIC-Archive-Downloader)
bone marrow transplant in children (clinical) https://archive.ics.uci.edu/ml/datasets/Bone+marrow+transplant%3A+children
pan-cancer gene expression (genomic) https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq#

Assessment criteria

allows variety of data analysis questions to be addressed (EDA, data cleaning, ML algorithms)
research questions are interesting and motivating to biomedical researchers (including assessment of health disparities or other types of social good questions)
associated with previously published research (data analysis documentation allows assessing reproducibility)
data provenance and documentation (e.g., data dictionary, metadata) are robust and complete
downloadable in a readily-usable format (e.g., preferably csv, but could also be xls)
data are of a type still relevant to current experimental methods

Rate each criterion on the following scale: 1 = slightly or not accurate, 2 = moderately accurate, 3 = very accurate

Feel free to include any additional notes about the dataset you think would be relevant.

MatthewCodes commented 3 years ago

Here is my ranking after viewing all of the datasets.

Dataset 1) 4th Choice 2, 2, 1, 3, 3, 3

Dataset 2) 5th Choice 1, 1, 1, 1, 1, 1

Dataset 3) 1st Choice 3, 3, 1, 3, 3, 3

Dataset 4) 3rd Choice 3, 3, 3, 1, 1, 3

Dataset 5) 2nd Choice 3, 1, 2, 3, 3, 3

Anmol-Srivastava commented 3 years ago

D1] 2, 2, 3, 2, 2, 2
D2] 2, 2, 3, 2, 1, 2
D3] 3, 3, 2, 3, 3, 3
D4] 1, 3, 1, 3, 3, 2
D5] 1, 3, 1, 3, 3, 2

I'm particularly excited about the skin lesions data (both because it is an image set and because interesting analyses may come out of it), and also the child health and development data (notably because it is very large and can lead in diverse directions). Downloading many of these options may prove a little cumbersome or tricky, but not enough to disqualify any of them immediately. The remaining sets are lower priority for me because of their smaller sample sizes and less associated research.
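One possible way to combine the two sets of ratings above is an unweighted sum per dataset. This is only a sketch: the thread does not say how the ratings were combined, so the equal weighting of criteria and reviewers is an assumption, as is the function name `total_scores`. The dataset numbers are kept as posted rather than mapped to dataset names.

```python
# Ratings copied from the two comments above: six criteria per dataset, each 1-3.
ratings = {
    "MatthewCodes": {
        1: [2, 2, 1, 3, 3, 3],
        2: [1, 1, 1, 1, 1, 1],
        3: [3, 3, 1, 3, 3, 3],
        4: [3, 3, 3, 1, 1, 3],
        5: [3, 1, 2, 3, 3, 3],
    },
    "Anmol-Srivastava": {
        1: [2, 2, 3, 2, 2, 2],
        2: [2, 2, 3, 2, 1, 2],
        3: [3, 3, 2, 3, 3, 3],
        4: [1, 3, 1, 3, 3, 2],
        5: [1, 3, 1, 3, 3, 2],
    },
}


def total_scores(all_ratings):
    """Sum every reviewer's scores per dataset (assumed equal weighting)."""
    totals = {}
    for per_dataset in all_ratings.values():
        for dataset, scores in per_dataset.items():
            # Check each entry against the scale: six criteria, values 1 to 3.
            if len(scores) != 6 or not all(1 <= s <= 3 for s in scores):
                raise ValueError(f"unexpected ratings for dataset {dataset}")
            totals[dataset] = totals.get(dataset, 0) + sum(scores)
    return totals


totals = total_scores(ratings)
# Highest total first; ties are broken by dataset number.
ranked = sorted(totals, key=lambda d: (-totals[d], d))
for dataset in ranked:
    print(f"Dataset {dataset}: {totals[dataset]}")
```

With these numbers, datasets 3 and 5 have the highest totals. A weighted scheme (for example, giving extra weight to data provenance) would be an equally reasonable choice.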

k8hertweck commented 3 years ago

Selected datasets:

ISIC skin lesions (images, clinical, demographic data) https://www.isic-archive.com/#!/onlyHeaderTop/gallery?filter=%5B%5D (see also this repo, which assists with downloading via Python: https://github.com/GalAvineri/ISIC-Archive-Downloader)
pan-cancer gene expression (genomic) https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq#

k8hertweck commented 3 years ago

Adding this for future reference:

Clinical data and metadata: https://datadryad.org/stash/dataset/doi:10.5061/dryad.r36cn90

fredhutchio / practice-machine-learning

Possible datasets #1
