new dataset request: Add drug perturbation data from CMAP

PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.


new dataset request: Add drug perturbation data from CMAP #47

Closed BelindaBGarana closed 5 months ago

BelindaBGarana commented 7 months ago

It would be great if the following data set could be added to IMPROVE:

CMAP drug perturbation data (GEO accession GSE101406: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE101406, from https://www.sciencedirect.com/science/article/pii/S2405471218301091?via%3Dihub and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5990023/)

Thanks in advance!

sgosline commented 7 months ago

@BelindaBGarana can you maybe provide a desired format for the data? Currently we have the input data, a drug file, and an 'experiments' table that maps the dose response. What would the 'experiments' file data be in your approach?

BelindaBGarana commented 7 months ago

@sgosline It would be great if additional .csv.gz data files could be added, one for each type of drug perturbation data (L1000, P100, and GCP in this case). Each file could contain columns for the improve_sample_id, improve_drug_id, entrez_id, level 2 data (e.g., L1000_level_2), level 3 data, and level 4 data. My understanding is that this data was collected at one concentration and one time point for each drug.
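For illustration, one such per-assay file could be written and read back roughly as follows. This is only a sketch of the layout described above, not an existing coderdata function: the helper names are hypothetical, the column names `L1000_level_3` and `L1000_level_4` are assumed by analogy with `L1000_level_2`, and any values passed in are placeholders rather than real CMAP measurements.

```python
import csv
import gzip
import io

# Columns named in the comment above. The level 3 and level 4 column
# names are assumptions modeled on "L1000_level_2".
COLUMNS = [
    "improve_sample_id",
    "improve_drug_id",
    "entrez_id",
    "L1000_level_2",
    "L1000_level_3",
    "L1000_level_4",
]


def write_perturbation_gz(rows, fileobj):
    """Write rows (dicts keyed by COLUMNS) as gzip-compressed CSV to fileobj."""
    with gzip.open(fileobj, mode="wt", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def read_perturbation_gz(fileobj):
    """Read the gzip-compressed CSV back into a list of dicts (values are strings)."""
    fileobj.seek(0)
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

Passing an `io.BytesIO()` buffer keeps the example self-contained; a file opened in binary mode (for example a hypothetical `L1000.csv.gz`) would work the same way.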

sgosline commented 7 months ago

I'm not sure what "perturbation" data is, but do we need separate files, or can we append everything into a single file with distinct 'source' and 'study' columns that differentiate between the acronyms (akin to our current experiments.csv file)?
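As a rough sketch of the single-file option, per-assay rows could be stacked and tagged with the two extra columns. The function name is hypothetical, and mapping the assay acronym to 'study' and the GEO accession to 'source' is an assumption; the thread does not say which value goes in which column.

```python
def combine_assays(tables, source):
    """Stack per-assay row lists into one list of rows.

    tables maps an assay name (assumed here to fill the 'study' column)
    to a list of row dicts; source is copied into every row.
    """
    combined = []
    for study, rows in tables.items():
        for row in rows:
            # Copy the row so the caller's data is left unchanged.
            combined.append({**row, "source": source, "study": study})
    return combined
```

For example, `combine_assays({"L1000": l1000_rows, "P100": p100_rows, "GCP": gcp_rows}, source="GSE101406")` would return one list that can be written out as a single table.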

sgosline commented 6 months ago

Proposed schema (perturbations.csv):

entrez_id
improve_sample_id
data_value: [number]
data_type: [transcriptomics]
perturbation: [improve_drug_id]
perturbation_type: [drug, gene]
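A minimal sketch of what rows in this long format might look like, assuming one row per gene, sample, and perturbation. The helper names and the validation rule are illustrative only, and the example values are placeholders, not real data.

```python
import csv
import io

# Field order taken from the proposed schema above.
FIELDS = [
    "entrez_id",
    "improve_sample_id",
    "data_value",
    "data_type",
    "perturbation",
    "perturbation_type",
]

# Allowed values listed in the proposal for perturbation_type.
PERTURBATION_TYPES = {"drug", "gene"}


def make_record(entrez_id, improve_sample_id, data_value,
                data_type, perturbation, perturbation_type):
    """Build one perturbations.csv row, rejecting unknown perturbation types."""
    if perturbation_type not in PERTURBATION_TYPES:
        raise ValueError(f"unknown perturbation_type: {perturbation_type!r}")
    return {
        "entrez_id": entrez_id,
        "improve_sample_id": improve_sample_id,
        "data_value": float(data_value),  # the schema asks for a number
        "data_type": data_type,
        "perturbation": perturbation,
        "perturbation_type": perturbation_type,
    }


def records_to_csv(records):
    """Serialize records to CSV text with the schema's column order."""
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

A drug perturbation row would then be built with something like `make_record(7157, 1, 0.5, "transcriptomics", "drug_placeholder_1", "drug")`, where the drug identifier is a made-up stand-in for an improve_drug_id.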