PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

Dataset consistency #229

Open ymahlich opened 1 week ago

ymahlich commented 1 week ago

While downloading the data from figshare, I realized that some of the dataset files are tab-separated, some comma-separated, some compressed, and some uncompressed.

Is there a specific reason for that? I understand that everything is handled internally, so as long as I am interacting with the data directly through coderdata objects it might not matter, but maybe we want to be consistent about this?

As far as I can tell, the current "schema" for which data type uses which format is as follows:

data type csv csv.gz tsv.gz
copy_number x
drug_descriptor x
drugs x
experiments x
mutations x
proteomics x
samples x
transcriptomics x

sgosline commented 1 week ago

Yes! First off, pandas can read any of these formats, so functionally it doesn't matter. I default to csv unless the file has drug information, which requires tabs (because drug descriptors and names have commas and quotes in them). We could probably gzip the samples files for consistency.
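For anyone reading the figshare files directly rather than through coderdata objects, here is a minimal sketch of handling the mixed formats with pandas. The helper name `read_table` is hypothetical (it is not part of coderdata), and it assumes the file extension reliably reflects the delimiter. pandas infers gzip compression from the `.gz` suffix by default, so only the separator needs to be chosen.

```python
from pathlib import Path

import pandas as pd


def read_table(path):
    """Hypothetical helper: read a .csv, .csv.gz, .tsv, or .tsv.gz file.

    Assumes the extension matches the delimiter actually used in the file.
    """
    suffixes = Path(path).suffixes  # e.g. ['.tsv', '.gz']
    sep = "\t" if ".tsv" in suffixes else ","
    # compression="infer" is the pandas default; it is spelled out here
    # to show that gzipped and plain files go through the same call.
    return pd.read_csv(path, sep=sep, compression="infer")
```

Calling `read_table` on a downloaded samples file or a drugs file would then return a DataFrame in both cases, without the caller needing to know which format the file uses.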

jjacobson95 commented 5 days ago

I like that we can preview the samples files on figshare when not gzipped, but I don't have a strong opinion either way.

This could very easily be changed in lines 321 and 323 of build_all.py, during the validation step.
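As a rough illustration of what gzipping the samples files would involve, here is a standard-library sketch. It is not the code in build_all.py; the function name `gzip_file` and its `keep_original` parameter are assumptions made for this example.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src, keep_original=True):
    """Hypothetical helper: write <src>.gz next to src and return its path."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    # Stream the bytes so large files are not loaded into memory at once.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    if not keep_original:
        src.unlink()
    return dst
```

Keeping the uncompressed copy by default would preserve the figshare preview mentioned above, at the cost of uploading two files per dataset.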