PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

new feature: download all data by type #111

Open sgosline opened 3 months ago

sgosline commented 3 months ago

It'd be nice to get ALL data of a particular data type (transcriptomics, proteomics, etc.) regardless of source. Can you add a function to do this? We can also filter by source after the fact.

jjacobson95 commented 3 months ago

I could build this into its own function or class, but I think that would be redundant and confusing for users, since this can already be done with the following commands:

import coderdata as cd

depmap = cd.DatasetLoader('depmap')
mpnst = cd.DatasetLoader('MPNST')
cptac = cd.DatasetLoader('cptac')
beataml = cd.DatasetLoader('beataml')
hcmi = cd.DatasetLoader('hcmi')

joined_data = cd.join_datasets(beataml,hcmi,cptac,depmap,mpnst)

joined_data.transcriptomics # all transcriptomics data
joined_data.proteomics # all proteomics data
joined_data.drugs # all drug data
joined_data.samples # all sample data
# ... etc

sgosline commented 3 months ago

Yes, but this assumes that people understand (and care about) the shorthand dataset names. For deep learning, they just need to know what type of data it is and how much there is. How about you rename DatasetLoader to something like data_by_source and create a new function called data_by_type that takes a type such as 'transcriptomics', 'proteomics', 'dose_response', 'perturbation', 'copy_number', 'mutations', etc.? They can exist side by side.

You can add sources and data_types functions as well so that users can see what there is to choose from. Then the above calls just become:

import coderdata as cd

# Proposed API: none of these functions exist in coderdata yet
sources = cd.sources()
ds = {}
for so in sources:
    ds[so] = cd.data_by_source(so)
joined_data = cd.join_data_by_source(ds.values())
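
For illustration only, here is a minimal sketch of how data_by_type, sources, and data_types could sit next to data_by_source. It does not use the real coderdata package: the SourceData class, the _REGISTRY dictionary, and the toy records are hypothetical stand-ins, and an actual implementation may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class SourceData:
    """Hypothetical container: one source and its tables keyed by data type."""
    name: str
    tables: dict = field(default_factory=dict)  # data type -> list of records


# Toy registry standing in for downloaded data (placeholder values, not real data).
_REGISTRY = {
    "beataml": SourceData("beataml", {
        "transcriptomics": [{"sample": "s1", "gene": "G1", "value": 1.0}],
        "proteomics": [{"sample": "s1", "gene": "G1", "value": 2.0}],
    }),
    "hcmi": SourceData("hcmi", {
        "transcriptomics": [{"sample": "s2", "gene": "G2", "value": 3.0}],
    }),
}


def sources():
    """List the source names a user can choose from."""
    return sorted(_REGISTRY)


def data_types():
    """List every data type present in at least one source."""
    return sorted({t for src in _REGISTRY.values() for t in src.tables})


def data_by_source(source):
    """Return everything held for a single source."""
    return _REGISTRY[source]


def data_by_type(data_type):
    """Collect all records of one data type across sources, tagging each with its source."""
    if data_type not in data_types():
        raise ValueError(f"unknown data type: {data_type!r}")
    rows = []
    for name in sources():
        for record in _REGISTRY[name].tables.get(data_type, []):
            rows.append({**record, "source": name})
    return rows


# All transcriptomics records regardless of source...
all_rna = data_by_type("transcriptomics")
# ...which can still be filtered by source after the fact.
beataml_only = [r for r in all_rna if r["source"] == "beataml"]
```

Keeping a source tag on every record is what lets the by-type view and the by-source view exist side by side without duplicating the data.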

Do you have a document describing the general users and use cases of the package?

jjacobson95 commented 3 months ago

Okay, will do. There is a general usage page in the docs, but I haven't had a chance to update it with the use cases. I'd like to link our tutorials directly to the docs but haven't had the time yet; it takes quite a few extra steps with the CI blocked.

sgosline commented 3 months ago

Usage and use cases are not the same thing - use cases are the start of a design document that motivates the choices made in software development. Generally a good thing to have on hand when making detailed design decisions.

jjacobson95 commented 3 months ago

I didn't know about that - I'll add that as an issue.

sgosline commented 3 months ago

No need, it's not really a thing that can be fixed in the code base, just something that'll need to be done ahead of the paper/pub.

jjacobson95 commented 3 months ago

Shouldn't we keep track of it, as we will eventually need to add it to the docs?

sgosline commented 3 months ago

Docs are for end users; they do not need to know how or why the software was designed as it was. Use cases/specifications are for developers, so they can make informed implementation choices. I believe there are some GitHub features for incorporating the full software engineering process, but I think that ship has sailed at this point :)