Open sgosline opened 3 months ago
I could build this into its own function or class, however I think this could get redundant / confusing for users as this can actually already be done using the following commands:
import coderdata as cd
depmap = cd.DatasetLoader('depmap')
mpnst = cd.DatasetLoader('MPNST')
cptac = cd.DatasetLoader('cptac')
beataml = cd.DatasetLoader('beataml')
hcmi = cd.DatasetLoader('hcmi')
joined_data = cd.join_datasets(beataml,hcmi,cptac,depmap,mpnst)
joined_data.transcriptomics # all transcriptomics data
joined_data.proteomics # all proteomics data
joined_data.drugs # all drug data
joined_data.samples # all sample data
# ... etc
Yes, but this assumes that people understand (and care about) the shorthand dataset names. For deep learning, they just need to know what type of data it is and how much of it there is. How about you rename DatasetLoader to something like data_by_source and create a new function called data_by_type that accepts 'transcriptomics', 'proteomics', 'dose_response', 'perturbation', 'copy_number', 'mutations', etc.? They can exist side by side. You can add sources and data_types functions as well so that users can determine what to choose from. Then the above calls just become:
import coderdata as cd
sources = cd.sources()  # proposed function, does not exist yet
ds = {}
for so in sources:
    ds[so] = cd.data_by_source(so)
joined_data = cd.join_data_by_source(ds.values())
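A rough sketch of how the proposed sources and data_types discovery functions might behave, for discussion only. Nothing here is the package's actual API: the `_REGISTRY` dictionary, its row counts, and the `sources_with` helper are made up for illustration, and only the function names `sources` and `data_types` come from the suggestion above.

```python
# Hypothetical in-memory registry standing in for the real package data.
# Keys are source names from this thread; values map data type -> row count
# (the counts are invented for illustration).
_REGISTRY = {
    "beataml": {"transcriptomics": 3, "proteomics": 2},
    "cptac": {"transcriptomics": 4, "proteomics": 5, "copy_number": 1},
    "depmap": {"transcriptomics": 6, "mutations": 2},
}


def sources():
    """Return the available source names, sorted alphabetically."""
    return sorted(_REGISTRY)


def data_types():
    """Return every data type offered by at least one source, sorted."""
    return sorted({dtype for tables in _REGISTRY.values() for dtype in tables})


def sources_with(data_type):
    """Hypothetical helper: list the sources that provide a given data type."""
    return [name for name in sources() if data_type in _REGISTRY[name]]
```

With something like this, a user could call `data_types()` to see what exists and `sources_with("proteomics")` to see where it comes from, without knowing the shorthand dataset names up front.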
Do you have a document describing the general users and use cases of the package?
Okay, will do. There is a general usage page in the docs, but I haven't had a chance to update it with the use cases. I'd like to link our tutorials directly to the docs but haven't had the time to do so yet. It takes quite a few extra steps with the CI blocked.
Usage and use cases are not the same thing - use cases are the start of a design document that motivate the choices made in software development. Generally a good thing to have on hand to make detailed design decisions.
I didn't know about that - I'll add that as an issue.
No need, it's not really a thing that can be fixed in the code base, just something that'll need to be done ahead of the paper/pub.
Shouldn't we keep track of it, as we will eventually need to add it to the docs?
Docs are for end users; they do not need to know how or why the software was designed as it was. Use cases/specifications are for developers so they can make informed implementation choices. I believe there are some GitHub features to incorporate the full software engineering process, but I think that ship has sailed at this point :)
It'd be nice to get ALL data of a particular data type (transcriptomics, proteomics, etc.) regardless of source. Can you add a function to do this? We can also filter by source after the fact.
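To make the request concrete, here is a minimal sketch of the behavior being asked for, assuming each source exposes its tables keyed by data type. It uses plain lists of dicts as a stand-in for the package's real tables; the `data_by_type` signature, the `datasets` layout, and all values are hypothetical.

```python
def data_by_type(datasets, data_type):
    """Collect rows of one data type from every source.

    Each returned row is a copy tagged with a "source" key so that
    callers can filter by source after the fact.
    """
    combined = []
    for source, tables in datasets.items():
        # Sources that lack this data type simply contribute nothing.
        for row in tables.get(data_type, []):
            combined.append({**row, "source": source})
    return combined


# Toy data, made up for illustration only.
datasets = {
    "beataml": {
        "transcriptomics": [{"sample": "s1", "gene": "TP53", "value": 1.2}],
    },
    "cptac": {
        "transcriptomics": [{"sample": "s2", "gene": "TP53", "value": 0.7}],
        "proteomics": [{"sample": "s2", "gene": "TP53", "value": 0.1}],
    },
}

# All transcriptomics rows, regardless of source.
all_tx = data_by_type(datasets, "transcriptomics")

# Filtering by source afterwards is an ordinary comprehension.
cptac_tx = [row for row in all_tx if row["source"] == "cptac"]
```

Tagging each row with its source keeps the by-type view lossless, so the filter-by-source step stays a one-liner on the caller's side.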