Open sgosline opened 2 weeks ago
Dataset class attributes
As per 2ec2c62 the `Dataset` object should (can?) contain:
transcriptomics
mutations
copy_numbers
proteomics
experiments
combinations
drugs
genes
samples
Currently the `DataLoader` object (to be refactored into the `Dataset` object) also contains:
mirna
methylation
metabolomics
full (? - I am guessing this is a leftover from the idea of creating one "master-table" containing all other information?)
It is missing:
combinations
Do we want to keep the additional attributes (and add `combinations`) or should we remove them and add them back as needed / when we have datasets that include those data types?
The way data ingestion is currently implemented, the 'loader' checks whether the `data_type` descriptor in the file name is also an attribute of the `DataLoader` object. If it is, the loader imports the contents of the file and stores them in the object.
For example: assuming we downloaded all BeatAML files from figshare and load `beataml` (`data = DataLoader('beataml')`), the loader will find a file called `beataml_drugs.tsv.gz`. The loader then extracts from that file name that the `data_type` should be `drugs`, "sees" that there is a `drugs` attribute in `DataLoader`, and therefore imports the data file and stores it in `DataLoader.drugs`.
The problem with that is that as of v.0.1.4 we also have files like `beataml_drug_descriptors.tsv.gz`, which should be imported into `DataLoader.drug_descriptor` (I assume). That attribute doesn't exist as of now, and the file is therefore NOT imported. Is this an oversight? Was that something that @jjacobson95 was planning on implementing but hadn't gotten to?
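To make the matching behaviour concrete, here is a minimal sketch of the file-name check described above. It is not the actual coderdata code: `extract_data_type`, `match_files` and `KNOWN_ATTRIBUTES` are hypothetical names, and the assumption that the data type is whatever sits between the `<dataset>_` prefix and the first `.` is mine.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Attribute names as listed in this issue for the current DataLoader
# (note that `combinations` is absent). Hypothetical constant, for illustration.
KNOWN_ATTRIBUTES: Set[str] = {
    "transcriptomics", "mutations", "copy_numbers", "proteomics",
    "experiments", "drugs", "genes", "samples",
    "mirna", "methylation", "metabolomics", "full",
}


def extract_data_type(filename: str, dataset: str) -> Optional[str]:
    """Return the data type encoded in '<dataset>_<data_type>.<ext>', else None."""
    prefix = dataset + "_"
    if not filename.startswith(prefix):
        return None
    # Strip the dataset prefix, then drop everything from the first '.' on
    # (for example the '.tsv.gz' extension).
    return filename[len(prefix):].split(".", 1)[0]


def match_files(
    filenames: Iterable[str],
    dataset: str,
    attributes: Set[str] = KNOWN_ATTRIBUTES,
) -> Tuple[Dict[str, str], List[str]]:
    """Split file names into those that map to a known attribute and the rest."""
    loaded: Dict[str, str] = {}
    skipped: List[str] = []
    for name in filenames:
        data_type = extract_data_type(name, dataset)
        if data_type in attributes:
            loaded[data_type] = name   # would be read and stored on the object
        else:
            skipped.append(name)       # silently ignored by the current logic
    return loaded, skipped


files = ["beataml_drugs.tsv.gz", "beataml_drug_descriptors.tsv.gz"]
loaded, skipped = match_files(files, "beataml")
```

With the two BeatAML file names mentioned above, `loaded` only contains `drugs`, and the drug descriptors file ends up in `skipped` because `drug_descriptors` is not in the attribute set. Adding that name to the set would be enough for the file to be picked up in this sketch.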
A few questions:
1- On what to import, I'd say we keep everything (except for `full`) and add in `combinations`, BUT only allow people to download what is available.
2- I'm not sure what to do about `drug_descriptors`. I feel like that should be loaded with the drugs, but I'm open to adding another argument to the loader.
Just for clarification: the CLI command `> coderdata download [--prefix NAME]` downloads everything on figshare (or the subset that shares the defined `--prefix`). I will be implementing a way to download via the API as well. I haven't looked into @jjacobson95's downloader code yet, but I am assuming that I will be able to repurpose a lot of it and then just wrap that function for the CLI.
Do you also want to be able to retrieve only specific `data_type`(s)? E.g. `cd.download(directory=cwd, prefix='all', data_type='all')` and `> coderdata download [--prefix NAME] [--data_type DTYPE]` for the API call and CLI respectively, where `data_type` / `--data_type` would be used to define that we only want, let's say, `samples`.
`drug_descriptors` attribute to `Dataset`, that would then automatically be populated during `Dataset.load` if a `[dataset]_drug_descriptors.tsv.gz` is available.
First off: please remove the `prefix` argument. It's hard to interpret and also not specific. Please use `dataset` to describe the dataset.
`data_type` argument if you choose, but default to `all`.
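A rough sketch of how the requested signature could look, assuming `dataset` replaces `prefix` and `data_type` defaults to `all`. Everything here is illustrative and not the coderdata implementation: the `download` body is a placeholder that only echoes its arguments, and defaulting `dataset` to `'all'` and `directory` to `'.'` are assumptions carried over from the `prefix='all'` and `directory=cwd` example above.

```python
import argparse
from typing import Dict, List, Optional


def download(dataset: str = "all", data_type: str = "all",
             directory: str = ".") -> Dict[str, str]:
    """Placeholder: reports what would be requested, performs no network I/O."""
    return {"dataset": dataset, "data_type": data_type, "directory": directory}


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI that mirrors the API call: `coderdata download ...`."""
    parser = argparse.ArgumentParser(prog="coderdata")
    sub = parser.add_subparsers(dest="command", required=True)
    dl = sub.add_parser("download", help="download data files")
    # `--dataset` takes the place of the former `--prefix` option.
    dl.add_argument("--dataset", default="all",
                    help="name of the dataset to download (default: all)")
    dl.add_argument("--data_type", default="all",
                    help="restrict to one data type, e.g. samples (default: all)")
    return parser


def main(argv: Optional[List[str]] = None) -> Dict[str, str]:
    """Parse CLI arguments and forward them to the API function."""
    args = build_parser().parse_args(argv)
    return download(dataset=args.dataset, data_type=args.data_type)


# Equivalent of `coderdata download --dataset beataml --data_type samples`
request = main(["download", "--dataset", "beataml", "--data_type", "samples"])
```

Keeping the CLI as a thin wrapper around `download` means both entry points share the same defaults, so `coderdata download` without options and `download()` request the same thing.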
We discussed some basic alterations to the Python functions that are described in the `doc-update` branch here: https://github.com/PNNL-CompBio/coderdata/blob/doc-update/README.md
These changes entail:
`dataset`
`list`, `download`, and `load` for the CoderData package
`dataset` object including: `train_test_validate` (already exists), `save`, `types` and `format`
This will resolve #228 and #229, which I will close as duplicates.