PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

Update python functions to adhere to simpler standards and pre-format data #246

Open sgosline opened 2 weeks ago

sgosline commented 2 weeks ago

We discussed some basic alterations to the python functions that are described in the doc-update branch here: https://github.com/PNNL-CompBio/coderdata/blob/doc-update/README.md

These changes entail:

  1. renaming DatasetLoader to just dataset
  2. creating list, download, and load functions for the CoderData package
  3. creating methods on the new dataset object, including: train_test_validate (already exists), save, types, and format

This will resolve #228 and #229, which I will close as duplicates.

ymahlich commented 1 week ago

Clarification on Dataset class attributes

As per 2ec2c62 the Dataset object should (can?) contain:

Currently the DataLoader object (to be refactored into the Dataset object) also contains:

It is missing:

Do we want to keep the additional attributes (and add combinations), or should we remove them and add them back as needed, i.e. once we have datasets that include those data types?

Existing data sets on figshare that currently don't get imported:

Data ingestion is currently implemented as follows: the 'loader' checks whether the data_type descriptor in the file name is also an attribute of the DataLoader object. If it is, the loader imports the contents of the file and stores them in the object.

For example: Assuming we downloaded all BeatAML files from figshare and load beataml (data = DataLoader('beataml')), the loader will find a file called beataml_drugs.tsv.gz. The loader then extracts from that file name that the data_type should be drugs, "sees" that there is a drugs attribute in DataLoader, and therefore imports the data file and stores it in DataLoader.drugs. The problem is that as of v.0.1.4 we also have files like beataml_drug_descriptors.tsv.gz, which should be imported into DataLoader.drug_descriptor (I assume). That attribute doesn't exist as of now, so the file is NOT imported. Is this an oversight? Was that something that @jjacobson95 was planning on implementing but hadn't gotten to?
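
To make the behaviour described above concrete, here is a minimal sketch of the matching logic. It is not the actual coderdata implementation: the class name DataLoaderSketch, the DATA_TYPES tuple, and the skipped list are hypothetical, and the sketch only records file names instead of reading the files.

```python
class DataLoaderSketch:
    """Hypothetical stand-in for the loader described above (not the real class)."""

    # Assumption: only these data types exist as attributes on the object.
    DATA_TYPES = ("samples", "drugs", "experiments")

    def __init__(self, name, filenames):
        self.name = name
        self.skipped = []  # files whose data_type has no matching attribute
        for data_type in self.DATA_TYPES:
            setattr(self, data_type, None)
        for fname in filenames:
            self._ingest(fname)

    def _ingest(self, fname):
        prefix = self.name + "_"
        if not fname.startswith(prefix):
            return
        # "beataml_drug_descriptors.tsv.gz" -> "drug_descriptors"
        data_type = fname[len(prefix):].split(".", 1)[0]
        if data_type in self.DATA_TYPES:
            # A real loader would read the file here; the sketch stores the name.
            setattr(self, data_type, fname)
        else:
            self.skipped.append(fname)


files = ["beataml_drugs.tsv.gz", "beataml_drug_descriptors.tsv.gz"]
data = DataLoaderSketch("beataml", files)
print(data.drugs)    # the drugs file is picked up
print(data.skipped)  # the descriptors file is silently ignored
```

Under these assumptions, adding "drug_descriptors" to DATA_TYPES would be enough for the second file to be picked up as well.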

sgosline commented 1 week ago

A few questions:

  1. On what to import, I'd say we keep everything (except for full) and add in combinations. BUT only allow people to download what is available.
  2. I'm not sure what to do about drug_descriptors. I feel like that should be loaded with the drugs, but I'm open to adding another argument to the loader.

ymahlich commented 1 week ago

Just for clarification:

  1. Download is currently handled via the command line (i.e. > coderdata download [--prefix NAME]), which downloads everything on figshare (or the subset that shares the defined --prefix). I will be implementing a way to download via the API as well. I haven't looked into @jjacobson95's downloader code yet, but I assume I will be able to repurpose a lot of it and then just wrap that function for the CLI. Do you also want to be able to retrieve only specific data_type(s)? E.g. cd.download(directory=cwd, prefix='all', data_type='all') for the API call and > coderdata download [--prefix NAME] [--data_type DTYPE] for the CLI, where data_type / --data_type would restrict the download to, say, only samples.
  2. The "simplest" thing to do is to add a drug_descriptors attribute to Dataset, which would then be populated automatically during Dataset.load if a [dataset]_drug_descriptors.tsv.gz is available.

sgosline commented 1 week ago

First off: please remove the prefix argument. It's hard to interpret and also not specific. Please use dataset as the argument that names the dataset.

  1. You can definitely add the data_type argument if you choose, but default to all.
  2. Can you identify the use case in which someone would want the drug information without the descriptors? If not, just download them both at once.
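
For reference, the selection logic discussed in this thread (a dataset argument instead of prefix, plus an optional data_type that defaults to all) could be sketched roughly as below. The function name select_files is hypothetical and not part of coderdata, the sketch only filters file names and downloads nothing, and it assumes dataset names contain no underscore.

```python
def select_files(available, dataset="all", data_type="all"):
    """Return the file names that match the requested dataset and data type."""
    selected = []
    for fname in available:
        stem = fname.split(".", 1)[0]         # "beataml_drugs"
        name, _, dtype = stem.partition("_")  # ("beataml", "_", "drugs")
        if dataset != "all" and name != dataset:
            continue
        if data_type != "all" and dtype != data_type:
            continue
        selected.append(fname)
    return selected


# Only the first two names come from this thread; the others are made up.
available = [
    "beataml_drugs.tsv.gz",
    "beataml_drug_descriptors.tsv.gz",
    "beataml_samples.tsv.gz",
    "otherset_drugs.tsv.gz",
]
beataml_drugs = select_files(available, dataset="beataml", data_type="drugs")
everything = select_files(available)
```

If drugs and descriptors should always travel together, the data_type check could treat drug_descriptors as part of drugs instead of as a separate type.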