Refinebio API Python Client Requirements

kurtwheeler commented 4 years ago

Data Model Classes

All data models should have a corresponding class.

However, users shouldn't need to work with these classes directly.
I don't think it's worth implementing yet, but I think it should be easy to use a cache for objects retrieved from the API.
All Model classes should have get and list class methods.
- get should default to accession code (for samples/experiments), but optionally take refinebio IDs instead.
  - We'll probably need some logic to detect the accession type and use it appropriately (accession vs alternate_accession)
- list should accept all the filter parameters the corresponding endpoint does.
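
The accession detection mentioned above could be sketched roughly as follows. This is only an illustration: classify_identifier and the regular expressions are hypothetical, the patterns cover just the common GEO, SRA and ArrayExpress shapes, and deciding whether a match maps to accession or alternate_accession is left to the caller.

```python
import re

# Illustrative patterns only (assumption); the real client may need more cases.
_ACCESSION_PATTERNS = {
    "GEO": re.compile(r"^GS[EM]\d+$"),
    "SRA": re.compile(r"^[SED]R[PRXS]\d+$"),
    "ARRAY_EXPRESS": re.compile(r"^E-[A-Z]{4}-\d+$"),
}


def classify_identifier(identifier):
    """Return ("id", None) for numeric refine.bio IDs, else ("accession", source)."""
    if isinstance(identifier, int):
        return ("id", None)
    text = str(identifier).strip().upper()
    if text.isdigit():
        return ("id", None)
    for source, pattern in _ACCESSION_PATTERNS.items():
        if pattern.match(text):
            return ("accession", source)
    raise ValueError(
        f"{identifier!r} does not look like a refine.bio ID or a "
        "GEO, SRA, or ArrayExpress accession code."
    )
```

A get implementation could call this first and then pick the lookup field based on the result.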
Experiments will need some additional functionality:
- search (there's a chance this should just be called list but use the search endpoint)
- get_uploader_contact() will return the contact information for the uploader
Samples:
- get_uploader_contact() should call the corresponding experiment's get_uploader_contact() method.
Datasets will need a number of special methods:
- process() will send the dataset to refine.bio for processing
- check() will see if the dataset has finished processing or not.
- download(path) will download the dataset to disk if it has finished processing
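
As a rough sketch of how process(), check() and download(path) might be used together, here is a polling loop around a stand-in class. FakeDataset, wait_and_download, poll_seconds and max_checks are all hypothetical names for illustration, and nothing here talks to the real API.

```python
import time


class FakeDataset:
    """Stand-in for the proposed Dataset class; makes no network calls."""

    def __init__(self, checks_until_ready=3):
        self._remaining = checks_until_ready
        self._started = False
        self._ready = False

    def process(self):
        # The real method would submit the dataset to refine.bio.
        self._started = True

    def check(self):
        # Pretend processing finishes after a fixed number of checks.
        if self._started and not self._ready:
            self._remaining -= 1
            self._ready = self._remaining <= 0
        return self._ready

    def download(self, path):
        if not self._ready:
            raise RuntimeError("Dataset has not finished processing yet.")
        return path  # The real method would write the file to disk.


def wait_and_download(dataset, path, poll_seconds=0.0, max_checks=10):
    """Start processing, poll until done, then download to path."""
    dataset.process()
    for _ in range(max_checks):
        if dataset.check():
            return dataset.download(path)
        time.sleep(poll_seconds)
    raise TimeoutError(f"Dataset was not ready after {max_checks} checks.")
```

A high-level download_dataset function could wrap a loop like this so most users never have to poll by hand.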

High Level Functions

We'll need a number of high level functions:

- help(entity=None)
  - If called with no entity, gives a high-level overview of how to use the client. Otherwise, prints the docstring for the given entity.
- download_dataset(dataset_dict=None, experiments=None, samples=None, aggregation="EXPERIMENT", transformation="NONE", skip_quantile_normalization=False)
- download_compendium(organism, path, quant_sf_only=False)
- download_quantfile_compendium(organism, path)
  - will just call download_compendium(organism, path, quant_sf_only=True)
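
The relationship between the two compendium functions is small enough to show directly. In this sketch the body of download_compendium is a placeholder that returns a description instead of downloading anything, and the organism string format is an assumption.

```python
def download_compendium(organism, path, quant_sf_only=False):
    """Placeholder body: the real function would fetch the compendium."""
    kind = "quant.sf-only" if quant_sf_only else "normalized"
    return f"{kind} compendium for {organism} -> {path}"


def download_quantfile_compendium(organism, path):
    """Convenience wrapper that only fixes quant_sf_only=True."""
    return download_compendium(organism, path, quant_sf_only=True)
```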

API Tokens

We'll need to have good support for the tokens. The following functions will be needed:

create_token(email_address)
agree_to_terms_and_conditions(api_token)
save_token(api_token, file_path=os.getenv("CONFIG_FILE", "~/.refinebio.yaml"))
load_token(file_path=os.getenv("CONFIG_FILE", "~/.refinebio.yaml"))

I'd like to do the config file as a YAML file so we can add additional keys to it in the future if that ever becomes necessary. (Off the top of my head I could think of maybe allowing someone to turn on the cache with them if we ever built that.)
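
One detail worth noting about the signatures above: a default of os.getenv(...) is evaluated once, when the function is defined, and "~" is not expanded by open(). A minimal sketch of one way around both, assuming a hypothetical helper named resolve_config_path that save_token and load_token would call:

```python
import os

DEFAULT_CONFIG_PATH = "~/.refinebio.yaml"


def resolve_config_path(file_path=None):
    """Pick the config file path at call time and expand "~"."""
    if file_path is None:
        # Read the environment on each call, not at import time.
        file_path = os.getenv("CONFIG_FILE", DEFAULT_CONFIG_PATH)
    return os.path.abspath(os.path.expanduser(file_path))
```

Reading and writing the YAML itself is left out here on purpose.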

Documentation/Error messages

These will be really important. We'll want to thoroughly validate all the parameters to all functions so if any are invalid we can explain it nicely. All classes, methods, and functions should have a docstring that explains all parameters, along with an easy-to-understand description of what it can do and how to use it.
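
To make the kind of error message I have in mind concrete, here is a small sketch. validate_choice is a hypothetical helper, and the option tuples are assumptions for illustration; only "EXPERIMENT" and "NONE" appear as defaults in the signatures above.

```python
# Assumed option sets (illustration only); check the API docs for real values.
ALLOWED_AGGREGATIONS = ("EXPERIMENT", "SPECIES", "ALL")
ALLOWED_TRANSFORMATIONS = ("NONE", "MINMAX", "STANDARD")


def validate_choice(name, value, allowed):
    """Return value if valid, else raise a ValueError listing the options."""
    if value not in allowed:
        options = ", ".join(repr(option) for option in allowed)
        raise ValueError(
            f"Invalid value {value!r} for '{name}'. Expected one of: {options}."
        )
    return value
```

For example, validate_choice("transformation", "zscore", ALLOWED_TRANSFORMATIONS) raises an error that names the parameter and lists every accepted value.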

kurtwheeler commented 4 years ago

I'll also leave these use cases written up by @dvenprasad :+1:

API Client Use Cases

1

I’m a bioinformatician that has recently finished up a postdoc and is trying to get out a final publication. I used two publicly available cohorts as validation in my manuscript. One of the reviewers wants me to analyze an additional four data sets that I was unaware of upon submission. Unfortunately, since I’ve left my institution, I no longer have access to the computing resources I need for reprocessing the raw data. I would like all six cohorts to be processed the same way to feel confident in my results and some of the necessary details are missing from the accessions in ArrayExpress.

System Actions

User must be able to search with a list of experiment accession numbers, either from GEO or SRA or ArrayExpress.

2

I'm a bioinformatician that wants to predict lab- or institution-specific effects in C. albicans gene expression data. I need to do some preliminary analyses. I attempt to remove nuisance variables (like strain or platform-specific effects) and then cluster samples. I want to know if the resulting clusters are enriched for different institutions. I need the submitter supplied organization name and associated email address to do this experiment.

System Actions

User must be able to download a matrix of gene expression values for an organism.
User must be able to easily extract the mandatory GEO fields through the API and GUI.

3

I'm an ML researcher and I want to know if building deeper models can tell me about tissue specificity and the interaction between transcription factors in human expression data. I need my input values to be between zero and one and tissue labels for some portion of samples.

System Actions

Users must be able to transform a dataset before download.
Also need metadata to be available alongside samples.
Need to be able to download gene expression matrix.

4

I'm an ML researcher that has built a classifier that can predict the presence or absence of a particular mutation in a tumor using gene expression data as input, dropping all near zero variance genes. Now I want to identify cell lines that have similar mutation profiles. Once I find these, I’d like to identify collaborators who have uploaded samples to GEO with such profiles.

System Actions

Users must be able to extract uploader’s contact information via API.

5

I'm a computational biologist that works on P. aeruginosa and I've found a large data set that appears to be an experiment using the antibiotics I'm interested in on GEO. This data set doesn't have a publication associated with it and there's no strain information provided, but this data set could be an important validation set for my work. If I had a processed P. aeruginosa where most samples had strain information, I could cluster the samples from this unlabeled experiment or build a classifier to predict strain.

System Actions

Search by organism.

6

I am a medulloblastoma researcher that is interested in doing a meta-analysis of all available medulloblastoma data. I would like to run CoGAPS and relate the patterns I find back to histology. I’ve explored the data somewhat using the medulloblastoma data scope on R2 and I’ve noticed the data come from different platforms. It would significantly speed up my research if both the gene expression data and the histology labels were normalized in some way.

System Actions

Search with accession numbers.
Download the data.

7

I am a computational biologist studying osteosarcoma in zebrafish. I have some raw data and I would like to validate it against refine.bio data. I need to process my data similarly to refine.bio.

System Actions

Users must be able to get QN targets by species.
Users must be able to get transcriptome indices by species.

8

I am a computational biologist studying hypoxia in zebrafish. For my experiment design, I only require the quant.sf output from salmon.

System Actions

Download quant.sf files by experiment or sample accessions, or by organism.

9

I'm an ML researcher studying medulloblastoma. I have built a model which predicts if Drug X works on medulloblastoma. I am interested in observing its effects in medulloblastoma-adjacent diseases.

System Actions

Download the species compendia for Homo sapiens.

10

I am a computational biologist studying the effect of a drug on medulloblastoma tumors. My model needs data values to be between 0 and 1.

System Actions

Users must be able to choose whether they want their data quantile normalized or to skip quantile normalization.
Choose transformations to apply to their dataset. (Z-score, Zero to one, None)
Choose the type of files they want to receive (aggregated files or quant.sf files, the latter available only for RNA-seq samples).
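
For reference, the two named transformations are sketched below using their textbook definitions on a plain list of values. This is not refine.bio's server-side code, which may differ (for instance in whether scaling is applied per gene or per sample); zero_to_one and z_score are hypothetical names.

```python
from statistics import mean, pstdev


def zero_to_one(values):
    """Min-max scale values into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # Constant input: avoid dividing by zero.
    return [(v - low) / (high - low) for v in values]


def z_score(values):
    """Center values on their mean and scale by population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # Constant input: avoid dividing by zero.
    return [(v - mu) / sigma for v in values]
```

Use case 10 above (values between 0 and 1) corresponds to the zero-to-one option.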

kurtwheeler commented 4 years ago

I forgot to link to this: https://docs.google.com/document/d/1vW05jWAEGzuzC6-py00t43MMTTQ2F73qU9c5-DZxGKk/edit

It's the OG doc we made when we were working with a contractor to build this.

kurtwheeler commented 3 years ago

Oh snap, I just found this! I didn't know it was still around. The API client is done! (Well, as done as software ever is.)
