Closed kurtwheeler closed 3 years ago
I'll also leave these use cases written up by @dvenprasad :+1:
I’m a bioinformatician that has recently finished up a postdoc and is trying to get out a final publication. I used two publicly available cohorts as validation in my manuscript. One of the reviewers wants me to analyze an additional four data sets that I was unaware of upon submission. Unfortunately, since I’ve left my institution, I no longer have access to the computing resources I need for reprocessing the raw data. I would like all six cohorts to be processed the same way to feel confident in my results and some of the necessary details are missing from the accessions in ArrayExpress.
System Actions
I'm a bioinformatician that wants to predict lab- or institution-specific effects in C. albicans gene expression data. I need to do some preliminary analyses. I attempt to remove nuisance variables (like strain or platform-specific effects) and then cluster samples. I want to know if the resulting clusters are enriched for different institutions. I need the submitter supplied organization name and associated email address to do this experiment.
System Actions
I'm an ML researcher and I want to know if building deeper models can tell me about tissue specificity and the interaction between transcription factors in human expression data. I need my input values to be between zero and one and tissue labels for some portion of samples.
System Actions
I'm an ML researcher that has built a classifier that can predict the presence or absence of a particular mutation in a tumor using gene expression data as input, dropping all near-zero-variance genes. Now I want to identify cell lines that have similar mutation profiles. Once I find these, I’d like to identify collaborators who have uploaded samples to GEO with such profiles.
System Actions
I'm a computational biologist that works on P. aeruginosa and I've found a large data set on GEO that appears to be an experiment using the antibiotics I'm interested in. This data set doesn't have a publication associated with it and there's no strain information provided, but it could be an important validation set for my work. If I had a processed P. aeruginosa data set where most samples had strain information, I could cluster the samples from this unlabeled experiment or build a classifier to predict strain.
System Actions
* Search by organism.
I am a medulloblastoma researcher that is interested in doing a meta-analysis of all available medulloblastoma data. I would like to run CoGAPS and relate the patterns I find back to histology. I’ve explored the data somewhat using the medulloblastoma data scope on R2 and I’ve noticed the data come from different platforms. It would significantly speed up my research if both the gene expression data and the histology labels were normalized in some way.
System Actions
I am a computational biologist studying osteosarcoma in zebrafish. I have some raw data and I would like to validate it against refine.bio data. I need to process my data similarly to refine.bio.
System Actions
I am a computational biologist studying hypoxia in zebrafish. For my experiment design, I only require the quant.sf output from salmon.
System Actions
I'm an ML researcher studying medulloblastoma. I have built a model which predicts if Drug X works on medulloblastoma. I am interested in observing its effects in medulloblastoma-adjacent diseases.
System Actions
I am a computational biologist studying the effect of a drug on medulloblastoma tumors. My model needs data values to be between 0 and 1.
System Actions
I forgot to link to this: https://docs.google.com/document/d/1vW05jWAEGzuzC6-py00t43MMTTQ2F73qU9c5-DZxGKk/edit
It's the OG doc we made when we were working with a contractor to build this.
Oh snap, I just found this! I didn't know it was still around. The API client is done! (Well, as done as software ever is.)
Data Model Classes
All data models should have a corresponding class.
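A minimal sketch of what such classes might look like, with the get and list class methods and the get_uploader_contact() delegation this section asks for. Everything beyond those three method names is an assumption made for illustration: the Model base class, the save() helper, the in-memory registry that stands in for real API requests, and the field names on Experiment and Sample.

```python
from dataclasses import dataclass
from typing import ClassVar, Dict


class Model:
    """Base for data model classes; provides the get and list class methods."""

    # Hypothetical in-memory registry standing in for real API requests.
    _store: ClassVar[Dict[str, "Model"]]

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._store = {}  # each model class gets its own registry

    def save(self):
        """Register this instance so get/list can find it (sketch-only helper)."""
        self._store[self.accession_code] = self
        return self

    @classmethod
    def get(cls, accession_code):
        """Return the single instance with this accession code."""
        try:
            return cls._store[accession_code]
        except KeyError:
            raise LookupError(
                f"No {cls.__name__} with accession code {accession_code!r}"
            ) from None

    @classmethod
    def list(cls, **filters):
        """Return every instance whose attributes match all given filters."""
        return [
            model
            for model in cls._store.values()
            if all(getattr(model, key) == value for key, value in filters.items())
        ]


@dataclass
class Experiment(Model):
    accession_code: str
    submitter_name: str   # assumed field name
    submitter_email: str  # assumed field name

    def get_uploader_contact(self):
        """Return the contact information for the uploader."""
        return {"name": self.submitter_name, "email": self.submitter_email}


@dataclass
class Sample(Model):
    accession_code: str
    experiment_accession_code: str  # assumed link back to the experiment

    def get_uploader_contact(self):
        """Delegate to the corresponding experiment."""
        return Experiment.get(self.experiment_accession_code).get_uploader_contact()


# Made-up records, purely for illustration.
Experiment("GSE0000", "A. Researcher", "researcher@example.org").save()
sample = Sample("GSM0000", "GSE0000").save()
contact = sample.get_uploader_contact()
```

Keeping the contact lookup on the experiment and delegating from the sample means there is only one place that knows where uploader details live.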
Each class should have a get and list class method.
get_uploader_contact() will return the contact information for the uploader.
On models that belong to an experiment, get_uploader_contact() should call the corresponding experiment's get_uploader_contact() method.
High Level Functions
We'll need a number of high level functions:
help(entity=None)
download_dataset(dataset_dict=None, experiments=None, samples=None, aggregation="EXPERIMENT", transformation="NONE", skip_quantile_normalization=False)
download_compendium(organism, path, quant_sf_only=False)
download_quantfile_compendium(organism, path), equivalent to download_compendium(organism, path, quant_sf_only=True)
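As a rough sketch, the relationship between the two compendium functions could be as simple as a thin wrapper. The signatures come from the list above; the _fetch_compendium_bytes helper is hypothetical and returns placeholder bytes instead of making a network request, and the organism string used in the example is only an illustration.

```python
import tempfile
from pathlib import Path


def _fetch_compendium_bytes(organism, quant_sf_only):
    """Hypothetical stand-in for the HTTP download; returns placeholder bytes."""
    kind = "quant.sf" if quant_sf_only else "normalized"
    return f"{kind} compendium for {organism}\n".encode()


def download_compendium(organism, path, quant_sf_only=False):
    """Write the compendium for `organism` to `path` and return the final path."""
    if not isinstance(organism, str) or not organism.strip():
        raise ValueError("organism must be a non-empty string")
    destination = Path(path).expanduser()
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(_fetch_compendium_bytes(organism, quant_sf_only))
    return destination


def download_quantfile_compendium(organism, path):
    """Convenience wrapper: the same download with quant_sf_only=True."""
    return download_compendium(organism, path, quant_sf_only=True)


# Example run against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = download_quantfile_compendium("DANIO_RERIO", Path(tmp) / "zebrafish.zip")
    contents = saved.read_bytes()
```

Routing the wrapper through download_compendium keeps validation and file handling in one place, so the two entry points cannot drift apart.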
API Tokens
We'll need to have good support for the tokens. The following functions will be needed:
create_token(email_address)
agree_to_terms_and_conditions(api_token)
save_token(api_token, file_path=os.getenv("CONFIG_FILE", "~/.refinebio.yaml"))
load_token(file_path=os.getenv("CONFIG_FILE", "~/.refinebio.yaml"))
I'd like to do the config file as a YAML file so we can add additional keys to it in the future if that ever becomes necessary. (Off the top of my head I could think of maybe allowing someone to turn on the cache with them if we ever built that.)
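Here is one way save_token and load_token might work, sketched under a few assumptions. To stay dependency-free it reads and writes only flat "key: value" lines, which are valid YAML for simple string values; a real client would presumably use a proper YAML library instead. The "token" key name and the _config_path and _read_settings helpers are made up for this sketch. It also resolves the default path at call time and expands "~", which differs slightly from the signatures above, where os.getenv would be evaluated once at import.

```python
import os
import tempfile
from pathlib import Path

DEFAULT_CONFIG = "~/.refinebio.yaml"


def _config_path(file_path=None):
    """Resolve the config location at call time and expand '~'."""
    if file_path is None:
        file_path = os.getenv("CONFIG_FILE", DEFAULT_CONFIG)
    return Path(file_path).expanduser()


def _read_settings(path):
    """Parse flat 'key: value' lines; a missing file yields an empty dict."""
    settings = {}
    if path.exists():
        for line in path.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition(":")
            settings[key.strip()] = value.strip()
    return settings


def save_token(api_token, file_path=None):
    """Store the token while preserving any other keys already in the file."""
    path = _config_path(file_path)
    settings = _read_settings(path)
    settings["token"] = api_token  # key name is an assumption
    lines = [f"{key}: {value}" for key, value in settings.items()]
    path.write_text("\n".join(lines) + "\n")


def load_token(file_path=None):
    """Return the saved token, or None if no token has been saved."""
    return _read_settings(_config_path(file_path)).get("token")


# Example round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    config = os.path.join(tmp, "refinebio.yaml")
    before = load_token(config)
    save_token("example-token-123", config)
    after = load_token(config)
```

Because save_token rewrites the whole mapping rather than just the token line, extra keys added to the file later survive a token refresh.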
Documentation/Error messages
These will be really important. We'll want to thoroughly validate all the parameters to all functions so if any are invalid we can explain it nicely. All classes, methods, and functions should have a docstring that explains all parameters, along with an easy-to-understand description of what it can do and how to use it.
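To make the validation idea concrete, here is a small sketch for three of the download_dataset parameters. The lists of allowed values are assumptions (only "EXPERIMENT" and "NONE" appear above, as defaults), and InvalidParameter and validate_download_dataset_args are hypothetical names. It collects every problem before raising, so a caller sees all mistakes at once, and uses difflib to suggest the closest valid value.

```python
import difflib

# Assumed allowed values; only the defaults are confirmed by the signatures above.
VALID_AGGREGATIONS = ("EXPERIMENT", "SPECIES", "ALL")
VALID_TRANSFORMATIONS = ("NONE", "MINMAX", "STANDARD")


class InvalidParameter(ValueError):
    """Raised when one or more arguments fail validation."""


def _describe(name, value, allowed):
    """Build a friendly message, with a suggestion when one is close enough."""
    message = f"{name}={value!r} is not valid; expected one of {', '.join(allowed)}."
    guess = difflib.get_close_matches(str(value).upper(), allowed, n=1)
    if guess:
        message += f" Did you mean {guess[0]!r}?"
    return message


def validate_download_dataset_args(
    aggregation="EXPERIMENT",
    transformation="NONE",
    skip_quantile_normalization=False,
):
    """Check download_dataset options and explain every problem found.

    Parameters:
        aggregation: how samples are grouped in the output.
        transformation: how expression values are scaled.
        skip_quantile_normalization: True to skip quantile normalization.

    Raises:
        InvalidParameter: with one line per invalid argument.
    """
    problems = []
    if aggregation not in VALID_AGGREGATIONS:
        problems.append(_describe("aggregation", aggregation, VALID_AGGREGATIONS))
    if transformation not in VALID_TRANSFORMATIONS:
        problems.append(_describe("transformation", transformation, VALID_TRANSFORMATIONS))
    if not isinstance(skip_quantile_normalization, bool):
        problems.append(
            "skip_quantile_normalization must be True or False, "
            f"got {skip_quantile_normalization!r}."
        )
    if problems:
        raise InvalidParameter("\n".join(problems))
```

A call such as validate_download_dataset_args(transformation="minmax") would raise InvalidParameter with a message that names the parameter, lists the accepted values, and suggests 'MINMAX'.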