Brainstorm: Example Experiments and Queries

Miserlou commented 6 years ago

We need examples of the types of experiments and queries which are going to be executed against data-refinery data.

These examples should include the perspectives of a bioinformaticist, a bench researcher, and a machine learning researcher. Ideally, we can create lots of these and have one canonical example each for beginner ("Hello, World!"), intermediate, and advanced scenarios, which we can turn into tutorials and test cases in CI. I'm hoping we can get about 25-50 of these examples at a minimum.

This is homework for @jaclyn-taroni, @dvenprasad, probably @cgreene, and anybody else who wants to jump in.

cgreene commented 6 years ago

Perspective: bioinformaticist (ML-focused)

Situation: I've (@gwaygenomics has) built a classifier for Ras activity that uses gene expression. I'd like to apply that to all samples in SRA/GEO/ArrayExpress/ENA to identify samples with aberrant Ras signaling.

jaclyn-taroni commented 6 years ago

Here are examples that @dvenprasad and I worked on together. They extend beyond the functionality of the initial version of the Data Refinery and possibly into future work applying different models across Data Refinery data (e.g. a "model mill").

I think this should be helpful if part of the goal with this issue is to future-proof design choices & this level of detail will be good for worked examples down the road. We'll follow up with tickets based on version scope.

One

Applicable to: Data Whiz

I’m an ML researcher that specializes in natural language processing in biomedical literature. I’ve analyzed all papers in PubMed that have a Gene Expression Omnibus (GEO) data set connected to them. I've analyzed things like the co-occurrence of disease labels and pathway names (e.g., KEGG). I’m wondering if I can predict the “behavior” of the pathways in the GEO dataset from the text. For example, can I tell if a pathway will be downregulated in disease samples compared to controls based on the text of a paper and what might the text tell me about the magnitude of that change?

Consequently, I have a list of data sets that I know contain healthy controls for comparison based on my analysis of the submitter-supplied GEO abstract/description text.

I want to use multiple kinds of tools for pathway analysis and I have come up with a list of genes that I want to study based on the pathways I’m interested in.

I only want data sets that fit the following criteria: each data set "covers" at least 90% of the genes in the list I want to study, and each data set has a publication associated with it.

System Actions

User must be able to search with a list of accession numbers (e.g., comma-separated values).
User must be able to filter by providing a list of genes and setting a tolerance for missing genes. The system should filter results based on the genes present in the samples of each dataset, not on the genes that can be measured on a platform.
Filter by whether a dataset is associated with a publication or not.
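
One way to read the missing-gene tolerance above is as a per-dataset coverage fraction. The following is a minimal sketch, assuming each dataset can be represented as the set of gene identifiers actually present in its samples. The function names, the accessions, and the example data are hypothetical and not part of any Data Refinery API.

```python
def gene_coverage(query_genes, dataset_genes):
    """Fraction of the query genes that are present in a dataset."""
    query = set(query_genes)
    if not query:
        return 0.0
    return len(query & set(dataset_genes)) / len(query)


def filter_by_coverage(query_genes, datasets, min_coverage=0.9):
    """Return accessions of datasets covering at least min_coverage of the query."""
    return [
        accession
        for accession, genes in datasets.items()
        if gene_coverage(query_genes, genes) >= min_coverage
    ]


# Hypothetical data: accession -> genes actually present in the samples.
datasets = {
    "GSE0001": {"TP53", "KRAS", "NRAS", "HRAS", "BRAF"},
    "GSE0002": {"TP53", "KRAS"},
}
query = ["TP53", "KRAS", "NRAS", "HRAS", "BRAF"]
kept = filter_by_coverage(query, datasets, min_coverage=0.9)
```

Here the tolerance is expressed as a minimum coverage (0.9 means up to 10% of the listed genes may be missing); a real implementation might expose it the other way around.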

Two

Applicable to: Data Whiz

I’m a computational biologist that wants to predict cell line passage number and cell line identity from gene expression data. I have generated my own gold standards from the following breast cancer cell lines: MCF-7, SkBr3, BT-483 and BT-474. I only want gene expression data that are reported to be samples from these cell lines for validation. I'm particularly interested in data sets where passage number is reported in the sample metadata. I would also love to provide some summary statistics about usage of different breast cancer cell lines in publicly available transcriptomics data (e.g., 45% of samples are from MCF-7) in the introduction of my paper if possible.

System Actions

User must be able to do a full-text search of the abstract and experiment name with key terms like a cell-line name, terms like “breast cancer cell-line”, or a list of cell-lines.
User must be able to select certain search results to download later (shopping cart) based on their expert knowledge.
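
The cell line usage summary mentioned in this scenario could be computed from per-sample annotations roughly as follows. This is a sketch only: it assumes cell line names have already been extracted and harmonized from the sample metadata (which is the hard part), and the labels below are made up for illustration.

```python
from collections import Counter


def cell_line_usage(cell_lines):
    """Percentage of samples per cell line, most common first."""
    counts = Counter(cell_lines)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {line: 100 * n / total for line, n in counts.most_common()}


# Made-up annotations for 20 samples, one cell line label per sample.
labels = ["MCF-7"] * 9 + ["BT-474"] * 6 + ["SkBr3"] * 5
usage = cell_line_usage(labels)
```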

Three

Applicable to: Data Whiz

I’m a developmental biologist but I’m comfortable working with single cell RNA-seq (scRNA-seq), bulk RNA-seq and microarray data. I work with C. elegans. I’ve done a single cell RNA-seq experiment in a particular line/mutant that has an expansion in one cell lineage and those "extra" cells also demonstrate a change in expression patterns as compared to the wild type lineage (in adult worms).

I want to use bulk (whole worm) RNA-seq data from different labs that sample different stages of development for validation and exploratory analyses. I know that this data is out there based on my review of the literature.

If I can find microarray data sets of relevant mutants (relevance is determined based on my expert knowledge), that would really help take my research to the next level (assume I have an awesome method for cell type deconvolution, but it’s only appropriate for use with data from one microarray platform).

Also, I have a grant submission due soon and to demonstrate relevance to human health with some preliminary data, I want to look at whether these changes in expression patterns are present in whole tissue gastrointestinal tract biopsies in disease.

System Actions

User must be able to filter search results by data type (i.e., RNA-seq, microarray), source organ (e.g., cell-lines, blood), organism, disease, platform, and source of the data (i.e., submitter institution).

Four

Applicable to: Data Whiz, Bio Expert, Physician

I'm a computational biologist that's developed a model that can predict (to pick a property somewhat randomly) tumor size from transcriptomics data. I have some evidence to suggest that larger tumor size is associated with better outcomes in certain cancers (A, B, and C). I remember coming across some microarray studies in cancers A, B, and C that look specifically at treatment response. I want to apply my model to baseline samples to infer tumor size and then examine the relationship between predicted tumor size and treatment response.

System Actions

User must be able to search by disease, and filter to retain only results which have treatment information.

Five

Applicable to: Data Whiz

I’m a computational biologist that is somewhat comfortable with machine learning. I work in industry at a company that wants to develop anti-fibrotic agents. Lots of internal research has been done using qPCR arrays rather than genome-wide assays like RNA-seq or microarrays. I want to find all gene expression data (on any platform) that's related to the disorders I study based on a list of disease terms that was given to me when I started work on this project.

It also might make sense for me to transform the Data Refinery data to make it more comparable to my qPCR data from my company or vice versa, but I'm not sure how to approach that.

System Actions

User must be able to input the type of data they would be comparing the data from the refinery to.
User must be able to download raw data from the search results.
Suggest transformation techniques or do the transformation for them and give them the data.

Six

Applicable to: Data Whiz, Bio Expert, Physician

I’m a bioinformatician that collaborates with a pediatric cancer lab that uses D. rerio (zebrafish) as their model system. We typically focus on one diagnosis (diagnosis Z) when looking at human cancer data. For our next publication, we want to see if the pathway we study is overexpressed in pediatric and adult cancers by doing a pancancer analysis.

Also, the recent literature suggests this pathway is altered in a (previously thought to be unrelated) monogenic disorder, disorder A. People that study disorder A have developed a small molecule that shows promising results in D. melanogaster (fly) models and these experiments include transcriptomic data. Does a comparison of publicly available (human and fly) disorder A data to our well characterized diagnosis Z data suggest my collaborators should repeat these experiments in their zebrafish model?

System Actions

User must be able to retrieve all human cancer related datasets.
User must be able to retrieve a mapping of genes between multiple organisms for a particular set of samples. Also, maybe a list of genes which are common. (Super helpful for Bio Experts and Physicians)

Seven

Applicable to: Data Whiz

I’m a bioinformatician that has recently finished up a postdoc and is trying to get out a final publication. I used two publicly available cohorts as validation in my manuscript. One of the reviewers wants me to analyze an additional four data sets that I was unaware of upon submission. Unfortunately, since I’ve left my institution, I no longer have access to the computing resources I need for reprocessing the raw data. I would like all six cohorts to be processed the same way to feel confident in my results and some of the necessary details are missing from the accessions in ArrayExpress.

System Actions

User must be able to search with a list of experiment accession numbers, either from GEO or SRA or ArrayExpress.
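
A search box that accepts mixed accessions would first need to split the input and guess which repository each accession belongs to. The sketch below assumes a comma-separated string and a deliberately small set of prefix patterns (GEO series, SRA/ENA/DDBJ study, ArrayExpress experiment); real repositories use more accession types than these, and the example accessions are illustrative only.

```python
import re

# Assumed accession patterns; real repositories have more prefixes than these.
PATTERNS = {
    "GEO": re.compile(r"^GSE\d+$"),
    "SRA": re.compile(r"^[SED]RP\d+$"),
    "ArrayExpress": re.compile(r"^E-[A-Z]{4}-\d+$"),
}


def parse_accessions(text):
    """Split a comma-separated string and label each accession by source."""
    labeled = []
    for token in text.split(","):
        accession = token.strip().upper()
        if not accession:
            continue
        source = next(
            (name for name, pattern in PATTERNS.items() if pattern.match(accession)),
            "unknown",
        )
        labeled.append((accession, source))
    return labeled


result = parse_accessions("GSE12345, SRP000001, E-MTAB-1234, foo")
```

Unrecognized tokens are kept and labeled "unknown" so the interface can report them back to the user instead of silently dropping them.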

Eight

Applicable to: Data Whiz, Bio Expert, Physician

I'm a computational biologist and the human disease I work on, disease A, could use some better mouse models, particularly of less common but severe manifestations of disease A like skin involvement. I have a set of related conditions I've compiled with the help of some physician scientists. I want to take an unsupervised approach to identify expression patterns that are differentially expressed in disease A and I want to use a uniformly processed compendium of human data from my curated set of conditions. Once I find the patterns that stratify patients based on the presence or absence of skin manifestations, I want to find out if there are mouse models with similar patterns in skin that can help my field. I'll need similarly processed mouse data from skin samples to do so.

System Actions

Search by organism and filter by disease, tissue.
Provide a gene mapping between organisms.
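
A gene mapping between organisms could be as simple as a lookup table plus a report of what did not map. The sketch below assumes a one-to-one human-to-mouse table held in a dictionary; real ortholog resources include one-to-many relationships that this does not handle, and the function names are hypothetical.

```python
def map_orthologs(genes, ortholog_table):
    """Split genes into those with a known ortholog and those without."""
    mapped = {g: ortholog_table[g] for g in genes if g in ortholog_table}
    unmapped = [g for g in genes if g not in ortholog_table]
    return mapped, unmapped


def common_genes(mapped, target_genes):
    """Source genes whose ortholog is measured in the target dataset."""
    targets = set(target_genes)
    return sorted(g for g, ortholog in mapped.items() if ortholog in targets)


# Tiny illustrative human -> mouse table (assumed one-to-one).
human_to_mouse = {"TP53": "Trp53", "KRAS": "Kras", "MYC": "Myc"}
mapped, unmapped = map_orthologs(["TP53", "KRAS", "FAKE1"], human_to_mouse)
shared = common_genes(mapped, ["Trp53", "Myc"])
```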

Nine

Applicable to: Data Whiz, Bio Expert, Physician

I'm a bioinformatician that works on pediatric cancer. I'm interested in immune infiltrate in tumors and in methods that estimate this from gene expression data like TIMER, xCell, and CIBERSORT. It occurs to me that since there are age-related changes in peripheral blood gene expression and these methods were probably generally developed using gene expression from adults, I might want to make some tweaks when applying them to the pediatric cancer domain or at least be aware of any limitations.

I have some ideas for exploratory analyses to investigate how age might affect these methods:

I want age adjusted expression data from sorted/purified immune cell subsets to redefine the reference data used for some of these methods.
I want to use immune cell estimation methods on blood data where counts and age are known and see if there are significant differences between "old" and "young" samples.
I want (adult and pediatric) pancancer gene expression data following age adjustment (via regression) to use as input for immune infiltrate estimation methods.

System Actions

User must have access to age distribution information for a dataset.
User must be able to download immune cell estimates.
User must be able to download adjusted gene expression data.
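
"Age adjustment (via regression)" can be read in several ways. One simple interpretation, sketched below, fits an ordinary least-squares line of expression on age for a single gene and keeps the residuals. This is an assumption about what the adjustment would look like, not a description of a planned Data Refinery feature, and it ignores other covariates.

```python
def age_adjust(expression, ages):
    """Residuals of one gene's expression after a linear fit on age."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_expr = sum(expression) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    if sxx == 0:
        # All ages identical: nothing to regress on, just center the values.
        return [e - mean_expr for e in expression]
    sxy = sum((a - mean_age) * (e - mean_expr) for a, e in zip(ages, expression))
    slope = sxy / sxx
    intercept = mean_expr - slope * mean_age
    return [e - (intercept + slope * a) for a, e in zip(ages, expression)]


# Made-up values for three samples.
residuals = age_adjust([1.0, 3.0, 2.0], [10, 20, 30])
```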

Ten

I'm an ML researcher and I want to train a model using all mouse data. I know this method generally works best with z-scored data.

System Actions

Filter by organism.
User must be able to choose transformation to be applied to dataset before download.
User must be able to download a matrix of gene expression values.
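
As a concrete example of a transformation chosen before download, z-scoring could look like the sketch below. It assumes the matrix is a mapping from gene to a list of per-sample values and that z-scores are computed per gene across samples; whether to standardize per gene or per sample is a design choice this issue does not settle.

```python
import statistics


def zscore_rows(matrix):
    """Z-score each gene (row) across samples; constant rows become all zeros."""
    transformed = {}
    for gene, values in matrix.items():
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        if sd == 0:
            transformed[gene] = [0.0 for _ in values]
        else:
            transformed[gene] = [(v - mean) / sd for v in values]
    return transformed


# Made-up expression values: gene -> values for three samples.
matrix = {"geneA": [1.0, 2.0, 3.0], "geneB": [5.0, 5.0, 5.0]}
z = zscore_rows(matrix)
```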

Eleven

Applicable to: Data Whiz, Bio Expert, Physician

I'm a bioinformatician that wants to predict lab- or institution-specific effects in C. albicans gene expression data. I need to do some preliminary analyses. I attempt to remove nuisance variables (like strain or platform-specific effects) and then cluster samples. I want to know if the resulting clusters are enriched for different institutions. I need the submitter supplied organization name and associated email address to do this experiment.

System Actions

User must be able to download a matrix of gene expression values for an organism.
User must be able to easily extract the mandatory GEO fields through the API and GUI.

Twelve

Applicable to: Data Whiz

I'm a bioinformatician that works in a lab with lots of experience with patient derived xenografts (PDX) for multiple types of cancer. Our lab knows that PDX models can be of varying quality and we've developed a classifier that predicts PDX "quality" based on gene expression data (e.g., +/- a necrosis signature) internally. We want to apply this classifier to publicly available data to determine whether or not data is of suitable quality to be used as a validation set, but first we need to identify all PDX data sets from the microarray platform that we use.

System Actions

Filter by technique (PDX, gene knockdown).
User must be able to download a matrix of gene expression.

Thirteen

Applicable to: Data Whiz

I'm a computational biologist working on a project that looks at T and B cell receptor (TCR and BCR) repertoire properties using bulk (not targeted) RNA-seq data. I have some set of gold standard samples where I've done TCR and BCR sequencing and bulk RNA-seq. For some samples in my gold standard, I find that the clonal diversity results from the bulk RNA-seq pipeline are not consistent with the TCR sequencing results. This suggests that there is something about these bulk RNA-seq samples that makes them unsuitable for use with this TCR immunosequencing pipeline.

To my surprise, "suitability" is highly, positively correlated with the expression level of 9 genes. I want to quickly identify samples I can use with my pipeline. I've checked using some publicly available samples processed with RSEM, but I want to know if this observation is robust across processing methods.

System Actions

Search by gene list, filter by processing steps

Fourteen

Applicable to: Data Whiz

I'm an ML researcher and I want to know if building deeper models can tell me about tissue specificity and the interaction between transcription factors in human expression data. I need my input values to be between zero and one and tissue labels for some portion of samples.

System Actions

Users must be able to transform dataset before download.
Also need metadata to be available alongside samples.
Need to be able to download gene expression matrix.
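
Getting input values between zero and one is commonly done with min-max scaling. The sketch below applies it to one gene's values across samples; that per-gene choice and the function name are assumptions made for illustration.

```python
def zero_one_scale(values):
    """Min-max scale a list of values to [0, 1]; constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = zero_one_scale([2.0, 4.0, 6.0])
```

Min-max scaling is sensitive to outliers, so a real transformation option might clip extreme values first.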

Fifteen

Applicable to: Data Whiz

I'm an ML researcher that has built a classifier that can predict the presence or absence of a particular mutation in a tumor using gene expression data as input, dropping all near zero variance genes. Now I want to identify cell lines that have similar mutation profiles. Once I find these, I’d like to identify collaborators who have uploaded samples to GEO with such profiles.

System Actions

Users must be able to extract uploader’s contact information via API.

Sixteen

Applicable to: Data Whiz

I'm a bioinformatician that would like to predict response to rituximab regardless of diagnosis or tissue assayed. I need all human gene expression data that includes rituximab treatment and contains paired samples for patients (pre- and post-treatment).

System Actions

Users must be able to sort and filter samples within datasets to help them explore the data.

Seventeen

Applicable to: Data Whiz

I'm an ML researcher that works on methods for time series data. I want to know if I can automatically identify time series gene expression data from large compendia. (Inspired by some early Greene Lab projects :) ) I have a handful of data sets that I know are time series as a starting point, but not enough to split data into training or testing.

System Actions

Search terms “time-series”.

Eighteen

Applicable to: Data Whiz

I'm an ML researcher that tends to use methods that are quite sensitive to duplicated data. I want to automatically detect duplicate samples in transcriptomic data without using any additional information (e.g., cancer type). I realize that sometimes what I consider to be duplicates (e.g., samples from the same individual, same tissue) might sometimes be run on different platforms or technologies. I want to find examples of GEO SuperSeries from diverse conditions where the same sample has been run on multiple microarray platforms to craft my duplicate detection method. It will save me time if the SuperSeries are uniformly processed in some way.

System Actions

User must be able to filter search results by whether they are superseries or not.
Search results must indicate if a series is part of a super series. If all the series in a superseries meet the search criteria, only display the superseries.
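
The display rule above (show only the SuperSeries when all of its member series match) can be sketched as a small set operation. The accessions and the function name are hypothetical, and the sketch assumes the SuperSeries membership is already known.

```python
def collapse_superseries(matching_series, superseries_members):
    """Replace series with their SuperSeries when every member matched."""
    matching = set(matching_series)
    displayed = set(matching)
    for superseries, members in superseries_members.items():
        members = set(members)
        if members and members <= matching:
            displayed -= members
            displayed.add(superseries)
    return sorted(displayed)


# Hypothetical search hits and SuperSeries membership.
hits = ["GSE1", "GSE2", "GSE3"]
membership = {"GSE10": ["GSE1", "GSE2"], "GSE20": ["GSE3", "GSE4"]}
shown = collapse_superseries(hits, membership)
```

GSE3 stays listed on its own because its sibling GSE4 did not match, so its SuperSeries GSE20 is not shown.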

Nineteen

Applicable to: Data Whiz

I'm a computational biologist that has built a model that accurately predicts the proportion of cells in each phase of the cell cycle using gold standard data (e.g., flow cytometry and RNA-seq data) my lab has generated. Our lab studies angiogenesis, so I want to apply my model to publicly available VEGF time course data. The only data that I can find is microarray data from multiple platforms and I want all of the data, including the RNA-seq data I've generated myself, to be more comparable.

System Actions

Search terms “VEGF”

Twenty

Applicable to: Data Whiz

I'm a computational biologist that works on P. aeruginosa and I've found a large data set that appears to be an experiment using the antibiotics I'm interested in on GEO. This data set doesn't have a publication associated with it and there's no strain information provided, but this data set could be an important validation set for my work. If I had a processed P. aeruginosa compendium where most samples had strain information, I could cluster the samples from this unlabeled experiment or build a classifier to predict strain.

System Actions

Search by organism.

Twenty-one

Applicable to: Data Whiz, Bio Expert, Physician

I am a computational biologist. I found a processed dataset which is perfect for my downstream analysis. Before I download it, I would like to gain a better understanding of the quality of the samples.

System Actions

User must have access to quality reports from processing. They should be able to view a high-level quality report easily (without having to download the data).
User must have access to processing steps for the dataset.
They should also have access to the detailed quality report. This could be in the form of a file that is part of the data download zip.

Twenty-two

Applicable to: Data Whiz

I am an ML researcher interested in building a new method using RNA-seq data. In order to construct my training and test sets, I only want to include samples of the highest quality (>80% mapping rate with Salmon, low % intergenic sequences). I would only like to download samples that meet my quality criteria.

System Actions

User must be able to filter search results based on quality control fields like mapping rate and the percentage of reads that deviate from the Salmon-inferred library type.
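
A QC filter of this kind could be expressed as a predicate over per-sample QC fields. In the sketch below the field names (`mapping_rate`, `intergenic_fraction`) and the 10% intergenic threshold are assumptions; only the >80% mapping rate comes from the scenario.

```python
def passes_qc(sample, min_mapping_rate=0.80, max_intergenic=0.10):
    """True if a sample has both QC fields and meets both thresholds."""
    mapping_rate = sample.get("mapping_rate")
    intergenic = sample.get("intergenic_fraction")
    if mapping_rate is None or intergenic is None:
        return False  # treat missing QC values as a failure
    return mapping_rate > min_mapping_rate and intergenic <= max_intergenic


# Made-up QC records; s3 has no intergenic value reported.
samples = [
    {"id": "s1", "mapping_rate": 0.92, "intergenic_fraction": 0.03},
    {"id": "s2", "mapping_rate": 0.75, "intergenic_fraction": 0.02},
    {"id": "s3", "mapping_rate": 0.88},
]
selected = [s["id"] for s in samples if passes_qc(s)]
```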

Twenty-three

Applicable to: Physician

I am a physician that has a sample from a patient with an unknown condition. I would like to identify samples that are most similar to my one sample. My sample has been normalized and processed by a collaborator, but I still have access to the raw files. I have a table of gene expression values that are mapped to HGNC Gene Symbols. I can use this normalized data to get an idea of what samples are similar, but it would be best if I had the ability to upload my raw files and have them reprocessed in the same manner as the rest of the compendium.

System Actions

User must be able to use different gene identifiers and still be able to search through the data refinery, i.e., the system should be able to map different gene identifiers to ENSG IDs.
Only compare to human samples
Return list of similar samples
If multiple (compression) models (model mill) exist for the human compendium, let me pick which model to use with a tooltip with some guidance; also a reasonable default exists
Ranked list of samples (by some quantitative measure depending on model) that can be expanded to reveal abstract, title, etc. of experiment that sample originated from.
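
As a stand-in for the model-based comparison described above, the ranked list could be produced with a plain Pearson correlation over the genes shared between the uploaded sample and each compendium sample. This is a deliberately simple substitute for a model mill model, and all names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_similar(query, compendium, top_n=5):
    """Rank compendium samples by correlation with the query over shared genes."""
    ranked = []
    for sample_id, profile in compendium.items():
        shared = sorted(set(query) & set(profile))
        if len(shared) < 2:
            continue  # not enough overlap to compute a correlation
        score = pearson([query[g] for g in shared], [profile[g] for g in shared])
        ranked.append((sample_id, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Made-up profiles: gene -> expression value.
query = {"A": 1.0, "B": 2.0, "C": 3.0}
compendium = {
    "s1": {"A": 2.0, "B": 4.0, "C": 6.0},
    "s2": {"A": 3.0, "B": 2.0, "C": 1.0},
}
ranked = rank_similar(query, compendium)
```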

Twenty-four

Applicable to: Data Whiz, Bio Expert

I’m a bench biologist and I want to do a survival analysis in publicly available data from the cancer I study, cancer X. Specifically, I want to stratify patients based on their expression levels of all genes in a pathway, rather than the expression of a single gene. I want to use a curated pathway from a widely used, openly licensed source, but I also have a custom gene set based on my review of the literature.

System Actions

Search by disease, filter by whether dataset contains survival information
User must be able to use openly licensed genesets or their own geneset(s) for analysis.
User must be able to choose method for pathway analysis. The system should be able to validate the provided geneset and dataset and pre-determine which methods are appropriate.
Provide options for stratification (i.e., by mean or median) along with a recommended option.
Allow users to download visualizations as .svg, .png, or .jpg.
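
The stratification options above could work roughly as in this sketch, which summarizes a pathway as the mean expression of its measured genes and splits patients at the median or mean of that score. The mean-of-genes summary is an assumption chosen for brevity, not one of the pathway analysis methods the system would necessarily offer, and the survival analysis itself is not shown.

```python
import statistics


def pathway_score(expression, pathway_genes):
    """Mean expression of the pathway genes that were measured in a sample."""
    values = [expression[g] for g in pathway_genes if g in expression]
    return statistics.fmean(values) if values else None


def stratify(samples, pathway_genes, method="median"):
    """Split samples into 'high' and 'low' groups around the chosen cutoff."""
    scores = {}
    for sample_id, expression in samples.items():
        score = pathway_score(expression, pathway_genes)
        if score is not None:
            scores[sample_id] = score
    groups = {"high": [], "low": []}
    if not scores:
        return groups
    values = list(scores.values())
    cutoff = statistics.median(values) if method == "median" else statistics.fmean(values)
    for sample_id, score in scores.items():
        groups["high" if score > cutoff else "low"].append(sample_id)
    return groups


# Made-up patients: sample -> (gene -> expression).
patients = {
    "p1": {"G1": 1.0, "G2": 3.0},
    "p2": {"G1": 5.0, "G2": 7.0},
    "p3": {"G1": 4.0, "G2": 4.0},
}
groups = stratify(patients, ["G1", "G2"], method="median")
```

Samples that fall exactly on the cutoff go to the "low" group here; that tie rule is arbitrary and would need to be documented for users.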

Twenty-five

Applicable to: Data Whiz, Bio Expert

I’m a bench biologist that works with zebrafish and our lab did an RNA-seq experiment comparing our mutant to wild-type. I have a set of 50 genes that are upregulated in my mutant and I want to see how those genes are coexpressed across different zebrafish experiments.

Are these 50 genes captured by one of the models in your model mill?

Are there gene sets from each model (e.g., ADAGE) that significantly overlap with the 50 genes from my experiment? -> Yes, my gene set is enriched for Node 42 genes. What datasets or samples have “significantly” high or low Node 42 levels (compared to background)?

Twenty-six

Applicable to: Data Whiz, Bio Expert

I’m a bench biologist that works with zebrafish and our lab did an RNA-seq experiment comparing our mutant to wild-type. I have a set of 50 genes that are upregulated in my mutant and I want to see how those genes are coexpressed across different zebrafish experiments.

Calculate some kind of (summary) score (ssGSEA?) based on my 50 genes for each sample in the zebrafish compendium.

Return which samples have highest or lowest score.

System Actions

User must be able to choose a method to ‘Score Samples’ and then upload their geneset/gene.
Alternatively (possibly a superior option based on scoring method), return datasets that have fold-changes/variance within the set above some threshold. These datasets are likely to be of interest for follow-up.
Users must be able to choose datasets which they would like to score. Choose method to score samples and set threshold and upload geneset.

Twenty-seven

Applicable to: Data Whiz, Bio Expert

I’m a bench scientist and I’m looking to do some research with Cancer X. I’m not sure which genes are important for Cancer X. I would like to browse through all available datasets of Cancer X to determine a few important genes and effects of different sample characteristics on them at a superficial level. My goal would be to develop a hypothesis for further exploration.

System Actions

Search by disease, choose relevant search results, and get summary stats.
User must be able to group samples either across datasets or within a dataset and group within the subgroups to identify genes/sample characteristics of interest.
Alternatively, with normalized metadata, some of the grouping could be offered as a filter.

Twenty-eight

Applicable to: Data Whiz, Bio Expert, Physician

I am a Principal Investigator that runs a wet lab. I have a grant deadline coming up and my graduate student has given me a list of 10 genes that she has prioritized using a genome-wide screen. We study Cancer X and I am wondering how these 10 genes look or behave across all samples of Cancer X. I would like to generate a heatmap across all samples of Cancer X.

System Actions

Search for Cancer X and choose relevant search results.
Users must be able to choose different types of analysis that they can do across datasets. The user would choose a heatmap analysis.
User must be able to upload/enter genes of interest and generate a heatmap.
Download visualization as a .svg, .png.

Twenty-nine

Applicable to: Data Whiz, Bio Expert, Physician

Generate DE analysis -> heatmap example

System Actions

User must be able to create sample groups across datasets and choose differential expression analysis.
Users must be able to choose genes from the results of the differential expression analysis for further analysis. In this case, heatmap analysis.
Download visualization as a .svg, .png.

Thirty

Applicable to: Data Whiz, Bio Expert, Physician

I am a bench biologist that studies disease X. I am interested in examining a publicly available data set that is made up of sorted peripheral blood cells (e.g., T cells, monocytes, B cells, and neutrophils) from patients with disease X and healthy controls. I am concerned that there might be batch effects in this publicly available data set and would like to perform Principal Component Analysis and visualize the results. This will allow me to inspect the data for batch effects.

System Actions

Users should be able to choose PCA to be applied to a dataset or across datasets.
The users should have the following affordances while exploring the results of the analysis:
- Assign PCs to axes (can range between 1-3)
- Be able to visualize each PC as a box plot
- Be able to group metadata and use those groups to assign colors.
In case of a box plot, assign boxes to the metadata groups.
Users must be able to download the results of PCA and the visualizations they have generated.

Thirty-one

Applicable to: Data Whiz, Bio Expert, Physician

I am a bench biologist and I want to know if there are clusters of samples in publicly available data from my disease of interest. Specifically, I want to know if there is evidence for disease subtypes. I would like to perform consensus clustering.

System Actions

Users must be able to choose to do consensus clustering across datasets or for one dataset.
Users must be able to download visualizations.

Thirty-two

Applicable to: Data Whiz, Bio Expert, Physician

Hierarchical clustering example (with heatmap) -- select dataset(s), filter genes based on some criteria (fold-change, % missing values, variance), and then hierarchically cluster genes and/or samples

System Actions

Users must be able to filter genes across all samples in a dataset or across datasets based on parameters like fold-change, % missing values, and variance.
Users must be able to generate heatmaps for a dataset or across datasets.
They should be able to choose the type of clustering method (e.g., hierarchical clustering).
Should have options for hierarchical clustering parameters (e.g., linkage, distance) with some reasonable default
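
The gene filtering step that precedes clustering could be sketched as follows, assuming missing values are represented as `None` and that genes are kept when they have few missing values and enough variance. The thresholds and the function name are placeholders, and the clustering step itself is not shown.

```python
import statistics


def filter_genes(matrix, min_variance=0.5, max_missing=0.2):
    """Keep genes with few missing values and at least min_variance."""
    kept = []
    for gene, values in matrix.items():
        if not values:
            continue
        observed = [v for v in values if v is not None]
        missing_fraction = 1 - len(observed) / len(values)
        if missing_fraction > max_missing or len(observed) < 2:
            continue
        if statistics.variance(observed) >= min_variance:
            kept.append(gene)
    return kept


# Made-up matrix: gene -> per-sample values, None marks a missing value.
matrix = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [1.0, 1.0, 1.0, 1.0, 1.0],
    "G3": [1.0, None, None, 4.0, 5.0],
}
kept_genes = filter_genes(matrix)
```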

Thirty-three

I am a bench biologist that works with A. thaliana and I’m interested in micro-nutrient stress. I’ve been examining my genes of interest in the eFP browser. I’ve moved on to comparing some of the pathways our lab studies in Arabidopsis with O. sativa using RNA-seq data I’ve downloaded from the Data Refinery and I’ve found some interesting results. I’m concerned about the quality of these data, as I know PCR amplification can lead to some bias in RNA-seq. I would like to examine the level of sequence duplication in each of the samples I’ve selected for my analyses.

System Actions

User must be able to get a summary of the QC report on a per sample basis when looking at a dataset.
Users must be able to download an in-depth QC report while inspecting the dataset.

Thirty-four

I am a bench biologist interested in classifying healthy control and tumor-adjacent colon tissue. My bioinformatics collaborator suggests I use a linear SVM to do my predictions, but I do not have any programming experience. I’ve identified two colorectal cancer (CRC) datasets in the Data Refinery that meet my requirements (they are from different platforms) and I would like to use the SVM module on GenePattern because it accepts both a training and test data set. The CRC data sets need to be subset for use with the SVM module because they both contain tumor, tumor-adjacent normal, and healthy control tissue and I am only interested in the comparison between healthy and tumor-adjacent samples. The SVM module requires:

A GCT file for each of the 2 sets. It must contain the same genes for both data sets. A CLS file for both data sets.

System Actions

Users must be able to have the option of downloading additional files along with the data like .cls, full QC reports.
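
Producing the GCT and CLS files for this scenario could look like the sketch below. The layouts follow my understanding of GenePattern's GCT (version 1.2) and CLS formats and should be checked against the GenePattern file format documentation. The function names and example values are hypothetical, and class names are assumed to contain no spaces.

```python
def to_gct(expression, sample_ids):
    """Build GCT 1.2 text; expression maps gene -> values ordered like sample_ids."""
    lines = ["#1.2", f"{len(expression)}\t{len(sample_ids)}"]
    lines.append("\t".join(["Name", "Description", *sample_ids]))
    for gene, values in expression.items():
        # The gene name is reused as the description for simplicity.
        lines.append("\t".join([gene, gene, *(str(v) for v in values)]))
    return "\n".join(lines) + "\n"


def to_cls(labels):
    """Build CLS text; labels has one class name per sample, in GCT column order."""
    classes = list(dict.fromkeys(labels))  # unique names, first-seen order
    index = {name: i for i, name in enumerate(classes)}
    header = f"{len(labels)} {len(classes)} 1"
    names = "# " + " ".join(classes)
    assignments = " ".join(str(index[label]) for label in labels)
    return "\n".join([header, names, assignments]) + "\n"


gct_text = to_gct({"G1": [1.0, 2.0, 3.0]}, ["s1", "s2", "s3"])
cls_text = to_cls(["healthy", "tumor_adjacent", "healthy"])
```

Tumor samples would need to be dropped and both data sets restricted to their shared genes before calling these functions, since the SVM module expects the same genes in the training and test files.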

Thirty-five

I am a medulloblastoma researcher that is interested in doing a meta-analysis of all available medulloblastoma data. I would like to run CoGAPS and relate the patterns I find back to histology. I’ve explored the data somewhat using the medulloblastoma data scope on R2 and I’ve noticed the data come from different platforms. It would significantly speed up my research if both the gene expression data and the histology labels were normalized in some way.

System Actions

Search with accession numbers.
Download the data.

Thirty-six

I’m a machine learning researcher with very little biological expertise and no knowledge of different RNA assay technologies. I need a large compendium of gene expression data that has been cleaned and is reasonably comparable.

Requires cross-platform normalization

dvenprasad commented 6 years ago

Version 1 (API ONLY)

User must be able to search by:

List of accession numbers (Use Cases: 1, 35)
Gene (Use Cases: )

Users must be able to filter results by:

Data Type (Use Cases: 3, 7)
Organism (Use Cases: 3)
Gene (Use Cases: )

Users must be able to access:

Processing steps (Use Cases: 15, 21)
Mandatory GEO fields (Use Cases: 11, 15)

Users must be able to download:

Data (Use Cases: 1, 2, 3, 4, 7, 9, 10, 11, 12, 13, 14, 17, 18, 19, 20, 22, 35)
Full QC report (Use Cases: 21, 33, 34)
Raw data (Use Cases: 5)

Version 1.1 (API + GUI)

Users must be able to search by:

Gene List (Use Cases: 13)
Free text search (Use Cases: 1, 17, 19)
Organism (Use Cases: 6, 8, 10, 20)
Alternate Gene Identifiers (Use Cases: 23)
PMID (Use Cases: 37)

Users must be able to filter results by:

Platform(?) (Use Cases: 3)
Is Associated publication (Use Cases: 1)
Is superseries? (Use Cases: 18)
Submitter's Institution (Use Cases: 3)

Users must be able to:

Bulk download search results (Use Cases: 2)

Version X

Users must be able to search by:

Disease (Use Cases: 4, 27, 28)

Users must be able to filter results by:

List of Genes (Use Cases: 1)
Disease (Use Cases: 3, 8)
Tissue (Use Cases: 3, 8)
Treatment (Use Cases: 4)
Technique (Use Cases: 12)
is Time series? (Use Cases: )
Processing steps (Use Cases: 13)
QC Fields (Use Cases: 22)
Survival Information (Use Cases: 24)

Users must be able to access:

Distribution of submitter supplied information (Use Cases: 9)
Overview of QC report for dataset (Use Cases: 21, 33)
Summary of QC per sample (Use Cases: 33)

Users must be able to download:

Visualizations (Use Cases: 24, 28, 29, 30, 31, 32)
Additional files for further analysis (Use Cases: 34)
Transformed Data (Use Cases: 5, 6, 8, 10, 14, 23, 30)

Users must be able to:

Ortholog mapping (Use Cases: 6, 8)
Retrieve summary stats across datasets (Use Cases: )
Apply transformations (Use Cases: 5, 10, 14)
Sort and filter samples within a dataset (Use Cases: 16)
Upload sample and find similar samples (Use Cases: 23)
Upload own genesets or use openly licensed sets (Use Cases: 24, 26, 28)
Group samples within datasets (Use Cases: 27)
Group samples across datasets (Use Cases: 27, 29)

Users must be able to do following analyses:

Pathway analysis (Use Cases: 24)
Sample grouping (Use Cases: )
Score Samples (Use Cases: 26)
Heatmap analysis (Use Cases: 28, 29, 32)
Differential Expression (Use Cases: 29)
PCA (Use Cases: 30)
Consensus Clustering (Use Cases: 31)
Hierarchical Clustering (Use Cases: 32)

cgreene commented 6 years ago

Thirty-seven

I am a researcher studying the role of high fat diets. I just read a manuscript ( Kwon EY, Shin SK, Cho YY, Jung UJ et al. Time-course microarrays reveal early activation of the immune transcriptome and adipokine dysregulation leads to fibrosis in visceral adipose depots during diet-induced obesity. BMC Genomics 2012 Sep 4;13:450. PMID: 22947075 ). I would like to download all of the data associated with the manuscript to perform my own secondary analysis to confirm the researchers' findings in data-refinery processed data.

System Actions

Search with publication information.
Download the data.

jaclyn-taroni commented 6 years ago

Thirty-eight

I am a researcher interested in using the data refinery for multiple, continuously updated projects. I would like a history of the data I've downloaded from the web version and the processor versions used to process them.

System actions

Optional user accounts (@csgreene mentioned ORCID as a possibility)
Keep track of user activity (e.g., download history)

dvenprasad commented 6 years ago

Thirty-nine

I am a bench scientist that is collaborating with a team of physicians working on disease X. My collaborators have cautioned me that not all publicly available disease X gene expression data sets they have come across are appropriate for our research question. I would like to send them a list of available refinery-processed data sets for them to look over and approve.

System actions

Users must be able to generate a link to share the list of data sets added to their download queue.

AlexsLemonade / refinebio