Open Miserlou opened 6 years ago
Perspective: bioinformaticist (ML-focused)
Situation: I've (@gwaygenomics has) built a classifier for Ras activity that uses gene expression. I'd like to apply that to all samples in SRA/GEO/ArrayExpress/ENA to identify samples with aberrant Ras signaling.
Here are examples that @dvenprasad and I worked on together. They extend beyond the functionality of the initial version of the Data Refinery and possibly into future work applying different models across Data Refinery data (e.g. a "model mill").
I think this should be helpful if part of the goal with this issue is to future-proof design choices & this level of detail will be good for worked examples down the road. We'll follow up with tickets based on version scope.
Applicable to: Data Whiz
I’m an ML researcher that specializes in natural language processing in biomedical literature. I’ve analyzed all papers in PubMed that have a Gene Expression Omnibus (GEO) data set connected to them. I've analyzed things like the co-occurrence of disease labels and pathway names (e.g., KEGG). I’m wondering if I can predict the “behavior” of the pathways in the GEO dataset from the text. For example, can I tell if a pathway will be downregulated in disease samples compared to controls based on the text of a paper and what might the text tell me about the magnitude of that change?
Consequently, I have a list of data sets that I know contain healthy controls for comparison based on my analysis of the submitter-supplied GEO abstract/description text.
I want to use multiple kinds of tools for pathway analysis and I have come up with a list of genes that I want to study based on the pathways I’m interested in.
I only want data sets that fit the following criteria: This data set "covers" 90% of the genes in my list I want to study. All data sets should have a publication associated with them.
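The two criteria above could be checked with something like the following. This is a minimal sketch, assuming each data set is available as a record with a gene list and an optional publication identifier; the field names (`accession`, `pubmed_id`, `genes`), the placeholder accessions, and the reading of "covers 90%" as "at least 90%" are all assumptions made for illustration.

```python
def gene_coverage(dataset_genes, genes_of_interest):
    """Fraction of genes_of_interest that are measured in dataset_genes."""
    wanted = set(genes_of_interest)
    if not wanted:
        return 0.0
    return len(wanted & set(dataset_genes)) / len(wanted)


def select_datasets(datasets, genes_of_interest, min_coverage=0.9):
    """Keep accessions that meet the coverage threshold and have a publication."""
    selected = []
    for ds in datasets:
        covered = gene_coverage(ds["genes"], genes_of_interest)
        if covered >= min_coverage and ds.get("pubmed_id"):
            selected.append(ds["accession"])
    return selected


# Hypothetical records: accessions, identifiers, and gene names are placeholders.
genes_of_interest = ["G%d" % i for i in range(10)]
datasets = [
    {"accession": "DATASET_A", "pubmed_id": "PMID_PLACEHOLDER", "genes": genes_of_interest},
    {"accession": "DATASET_B", "pubmed_id": "PMID_PLACEHOLDER", "genes": genes_of_interest[:9]},
    {"accession": "DATASET_C", "pubmed_id": None, "genes": genes_of_interest},
    {"accession": "DATASET_D", "pubmed_id": "PMID_PLACEHOLDER", "genes": genes_of_interest[:5]},
]
selected = select_datasets(datasets, genes_of_interest)
```

Here `DATASET_C` is dropped for lacking a publication and `DATASET_D` for covering only half of the list.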
Applicable to: Data Whiz
I’m a computational biologist that wants to predict cell line passage number and cell line identity from gene expression data. I have generated my own gold standards from the following breast cancer cell lines: MCF-7, SkBr3, BT-483 and BT-474. I only want gene expression data that are reported to be samples from these cell lines for validation. I'm particularly interested in data sets where passage number is reported in the sample metadata. I would also love to provide some summary statistics about usage of different breast cancer cell lines in publicly available transcriptomics data (e.g., 45% of samples are from MCF-7) in the introduction of my paper if possible.
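The summary statistic mentioned above (percent of samples per cell line) might be computed roughly like this. It is a sketch only: the `cell_line` metadata field is an assumption, and the name normalization is a naive guess at how submitter spellings such as "MCF7" and "MCF-7" could be reconciled.

```python
from collections import Counter


def normalize_cell_line(name):
    """Collapse case, hyphens, and spaces so 'MCF7', 'mcf-7', and 'MCF 7' match."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


def cell_line_usage(samples, field="cell_line"):
    """Return {normalized cell line: percent of labeled samples}.

    Samples with no value in the metadata field are skipped, so the
    percentages are relative to labeled samples only.
    """
    counts = Counter(
        normalize_cell_line(s[field]) for s in samples if s.get(field)
    )
    total = sum(counts.values())
    return {line: 100.0 * n / total for line, n in counts.items()}


# Hypothetical sample metadata records.
samples = [
    {"cell_line": "MCF-7"},
    {"cell_line": "mcf7"},
    {"cell_line": "BT-474"},
    {"cell_line": "SkBr3"},
    {},  # no cell line reported
]
usage = cell_line_usage(samples)
```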
Applicable to: Data Whiz
I’m a developmental biologist but I’m comfortable working with single cell RNA-seq (scRNA-seq), bulk RNA-seq and microarray data. I work with C. elegans. I’ve done a single cell RNA-seq experiment in a particular line/mutant that has an expansion in one cell lineage and those "extra" cells also demonstrate a change in expression patterns as compared to the wild type lineage (in adult worms).
I want to use bulk (whole worm) RNA-seq data from different labs that sample different stages of development for validation and exploratory analyses. I know that this data is out there based on my review of the literature.
If I can find microarray data sets of relevant mutants (relevance is determined based on my expert knowledge), that would really help take my research to the next level (assume I have an awesome method for cell type deconvolution, but it’s only appropriate for use with data from one microarray platform).
Also, I have a grant submission due soon and to demonstrate relevance to human health with some preliminary data, I want to look at whether these changes in expression patterns are present in whole tissue gastrointestinal tract biopsies in disease.
Applicable to: Data Whiz, Bio Expert, Physician
I'm a computational biologist that's developed a model that can predict (to pick a property somewhat randomly) tumor size from transcriptomics data. I have some evidence to suggest that larger tumor size is associated with better outcomes in certain cancers (A, B, and C). I remember coming across some microarray studies in cancers A, B, and C that look specifically at treatment response. I want to apply my model to baseline samples to infer tumor size and then examine the relationship between predicted tumor size and treatment response.
Applicable to: Data Whiz
I’m a computational biologist that is somewhat comfortable with machine learning. I work in industry at a company that wants to develop anti-fibrotic agents. Lots of internal research has been done using qPCR arrays rather than genome-wide assays like RNA-seq or microarrays. I want to find all gene expression data (on any platform) that's related to the disorders I study based on a list of disease terms that was given to me when I started work on this project.
It also might make sense for me to transform the Data Refinery data to make it more comparable to my qPCR data from my company or vice versa, but I'm not sure how to approach that.
Applicable to: Data Whiz, Bio Expert, Physician
I’m a bioinformatician that collaborates with a pediatric cancer lab that uses D. rerio (zebrafish) as their model system. We typically focus on one diagnosis (diagnosis Z) when looking at human cancer data. For our next publication, we want to see if the pathway we study is overexpressed in pediatric and adult cancers by doing a pancancer analysis.
Also, the recent literature suggests this pathway is altered in a (previously thought to be unrelated) monogenic disorder, disorder A. People that study disorder A have developed a small molecule that shows promising results in D. melanogaster (fly) models and these experiments include transcriptomic data. Does a comparison of publicly available (human and fly) disorder A data to our well characterized diagnosis Z data suggest my collaborators should repeat these experiments in their zebrafish model?
Applicable to: Data Whiz
I’m a bioinformatician that has recently finished up a postdoc and is trying to get out a final publication. I used two publicly available cohorts as validation in my manuscript. One of the reviewers wants me to analyze an additional four data sets that I was unaware of upon submission. Unfortunately, since I’ve left my institution, I no longer have access to the computing resources I need for reprocessing the raw data. I would like all six cohorts to be processed the same way to feel confident in my results and some of the necessary details are missing from the accessions in ArrayExpress.
Applicable to: Data Whiz, Bio Expert, Physician
I'm a computational biologist and the human disease I work on, disease A, could use some better mouse models, particularly of less common but severe manifestations of disease A like skin involvement. I have a set of related conditions I've compiled with the help of some physician scientists. I want to take an unsupervised approach to identify expression patterns that are differentially expressed in disease A and I want to use a uniformly processed compendium of human data from my curated set of conditions. Once I find the patterns that stratify patients based on the presence or absence of skin manifestations, I want to find out if there are mouse models with similar patterns in skin that can help my field. I'll need similarly processed mouse data from skin samples to do so.
Applicable to: Data Whiz, Bio Expert, Physician
I'm a bioinformatician that works on pediatric cancer. I'm interested in immune infiltrate in tumors and in methods that estimate this from gene expression data like TIMER, xCell, and CIBERSORT. It occurs to me that since there are age-related changes in peripheral blood gene expression and these methods were probably generally developed using gene expression from adults, I might want to make some tweaks when applying them to the pediatric cancer domain or at least be aware of any limitations.
I have some ideas for exploratory analyses to investigate how age might affect these methods:
I'm an ML researcher and I want to train a model using all mouse data. I know this method generally works best with z-scored data.
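For reference, "z-scored" here is taken to mean per-gene standardization across samples. A minimal sketch under that assumption follows; the gene-to-values dictionary layout and the function name are hypothetical.

```python
from statistics import mean, pstdev


def zscore_genes(expression):
    """Z-score each gene across samples; expression maps gene -> list of values."""
    scaled = {}
    for gene, values in expression.items():
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation
        if sigma == 0:
            # A constant gene carries no signal; return zeros rather than divide by zero.
            scaled[gene] = [0.0 for _ in values]
        else:
            scaled[gene] = [(v - mu) / sigma for v in values]
    return scaled


# Toy values for illustration only.
expression = {"GeneA": [1.0, 2.0, 3.0], "GeneB": [5.0, 5.0, 5.0]}
scaled = zscore_genes(expression)
```

Whether the statistics should come from the whole compendium or from each experiment separately is a design choice this sketch does not settle.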
Applicable to: Data Whiz, Bio Expert, Physician
I'm a bioinformatician that wants to predict lab- or institution-specific effects in C. albicans gene expression data. I need to do some preliminary analyses. I attempt to remove nuisance variables (like strain or platform-specific effects) and then cluster samples. I want to know if the resulting clusters are enriched for different institutions. I need the submitter supplied organization name and associated email address to do this experiment.
Applicable to: Data Whiz
I'm a bioinformatician that works in a lab with lots of experience with patient derived xenografts (PDX) for multiple types of cancer. Our lab knows that PDX models can be of varying quality and we've developed a classifier that predicts PDX "quality" based on gene expression data (e.g., +/- a necrosis signature) internally. We want to apply this classifier to publicly available data to determine whether or not data is of suitable quality to be used as a validation set, but first we need to identify all PDX data sets from the microarray platform that we use.
Applicable to: Data Whiz
I'm a computational biologist working on a project that looks at T and B cell receptor (TCR and BCR) repertoire properties using bulk (not targeted) RNA-seq data. I have some set of gold standard samples where I've done TCR and BCR sequencing and bulk RNA-seq. For some samples in my gold standard, I find that the clonal diversity results from the bulk RNA-seq pipeline are not consistent with the TCR sequencing results. This suggests that there is something about these bulk RNA-seq samples that make them unsuitable for use with this TCR immunosequencing pipeline.
To my surprise, "suitability" is highly, positively correlated with the expression level of 9 genes. I want to quickly identify samples I can use with my pipeline. I've checked using some publicly available samples processed with RSEM, but I want to know if this observation is robust across processing methods.
Applicable to: Data Whiz
I'm an ML researcher and I want to know if building deeper models can tell me about tissue specificity and the interaction between transcription factors in human expression data. I need my input values to be between zero and one and tissue labels for some portion of samples.
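One common way to get input values between zero and one is per-gene min-max scaling. The sketch below assumes that is what is wanted here; other transformations would also satisfy the requirement, and the function name is hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale a list of values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant gene has no range to rescale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy expression values for one gene across three samples.
scaled_gene = min_max_scale([2.0, 4.0, 6.0])
```

Min-max scaling is sensitive to outliers, so clipping extreme values first may be worth considering.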
Applicable to: Data Whiz
I'm an ML researcher that has built a classifier that can predict the presence or absence of a particular mutation in a tumor using gene expression data as input, dropping all near zero variance genes. Now I want to identify cell lines that have similar mutation profiles. Once I find these, I’d like to identify collaborators who have uploaded samples to GEO with such profiles.
Applicable to: Data Whiz
I'm a bioinformatician that would like to predict response to rituximab regardless of diagnosis or tissue assayed. I need all human gene expression data that includes rituximab treatment and contains paired samples for patients (pre- and post-treatment).
Applicable to: Data Whiz
I'm an ML researcher that works on methods for time series data. I want to know if I can automatically identify time series gene expression data from large compendia. (Inspired by some early Greene Lab projects :) ) I have a handful of data sets that I know are time series as a starting point, but not enough to split data into training or testing.
Applicable to: Data Whiz
I'm an ML researcher that tends to use methods that are quite sensitive to duplicated data. I want to automatically detect duplicate samples in transcriptomic data without using any additional information (e.g., cancer type). I realize that sometimes what I consider to be duplicates (e.g., samples from the same individual, same tissue) might sometimes be run on different platforms or technologies. I want to find examples of GEO SuperSeries from diverse conditions where the same sample has been run on multiple microarray platforms to craft my duplicate detection method. It will save me time if the SuperSeries are uniformly processed in some way.
Applicable to: Data Whiz
I'm a computational biologist that has built a model that accurately predicts the proportion of cells in each phase of the cell cycle using gold standard data (e.g., flow cytometry and RNA-seq data) my lab has generated. Our lab studies angiogenesis, so I want to apply my model to publicly available VEGF time course data. The only data that I can find is microarray data from multiple platforms and I want all of the data, including the RNA-seq data I've generated myself, to be more comparable.
Applicable to: Data Whiz
I'm a computational biologist that works on P. aeruginosa and I've found a large data set that appears to be an experiment using the antibiotics I'm interested in on GEO. This data set doesn't have a publication associated with it and there's no strain information provided, but this data set could be an important validation set for my work. If I had a processed P. aeruginosa compendium where most samples had strain information, I could cluster the samples from this unlabeled experiment or build a classifier to predict strain.
Applicable to: Data Whiz, Bio Expert, Physician
I am a computational biologist. I found a processed dataset which is perfect for my downstream analysis. Before I download it, I would like to gain a better understanding of the quality of the samples.
Applicable to: Data Whiz
I am an ML researcher interested in building a new method using RNA-seq data. In order to construct my training and test sets, I only want to include samples of the highest quality (>80% mapping rate with Salmon, low % intergenic sequences). I would only like to download samples that meet my quality criteria.
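A filter along these lines could express the quality criteria. This is a sketch under assumptions: the metadata field names (`mapping_rate_pct`, `intergenic_pct`) are hypothetical, and since the scenario only says "low" for intergenic sequences, the 10% cutoff is a placeholder rather than a recommended value.

```python
def passes_quality(sample, min_mapping_rate=80.0, max_intergenic_pct=10.0):
    """True if the sample has >80% mapping rate and a low intergenic percentage.

    max_intergenic_pct is a placeholder cutoff; the scenario only says "low".
    Samples missing either metric are rejected rather than assumed to pass.
    """
    mapping = sample.get("mapping_rate_pct")
    intergenic = sample.get("intergenic_pct")
    if mapping is None or intergenic is None:
        return False
    return mapping > min_mapping_rate and intergenic <= max_intergenic_pct


# Hypothetical per-sample QC records.
qc_records = [
    {"accession": "SAMPLE_1", "mapping_rate_pct": 92.3, "intergenic_pct": 4.1},
    {"accession": "SAMPLE_2", "mapping_rate_pct": 75.0, "intergenic_pct": 3.0},
    {"accession": "SAMPLE_3", "mapping_rate_pct": 88.0, "intergenic_pct": 25.0},
    {"accession": "SAMPLE_4", "mapping_rate_pct": 95.0},  # intergenic % missing
]
kept = [r["accession"] for r in qc_records if passes_quality(r)]
```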
Applicable to: Physician
I am a physician that has a sample from a patient with an unknown condition. I would like to identify samples that are most similar to my one sample. My sample has been normalized and processed by a collaborator, but I still have access to the raw files. I have a table of gene expression values that are mapped to HGNC Gene Symbols. I can use this normalized data to get an idea of what samples are similar, but it would be best if I had the ability to upload my raw files and have them reprocessed in the same manner as the rest of the compendium.
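With only the normalized table of gene symbols in hand, "most similar" could be approximated by correlating the patient profile with each compendium sample over their shared genes. The sketch below uses Pearson correlation as one possible similarity measure; that choice, the data layout, and the function names are assumptions, and differences in processing between the query and the compendium would limit how far the ranking can be trusted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def most_similar(query, compendium, top_n=5):
    """Rank compendium samples by correlation with the query over shared genes.

    query: {gene symbol: value}; compendium: {sample id: {gene symbol: value}}.
    Samples sharing fewer than two genes with the query are skipped.
    """
    scores = []
    for sample_id, profile in compendium.items():
        shared = sorted(set(query) & set(profile))
        if len(shared) < 2:
            continue
        r = pearson([query[g] for g in shared], [profile[g] for g in shared])
        scores.append((sample_id, r))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]


# Toy values; sample ids are placeholders.
query = {"TP53": 1.0, "KRAS": 2.0, "MYC": 3.0}
compendium = {
    "S1": {"TP53": 2.0, "KRAS": 4.0, "MYC": 6.0},
    "S2": {"TP53": 3.0, "KRAS": 2.0, "MYC": 1.0},
    "S3": {"TP53": 1.0},
}
ranked = most_similar(query, compendium)
```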
Applicable to: Data Whiz, Bio Expert
I’m a bench biologist and I want to do a survival analysis in publicly available data from the cancer I study, cancer X. Specifically, I want to stratify patients based on their expression levels of all genes in a pathway, rather than the expression of a single gene. I want to use a curated pathway from a widely used, openly licensed source, but I also have a custom gene set based on my review of the literature.
Applicable to: Data Whiz, Bio Expert
I’m a bench biologist that works with zebrafish and our lab did an RNA-seq experiment comparing our mutant to wild-type. I have a set of 50 genes that are upregulated in my mutant and I want to see how those genes are coexpressed across different zebrafish experiments.
Are these 50 genes captured by one of the models in your model mill?
Are there gene sets from each model (e.g., ADAGE) that significantly overlap with the 50 genes from my experiment? -> Yes, my gene set is enriched for Node 42 genes. What datasets or samples have “significantly” high or low Node 42 levels (compared to background)?
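One way to make "significantly overlap" concrete is a one-sided hypergeometric test. The sketch below is an assumption about how the overlap could be scored, not a description of how a model mill would do it; `overlap_p_value` and the toy gene names are hypothetical.

```python
from math import comb


def overlap_p_value(my_genes, node_genes, background_genes):
    """Return (overlap size, P(overlap >= observed)) under a hypergeometric null.

    Both gene sets are restricted to the background (all genes the model saw)
    before counting.
    """
    background = set(background_genes)
    mine = set(my_genes) & background
    node = set(node_genes) & background
    k = len(mine & node)
    N, K, n = len(background), len(node), len(mine)
    # Sum the upper tail: probability of drawing k or more node genes in n draws.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)


# Toy example: 10 background genes, a 5-gene set identical to a 5-gene node.
background = ["G%d" % i for i in range(10)]
k, p = overlap_p_value(background[:5], background[:5], background)
```

With many nodes and many models, the resulting p-values would need a multiple testing correction before calling anything significant.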
Applicable to: Data Whiz, Bio Expert
I’m a bench biologist that works with zebrafish and our lab did an RNA-seq experiment comparing our mutant to wild-type. I have a set of 50 genes that are upregulated in my mutant and I want to see how those genes are coexpressed across different zebrafish experiments.
Calculate some kind of (summary) score (ssGSEA?) based on my 50 genes for each sample in the zebrafish compendium.
Return which samples have highest or lowest score.
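The two steps above can be sketched with a much simpler score than ssGSEA: the mean per-gene z-score over the gene set. This stand-in is an assumption chosen to keep the example short, and the nested-dictionary layout and function names are hypothetical.

```python
from statistics import mean, pstdev


def gene_set_scores(expression, gene_set):
    """Score each sample as the mean z-score of the gene set's genes.

    expression: {gene: {sample: value}}. Genes that are missing from the
    compendium or constant across samples are ignored.
    """
    per_sample = {}
    for gene in gene_set:
        if gene not in expression:
            continue
        values = expression[gene]
        mu = mean(list(values.values()))
        sigma = pstdev(list(values.values()))
        if sigma == 0:
            continue
        for sample, v in values.items():
            per_sample.setdefault(sample, []).append((v - mu) / sigma)
    return {sample: mean(z) for sample, z in per_sample.items()}


def rank_samples(scores, n=1):
    """Return (n highest-scoring samples, n lowest-scoring samples)."""
    ordered = sorted(scores, key=scores.get)
    return ordered[-n:][::-1], ordered[:n]


# Toy compendium with three samples; gene and sample names are placeholders.
expression = {
    "g1": {"A": 1.0, "B": 2.0, "C": 3.0},
    "g2": {"A": 2.0, "B": 4.0, "C": 6.0},
    "g3": {"A": 9.0, "B": 1.0, "C": 5.0},
}
scores = gene_set_scores(expression, ["g1", "g2", "not_measured"])
highest, lowest = rank_samples(scores)
```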
Applicable to: Data Whiz, Bio Expert
I’m a bench scientist and I’m looking to do some research with Cancer X. I’m not sure which genes are important for Cancer X. I would like to browse through all available datasets of Cancer X to determine a few important genes and effects of different sample characteristics on them at a superficial level. My goal would be to develop a hypothesis for further exploration.
Applicable to: Data Whiz, Bio Expert, Physician
I am a Principal Investigator that runs a wet lab. I have a grant deadline coming up and my graduate student has given me a list of 10 genes that she has prioritized using a genome-wide screen. We study Cancer X and I am wondering how these 10 genes look or behave across all samples of Cancer X. I would like to generate a heatmap across all samples of Cancer X.
Applicable to: Data Whiz, Bio Expert, Physician
Generate DE analysis -> heatmap example
Applicable to: Data Whiz, Bio Expert, Physician
I am a bench biologist that studies disease X. I am interested in examining a publicly available data set that is made up of sorted peripheral blood cells (e.g., T cells, monocytes, B cells, and neutrophils) from patients with disease X and healthy controls. I am concerned that there might be batch effects in this publicly available data set and would like to perform Principal Component Analysis and visualize the results. This will allow me to inspect the data for batch effects.
Applicable to: Data Whiz, Bio Expert, Physician
I am a bench biologist and I want to know if there are clusters of samples in publicly available data from my disease of interest. Specifically, I want to know if there is evidence for disease subtypes. I would like to perform consensus clustering.
Applicable to: Data Whiz, Bio Expert, Physician
Hierarchical clustering example (with heatmap) -- select dataset(s) filter genes based on some criteria (fold-change, % missing values, variance) and then hierarchically cluster genes and/or samples
I am a bench biologist that works with A. thaliana and I’m interested in micro-nutrient stress. I’ve been examining my genes of interest in the eFP browser. I’ve moved on to comparing some of the pathways our lab studies in Arabidopsis with O. sativa using RNA-seq data I’ve downloaded from the Data Refinery and I’ve found some interesting results. I’m concerned about the quality of these data, as I know PCR amplification can lead to some bias in RNA-seq. I would like to examine the level of sequence duplication in each of the samples I’ve selected for my analyses.
I am a bench biologist interested in classifying healthy control and tumor-adjacent colon tissue. My bioinformatics collaborator suggests I use a linear SVM to do my predictions, but I do not have any programming experience. I’ve identified two colorectal cancer (CRC) datasets in the Data Refinery that meet my requirements (they are from different platforms) and I would like to use the SVM module on GenePattern because it accepts both a training and test data set. The CRC data sets need to be subset for use with the SVM module because they both contain tumor, tumor-adjacent normal, and healthy control tissue and I am only interested in the comparison between healthy and tumor-adjacent samples. The SVM module requires:
A GCT file for each of the 2 sets. It must contain the same genes for both data sets. A CLS file for both data sets.
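The subsetting and file preparation could look roughly like the following. It is a sketch that follows my understanding of the GCT 1.2 and CLS layouts from the GenePattern file format documentation, which should be checked before relying on it; the data layout, function names, and sample ids are hypothetical.

```python
def subset_samples(sample_classes, keep):
    """Return [(sample id, class)] for samples whose class is in keep, sorted by id."""
    return [(s, c) for s, c in sorted(sample_classes.items()) if c in keep]


def to_gct(expression, genes, samples):
    """Build GCT 1.2 text: version line, dimensions, header row, one row per gene.

    expression: {gene: {sample: value}}. The gene id is reused as the description.
    """
    lines = [
        "#1.2",
        "%d\t%d" % (len(genes), len(samples)),
        "\t".join(["NAME", "Description"] + list(samples)),
    ]
    for g in genes:
        lines.append("\t".join([g, g] + [str(expression[g][s]) for s in samples]))
    return "\n".join(lines) + "\n"


def to_cls(labels, class_names):
    """Build CLS text; labels must be in the same order as the GCT sample columns."""
    index = {name: i for i, name in enumerate(class_names)}
    return "\n".join([
        "%d %d 1" % (len(labels), len(class_names)),
        "# " + " ".join(class_names),
        " ".join(str(index[label]) for label in labels),
    ]) + "\n"


# Toy training and test sets from two platforms.
keep = ("healthy", "tumor_adjacent")
train_expr = {
    "GENE1": {"a1": 1.0, "a2": 2.0, "a3": 3.0},
    "GENE2": {"a1": 4.0, "a2": 5.0, "a3": 6.0},
    "GENE3": {"a1": 7.0, "a2": 8.0, "a3": 9.0},
}
test_expr = {"GENE1": {"b1": 1.5}, "GENE2": {"b1": 2.5}}
train_classes = {"a1": "healthy", "a2": "tumor", "a3": "tumor_adjacent"}

# Both GCT files must contain the same genes, so restrict to the intersection.
common = sorted(set(train_expr) & set(test_expr))
train_pairs = subset_samples(train_classes, keep)
train_gct = to_gct(train_expr, common, [s for s, _ in train_pairs])
train_cls = to_cls([c for _, c in train_pairs], keep)
```

The test set would go through the same three calls with its own class labels.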
I am a medulloblastoma researcher that is interested in doing a meta-analysis of all available medulloblastoma data. I would like to run CoGAPS and relate the patterns I find back to histology. I’ve explored the data somewhat using the medulloblastoma data scope on R2 and I’ve noticed the data come from different platforms. It would significantly speed up my research if both the gene expression data and the histology labels were normalized in some way.
I’m a machine learning researcher with very little biological expertise and no knowledge of different RNA assay technologies. I need a large compendium of gene expression data that has been cleaned and is reasonably comparable.
I am a researcher studying the role of high fat diets. I just read a manuscript ( Kwon EY, Shin SK, Cho YY, Jung UJ et al. Time-course microarrays reveal early activation of the immune transcriptome and adipokine dysregulation leads to fibrosis in visceral adipose depots during diet-induced obesity. BMC Genomics 2012 Sep 4;13:450. PMID: 22947075 ). I would like to download all of the data associated with the manuscript to perform my own secondary analysis to confirm the researchers' findings in data-refinery processed data.
I am a researcher interested in using the data refinery for multiple, continuously updated projects. I would like a history of the data I've downloaded from the web version and the processor versions used to process them.
I am a bench scientist that is collaborating with a team of physicians working on disease X. My collaborators have cautioned me that not all publicly available disease X gene expression data sets they have come across are appropriate for our research question. I would like to send them a list of available refinery-processed data sets for them to look over and approve.
We need examples of the types of experiments and queries which are going to be executed against data-refinery data. These examples should include the perspectives of a bioinformaticist, a bench researcher, and a machine learning researcher. Ideally, we can create lots of these and have one canonical example each for beginner ("Hello, World!"), intermediate, and advanced scenarios, which we can turn into tutorials and test cases in CI. I'm hoping we can get about 25-50 of these examples at a minimum.
This is homework for @jaclyn-taroni, @dvenprasad, probably @cgreene, and anybody else who wants to jump in.