dvenprasad closed 6 years ago
Stuff that came out of today's whiteboarding session:
Filters:
Search result requirements
Things not discussed:
Behavior expectations for
[Notes for me] Define behaviors for
Documenting from discussion: Under the current plan for building a dataset, samples and experiments have unique associations, so a sample must be added to a dataset multiple times if it's going to be used in multiple experiments. Duplicates will be removed in the species aggregation but included in each experiment with an association. There is not yet an experiment-experiment relation other than in the description text provided to us.
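To make the association model above concrete, here is a minimal Python sketch. It is only an illustration under assumptions, not the project's implementation: the `Dataset` class, its method names, and the accession strings are all hypothetical.

```python
from collections import defaultdict


class Dataset:
    """Hypothetical dataset: stores one (experiment, sample) association per addition."""

    def __init__(self):
        # Each entry is a unique (experiment, sample) pair.
        self.associations = set()

    def add(self, experiment, sample):
        # A sample used in several experiments is added once per experiment.
        self.associations.add((experiment, sample))

    def by_experiment(self):
        # Per-experiment view: a sample appears under every experiment it is associated with.
        grouped = defaultdict(set)
        for experiment, sample in self.associations:
            grouped[experiment].add(sample)
        return dict(grouped)

    def by_species(self, species_of):
        # Species aggregation: each sample appears once, however many experiments use it.
        grouped = defaultdict(set)
        for _, sample in self.associations:
            grouped[species_of[sample]].add(sample)
        return dict(grouped)


dataset = Dataset()
dataset.add("EXP-A", "SAMPLE-1")
dataset.add("EXP-B", "SAMPLE-1")  # same sample, second experiment
dataset.add("EXP-B", "SAMPLE-2")

# Hypothetical sample-to-species lookup.
species_of = {"SAMPLE-1": "HOMO_SAPIENS", "SAMPLE-2": "HOMO_SAPIENS"}
experiment_view = dataset.by_experiment()
species_view = dataset.by_species(species_of)
```

In this sketch `SAMPLE-1` shows up under both experiments in `experiment_view` but only once in `species_view`, which mirrors the "duplicates removed in the species aggregation" behavior described above.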
Summary of Low-Fi Feedback:
Low-Fi Screens:
@dvenprasad and I chatted about the responsive design of this page and decided to hide the description/details for each experiment on the search page on mobile, at least for the Keytar Kurt iteration. It's just too much information to view effectively on a mobile device. Users will have to make decisions based on the "factoid bar" and click through to the individual experiment page if they want to see more details.
Summary of High-Fi Feedback:
Invision Prototypes:
Desktop: https://invis.io/ZVI3AS2THFN
Mobile: https://invis.io/X6I3BB4NAYT -> Search result variations on mobile
The badges are in the MidFi Project's Assets section.
Guidelines and states for checkboxes and numbered badges are in the master.sketch file in https://github.com/AlexsLemonade/refinebio-design
@ramenhog Let me know if there's anything else you need.
Context
This issue defines the scope and requirements for the Search page design iteration.
Problem or idea
Relevant Features for Search page
Search by
and metadata
Publication information (Use Cases: 37) (We do not yet have enough information about authors and publications to support this feature.)
Filter by
Platform (Use Cases: 3) (Because platform needs to be normalized and mapped.)
Submitter's institution (Use Cases: 3) (It is more useful when users are able to search with publication information. Removing it from scope.)
Actions
Display
Indicate if a search result is a subseries and link to its superseries (From @Miserlou's comment -> "There is not yet an experiment-experiment relation other than in the description text provided to us." We will be able to indicate this, but it is not optimal to provide information about subseries.)
Show a superseries as the search result, and not its subseries, if it ranks higher in the search.
Indicate if a search result is a SuperSeries.
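The superseries rule in the Display list could be sketched roughly as below. This is a hypothetical illustration only: it assumes each search result carries an `accession` and an optional `superseries` field, which does not exist yet given the missing experiment-experiment relation noted above, and it reads "ranks higher" as "appears earlier in the ranked list".

```python
def collapse_subseries(ranked_results):
    """Drop a subseries when its superseries appears earlier (ranks higher) in the results."""
    seen = set()
    kept = []
    for result in ranked_results:  # best match first
        parent = result.get("superseries")
        if parent is not None and parent in seen:
            continue  # the higher-ranked superseries already represents this subseries
        seen.add(result["accession"])
        kept.append(result)
    return kept


# Hypothetical ranked results; accessions are placeholders, not real records.
results = [
    {"accession": "SUPER-1", "superseries": None},
    {"accession": "SUB-1", "superseries": "SUPER-1"},  # hidden: SUPER-1 ranks higher
    {"accession": "SUB-2", "superseries": "SUPER-2"},  # kept: it outranks SUPER-2
    {"accession": "SUPER-2", "superseries": None},
]
collapsed = [r["accession"] for r in collapse_subseries(results)]
```

Under this reading, a subseries that outranks its own superseries is still shown; whether that is the desired behavior is an open design question rather than something this thread settles.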
Use Cases
One
Applicable to: Data Whiz
I’m an ML researcher that specializes in natural language processing in biomedical literature. I’ve analyzed all papers in PubMed that have a Gene Expression Omnibus (GEO) data set connected to them. I've analyzed things like the co-occurrence of disease labels and pathway names (e.g., KEGG). I’m wondering if I can predict the “behavior” of the pathways in the GEO dataset from the text. For example, can I tell if a pathway will be downregulated in disease samples compared to controls based on the text of a paper and what might the text tell me about the magnitude of that change?
Consequently, I have a list of data sets that I know contain healthy controls for comparison based on my analysis of the submitter-supplied GEO abstract/description text.
I want to use multiple kinds of tools for pathway analysis and I have come up with a list of genes that I want to study based on the pathways I’m interested in.
I only want data sets that fit the following criteria: each data set "covers" 90% of the genes in the list I want to study, and each data set has a publication associated with it.
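The two criteria in this use case can be expressed as a simple filter. The sketch below is an assumption-laden illustration: the dictionary keys (`id`, `genes`, `has_publication`), the function names, and the reading of "covers 90%" as "at least 90%" are all hypothetical choices, not part of any existing API.

```python
def gene_coverage(dataset_genes, genes_of_interest):
    """Fraction of the genes of interest that are measured in the data set."""
    wanted = set(genes_of_interest)
    if not wanted:
        return 0.0
    return len(wanted & set(dataset_genes)) / len(wanted)


def meets_criteria(dataset, genes_of_interest, min_coverage=0.9):
    # Both criteria from the use case: an associated publication and enough gene coverage.
    return (
        dataset["has_publication"]
        and gene_coverage(dataset["genes"], genes_of_interest) >= min_coverage
    )


# Placeholder gene names and data sets for illustration only.
genes_of_interest = [f"GENE{i}" for i in range(10)]
datasets = [
    {"id": "DS-1", "genes": genes_of_interest[:9] + ["OTHER"], "has_publication": True},
    {"id": "DS-2", "genes": genes_of_interest, "has_publication": False},
    {"id": "DS-3", "genes": genes_of_interest[:5], "has_publication": True},
]
selected = [d["id"] for d in datasets if meets_criteria(d, genes_of_interest)]
```

Here only `DS-1` passes: `DS-2` has full coverage but no publication, and `DS-3` covers only half of the list.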
System Actions
Two
Applicable to: Data Whiz
I’m a computational biologist that wants to predict cell line passage number and cell line identity from gene expression data. I have generated my own gold standards from the following breast cancer cell lines: MCF-7, SkBr3, BT-483, and BT-474. I only want gene expression data that are reported to be samples from these cell lines for validation. I'm particularly interested in data sets where passage number is reported in the sample metadata. I would also love to provide some summary statistics about usage of different breast cancer cell lines in publicly available transcriptomics data (e.g., 45% of samples are from MCF-7) in the introduction of my paper if possible.
System Actions
Three
Applicable to: Data Whiz
I’m a developmental biologist but I’m comfortable working with single cell RNA-seq (scRNA-seq), bulk RNA-seq and microarray data. I work with C. elegans. I’ve done a single cell RNA-seq experiment in a particular line/mutant that has an expansion in one cell lineage and those "extra" cells also demonstrate a change in expression patterns as compared to the wild type lineage (in adult worms).
I want to use bulk (whole worm) RNA-seq data from different labs that sample different stages of development for validation and exploratory analyses. I know that this data is out there based on my review of the literature.
If I can find microarray data sets of relevant mutants (relevance is determined based on my expert knowledge), that would really help take my research to the next level (assume I have an awesome method for cell type deconvolution, but it’s only appropriate for use with data from one microarray platform).
Also, I have a grant submission due soon and to demonstrate relevance to human health with some preliminary data, I want to look at whether these changes in expression patterns are present in whole tissue gastrointestinal tract biopsies in disease.
System Actions
Seven
Applicable to: Data Whiz
I’m a bioinformatician that has recently finished up a postdoc and is trying to get out a final publication. I used two publicly available cohorts as validation in my manuscript. One of the reviewers wants me to analyze an additional four data sets that I was unaware of upon submission. Unfortunately, since I’ve left my institution, I no longer have access to the computing resources I need for reprocessing the raw data. I would like all six cohorts to be processed the same way to feel confident in my results and some of the necessary details are missing from the accessions in ArrayExpress.
System Actions
Seventeen
Applicable to: Data Whiz
I'm an ML researcher that works on methods for time series data. I want to know if I can automatically identify time series gene expression data from large compendia. (Inspired by some early Greene Lab projects :) ) I have a handful of data sets that I know are time series as a starting point, but not enough to split the data into training and testing sets.
System Actions
Eighteen
Applicable to: Data Whiz
I'm an ML researcher that tends to use methods that are quite sensitive to duplicated data. I want to automatically detect duplicate samples in transcriptomic data without using any additional information (e.g., cancer type). I realize that sometimes what I consider to be duplicates (e.g., samples from the same individual, same tissue) might sometimes be run on different platforms or technologies. I want to find examples of GEO SuperSeries from diverse conditions where the same sample has been run on multiple microarray platforms to craft my duplicate detection method. It will save me time if the SuperSeries are uniformly processed in some way.
System Actions
Nineteen
Applicable to: Data Whiz
I'm a computational biologist that has built a model that accurately predicts the proportion of cells in each phase of the cell cycle using gold standard data (e.g., flow cytometry and RNA-seq data) my lab has generated. Our lab studies angiogenesis, so I want to apply my model to publicly available VEGF time course data. The only data that I can find is microarray data from multiple platforms and I want all of the data, including the RNA-seq data I've generated myself, to be more comparable.
System Actions
Twenty-three
Applicable to: Physician
I am a physician that has a sample from a patient with an unknown condition. I would like to identify samples that are most similar to my one sample. My sample has been normalized and processed by a collaborator, but I still have access to the raw files. I have a table of gene expression values that are mapped to HGNC Gene Symbols. I can use this normalized data to get an idea of what samples are similar, but it would be best if I had the ability to upload my raw files and have them reprocessed in the same manner as the rest of the compendium.
System Actions
Thirty-five
I am a medulloblastoma researcher that is interested in doing a meta-analysis of all available medulloblastoma data. I would like to run CoGAPS and relate the patterns I find back to histology. I’ve explored the data somewhat using the medulloblastoma data scope on R2 and I’ve noticed the data come from different platforms. It would significantly speed up my research if both the gene expression data and the histology labels were normalized in some way.
System Actions
Thirty-seven
I am a researcher studying the role of high fat diets. I just read a manuscript (Kwon EY, Shin SK, Cho YY, Jung UJ et al. Time-course microarrays reveal early activation of the immune transcriptome and adipokine dysregulation leads to fibrosis in visceral adipose depots during diet-induced obesity. BMC Genomics 2012 Sep 4;13:450. PMID: 22947075). I would like to download all of the data associated with the manuscript to perform my own secondary analysis to confirm the researchers' findings in data-refinery processed data.
System Actions
Solution or next step
@jaclyn-taroni, @Miserlou, and I will have a whiteboarding session to outline workflows and layout.