AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Search Page Iteration: Scope, Requirements, Design Decisions, and Mockups #216

Closed dvenprasad closed 6 years ago

dvenprasad commented 6 years ago

Context

This issue defines the scope and requirements for the Search page design iteration.

Problem or idea

Relevant Features for Search page

Use Cases

One

Applicable to: Data Whiz

I’m an ML researcher that specializes in natural language processing in biomedical literature. I’ve analyzed all papers in PubMed that have a Gene Expression Omnibus (GEO) data set connected to them. I've analyzed things like the co-occurrence of disease labels and pathway names (e.g., KEGG). I’m wondering if I can predict the “behavior” of the pathways in the GEO dataset from the text. For example, can I tell if a pathway will be downregulated in disease samples compared to controls based on the text of a paper and what might the text tell me about the magnitude of that change?

Consequently, I have a list of data sets that I know contain healthy controls for comparison based on my analysis of the submitter-supplied GEO abstract/description text.

I want to use multiple kinds of tools for pathway analysis and I have come up with a list of genes that I want to study based on the pathways I’m interested in.

I only want data sets that fit the following criteria: each data set "covers" 90% of the genes in the list I want to study, and each data set has a publication associated with it.

System Actions
  1. User must be able to search with a list of accession numbers (e.g., comma-separated values).
  2. User must be able to filter by providing a list of genes and setting a tolerance for missing genes. The system should filter results based on the genes present in the samples of each dataset, not on the genes that can be measured on a platform.
  3. User must be able to filter by whether a dataset is associated with a publication.
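The gene-list filter in action 2 could be sketched roughly as below. This is only an illustration under stated assumptions, not the refine.bio implementation: the function names, the `tolerance` parameter (expressed as a fraction of the gene list), and the in-memory dictionary of datasets are all hypothetical.

```python
def missing_fraction(wanted_genes, dataset_genes):
    """Fraction of the wanted genes that are absent from the dataset's samples."""
    wanted = set(wanted_genes)
    if not wanted:
        return 0.0
    return len(wanted - set(dataset_genes)) / len(wanted)


def filter_by_gene_coverage(datasets, wanted_genes, tolerance=0.1):
    """Keep datasets that miss at most `tolerance` of the wanted genes.

    `datasets` maps a (hypothetical) dataset accession to the genes actually
    present in its samples, not the genes its platform could measure.
    """
    return [
        accession
        for accession, genes in datasets.items()
        if missing_fraction(wanted_genes, genes) <= tolerance
    ]
```

With the default `tolerance=0.1`, a dataset passes only if it covers at least 90% of the list, which matches the criterion in the use case above.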

Two

Applicable to: Data Whiz

I’m a computational biologist that wants to predict cell line passage number and cell line identity from gene expression data. I have generated my own gold standards from the following breast cancer cell lines: MCF-7, SkBr3, BT-483 and BT-474. I only want gene expression data that are reported to be samples from these cell lines for validation. I'm particularly interested in data sets where passage number is reported in the sample metadata. I would also love to provide some summary statistics about usage of different breast cancer cell lines in publicly available transcriptomics data (e.g., 45% of samples are from MCF-7) in the introduction of my paper if possible.

System Actions
  1. User must be able to do a full-text search of the abstract and experiment name with key terms like a cell-line name, a phrase like “breast cancer cell-line”, or a list of cell-lines.
  2. User must be able to select certain search results to download later (shopping cart) based on their expert knowledge.

Three

Applicable to: Data Whiz

I’m a developmental biologist but I’m comfortable working with single cell RNA-seq (scRNA-seq), bulk RNA-seq and microarray data. I work with C. elegans. I’ve done a single cell RNA-seq experiment in a particular line/mutant that has an expansion in one cell lineage and those "extra" cells also demonstrate a change in expression patterns as compared to the wild type lineage (in adult worms).

I want to use bulk (whole worm) RNA-seq data from different labs that sample different stages of development for validation and exploratory analyses. I know that this data is out there based on my review of the literature.

If I can find microarray data sets of relevant mutants (relevance is determined based on my expert knowledge), that would really help take my research to the next level (assume I have an awesome method for cell type deconvolution, but it’s only appropriate for use with data from one microarray platform).

Also, I have a grant submission due soon and to demonstrate relevance to human health with some preliminary data, I want to look at whether these changes in expression patterns are present in whole tissue gastrointestinal tract biopsies in disease.

System Actions
  1. User must be able to filter search results by data-type (i.e., RNA-seq, microarray), source organ (e.g., cell-lines, blood), organism, disease, platform, and source of the data (i.e., submitter institution).

Seven

Applicable to: Data Whiz

I’m a bioinformatician that has recently finished up a postdoc and is trying to get out a final publication. I used two publicly available cohorts as validation in my manuscript. One of the reviewers wants me to analyze an additional four data sets that I was unaware of upon submission. Unfortunately, since I’ve left my institution, I no longer have access to the computing resources I need for reprocessing the raw data. I would like all six cohorts to be processed the same way to feel confident in my results and some of the necessary details are missing from the accessions in ArrayExpress.

System Actions
  1. User must be able to search with a list of experiment accession numbers, either from GEO or SRA or ArrayExpress.
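One possible way to accept a mixed list of accessions is to split the query on commas and label each token by source. The sketch below is an assumption-laden illustration: the regular expressions cover only the common experiment-level accession shapes (GEO series `GSE…`, SRA studies `SRP…`/`ERP…`/`DRP…`, ArrayExpress `E-XXXX-n`), and the function name and example accessions are hypothetical.

```python
import re

# Experiment-level accession shapes; an assumption of this sketch, not an
# exhaustive specification of what each repository issues.
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"^GSE\d+$"),
    "SRA": re.compile(r"^[SED]RP\d+$"),
    "ArrayExpress": re.compile(r"^E-[A-Z]{4}-\d+$"),
}


def classify_accessions(raw_query):
    """Split a comma-separated query and pair each accession with its source.

    Tokens that match no known pattern are returned with a source of None so
    the caller can fall back to a full-text search or show a warning.
    """
    results = []
    for token in raw_query.split(","):
        accession = token.strip().upper()
        if not accession:
            continue  # ignore empty entries such as a trailing comma
        source = next(
            (name for name, pattern in ACCESSION_PATTERNS.items()
             if pattern.match(accession)),
            None,
        )
        results.append((accession, source))
    return results
```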

Seventeen

Applicable to: Data Whiz

I'm an ML researcher that works on methods for time series data. I want to know if I can automatically identify time series gene expression data from large compendia. (Inspired by some early Greene Lab projects :) ) I have a handful of data sets that I know are time series as a starting point, but not enough to split data into training or testing.

System Actions
  1. Search terms “time-series”.

Eighteen

Applicable to: Data Whiz

I'm an ML researcher that tends to use methods that are quite sensitive to duplicated data. I want to automatically detect duplicate samples in transcriptomic data without using any additional information (e.g., cancer type). I realize that sometimes what I consider to be duplicates (e.g., samples from the same individual, same tissue) might sometimes be run on different platforms or technologies. I want to find examples of GEO SuperSeries from diverse conditions where the same sample has been run on multiple microarray platforms to craft my duplicate detection method. It will save me time if the SuperSeries are uniformly processed in some way.

System Actions
  1. User must be able to filter search results by whether they are superseries or not.
  2. Search results must indicate if a series is part of a superseries. If all the series in a superseries meet the search criteria, only display the superseries.
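The collapsing rule in action 2 could look something like the following. This is a sketch under assumptions: the function name and data shapes are hypothetical, each series is assumed to belong to at most one superseries, and `matching_series` is assumed to contain only ordinary series or sub-series.

```python
def collapse_superseries(matching_series, superseries_members):
    """Return (accession, parent_superseries) pairs to display.

    matching_series: series accessions that met the search criteria, in order.
    superseries_members: maps a superseries accession to its member series.
    A superseries whose members all matched replaces those members; any other
    matching series is shown with its parent superseries (or None).
    """
    matching = set(matching_series)
    parent_of = {
        member: superseries
        for superseries, members in superseries_members.items()
        for member in members
    }
    fully_matched = [
        superseries
        for superseries, members in superseries_members.items()
        if members and set(members) <= matching
    ]
    results = [(superseries, None) for superseries in fully_matched]
    for series in matching_series:
        parent = parent_of.get(series)
        if parent in fully_matched:
            continue  # already represented by its superseries
        results.append((series, parent))
    return results
```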

Nineteen

Applicable to: Data Whiz

I'm a computational biologist that has built a model that accurately predicts the proportion of cells in each phase of the cell cycle using gold standard data (e.g., flow cytometry and RNA-seq data) my lab has generated. Our lab studies angiogenesis, so I want to apply my model to publicly available VEGF time course data. The only data that I can find is microarray data from multiple platforms and I want all of the data, including the RNA-seq data I've generated myself, to be more comparable.

System Actions
  1. Search terms “VEGF”

Twenty-three

Applicable to: Physician

I am a physician that has a sample from a patient with an unknown condition. I would like to identify samples that are most similar to my one sample. My sample has been normalized and processed by a collaborator, but I still have access to the raw files. I have a table of gene expression values that are mapped to HGNC Gene Symbols. I can use this normalized data to get an idea of what samples are similar, but it would be best if I had the ability to upload my raw files and have them reprocessed in the same manner as the rest of the compendium.

System Actions
  1. User must be able to use different gene identifiers and still search through the data refinery, i.e., the system should be able to map different gene identifiers to ENSG IDs.
  2. Only compare to human samples.
  3. Return a list of similar samples.
  4. If multiple (compression) models (model mill) exist for the human compendium, let me pick which model to use, with a tooltip offering some guidance; a reasonable default should also exist.
  5. Return a ranked list of samples (by some quantitative measure depending on the model) that can be expanded to reveal the abstract, title, etc. of the experiment that each sample originated from.
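To make the "ranked list of samples" in action 5 concrete, here is a rough sketch that ranks compendium samples against a query sample. The issue leaves the quantitative measure open ("depending on model"), so Pearson correlation over shared genes is used purely as a stand-in; the function names, the `top_n` parameter, and the data shapes are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_similar_samples(query, compendium, top_n=10):
    """Rank compendium samples by similarity to the query, best first.

    query: maps gene ID -> expression value for the uploaded sample.
    compendium: maps sample ID -> {gene ID -> expression value}.
    Only genes present in both profiles are compared.
    """
    ranked = []
    for sample_id, profile in compendium.items():
        shared = sorted(set(query) & set(profile))
        if len(shared) < 2:
            continue  # correlation is undefined with fewer than two genes
        score = pearson([query[g] for g in shared], [profile[g] for g in shared])
        ranked.append((sample_id, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]
```

Both profiles are assumed to already use the same gene identifiers (e.g., after the mapping described in action 1).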

Thirty-five

I am a medulloblastoma researcher that is interested in doing a meta-analysis of all available medulloblastoma data. I would like to run CoGAPS and relate the patterns I find back to histology. I’ve explored the data somewhat using the medulloblastoma data scope on R2 and I’ve noticed the data come from different platforms. It would significantly speed up my research if both the gene expression data and the histology labels were normalized in some way.

System Actions
  1. Search with accession numbers.
  2. Download the data.

Thirty-seven

I am a researcher studying the role of high fat diets. I just read a manuscript (Kwon EY, Shin SK, Cho YY, Jung UJ et al. Time-course microarrays reveal early activation of the immune transcriptome and adipokine dysregulation leads to fibrosis in visceral adipose depots during diet-induced obesity. BMC Genomics 2012 Sep 4;13:450. PMID: 22947075). I would like to download all of the data associated with the manuscript to perform my own secondary analysis to confirm the researchers' findings in data-refinery processed data.

System Actions
  1. Search with publication information.
  2. Download the data.

Solution or next step

@jaclyn-taroni, @Miserlou, and I will have a whiteboarding session to outline workflows and layout.

dvenprasad commented 6 years ago

Stuff that came out of today's whiteboarding session:

Filters:

Search result- requirements

Things not discussed:

Behavior expectations for

[Notes for me] Define behaviors for

Miserlou commented 6 years ago

Documenting from discussion: In the current plan for actually building a dataset, samples and experiments have unique associations, so a sample must be added to a dataset multiple times if it's going to be used in multiple experiments. Duplicates will be removed in the species aggregation but included in each experiment with an association. There is not yet an experiment-experiment relation other than in the description text provided to us.
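A small sketch of the plan described above, for illustration only: the dataset is modeled as a plain dictionary of experiment-to-sample associations, and the function names, accessions, and `species_of` lookup are all hypothetical rather than the actual data model.

```python
from collections import defaultdict


def add_sample(dataset, experiment, sample):
    """Record one (experiment, sample) association in the dataset.

    A sample used in several experiments is added once per experiment.
    """
    samples = dataset.setdefault(experiment, [])
    if sample not in samples:
        samples.append(sample)


def aggregate_by_experiment(dataset):
    """Every experiment keeps all of its associated samples, shared ones included."""
    return {experiment: list(samples) for experiment, samples in dataset.items()}


def aggregate_by_species(dataset, species_of):
    """Group samples by species, keeping each sample only once."""
    by_species = defaultdict(list)
    for samples in dataset.values():
        for sample in samples:
            bucket = by_species[species_of[sample]]
            if sample not in bucket:
                bucket.append(sample)
    return dict(by_species)
```

Here a sample shared by two experiments appears under both in the experiment aggregation but only once in the species aggregation.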

dvenprasad commented 6 years ago

Summary of Low-Fi Feedback:

Low-Fi Screens:

[Mockup images: search-page-default, search-page-matching-text-expanded, search-results-nonagg-added, search-result-super-series, search-results-sub-series-multivalues, search-result-accession-pmid]

ramenhog commented 6 years ago

@dvenprasad and I chatted about the responsive design of this page and came to the conclusion to hide the description/details for each experiment on the search page on mobile, at least for the Keytar Kurt iteration. It's just too much information to be viewed effectively on a mobile device. Users will have to make decisions based on the "factoid bar" and click to the individual experiment page if they want to see more details.

dvenprasad commented 6 years ago

Summary of High-Fi Feedback:

[Mockup image: search_results]

dvenprasad commented 6 years ago

InVision Prototypes:
Desktop: https://invis.io/ZVI3AS2THFN
Mobile: https://invis.io/X6I3BB4NAYT -> Search result variations on mobile

The badges are in the MidFi Project's Assets section.

Guidelines and states for checkboxes and numbered badges are in the master.sketch file in https://github.com/AlexsLemonade/refinebio-design

@ramenhog Let me know if there's anything else you need.