Search Page Iteration: Scope, Requirements, Design Decisions, and Mockups

dvenprasad commented 6 years ago

Context

This issue defines the scope and requirements for the Search page design iteration.

Problem or idea

Relevant Features for Search page

Search by
- Free text Search (Use Cases: 1, 17, 19) Match titles, descriptions, ~~and metadata~~
- Accession ID / list of accession IDs (Use Cases: 1, 35) The format of accession IDs varies from source to source. Users must be able to search with accession IDs regardless of the ID format. They should also be able to search by sample ID.
- PubMed ID (Use Cases: 37)
- ~~Publication information (Use Cases: 37)~~ (We do not yet have enough information about authors and publications to support this feature)
- Publication Title
Filter by
- Organism (Use Cases: 3) Users should be given the option to filter by organism when initiating a search.
- Technology (Use Cases: 3, 7)
- ~~Platform (Use Cases: 3)~~ (Because platform needs to be normalized and mapped)
- ~~Submitter's institution (Use Cases: 3)~~ (It is more useful once users are able to search with publication information. Removing it from scope)
- Has publication? (Use Cases: 1)
Actions
- Add to dataset/ Add multiple results to dataset
- Download now. For aggregatable datasets: take users to the download page. For non-aggregatable datasets: link them to the source.
- Add all search results on a page to dataset
- View experiment and sample details
Display
- Accession IDs in format 'GSE-'
- The sample IDs are not uniformly formatted. For the samples where an ID is available, it should be displayed in the format 'GSM-'
- ~~Indicate if a search result is a subseries and link to its superseries~~ ( From @Miserlou's comment ->"There is not yet an experiment-experiment relation other than in the description text provided to us." We will be able to indicate but it is not optimal to provide information about subseries.)
- Show superseries as search result and not its subseries if it ranks higher in the search. Indicate if a search result is a SuperSeries.

Use Cases

One

Applicable to: Data Whiz

I’m an ML researcher that specializes in natural language processing in biomedical literature. I’ve analyzed all papers in PubMed that have a Gene Expression Omnibus (GEO) data set connected to them. I've analyzed things like the co-occurrence of disease labels and pathway names (e.g., KEGG). I’m wondering if I can predict the “behavior” of the pathways in the GEO dataset from the text. For example, can I tell if a pathway will be downregulated in disease samples compared to controls based on the text of a paper and what might the text tell me about the magnitude of that change?

Consequently, I have a list of data sets that I know contain healthy controls for comparison based on my analysis of the submitter-supplied GEO abstract/description text.

I want to use multiple kinds of tools for pathway analysis and I have come up with a list of genes that I want to study based on the pathways I’m interested in.

I only want data sets that fit the following criteria: This data set "covers" 90% of the genes in my list I want to study. All data sets should have a publication associated with them.

System Actions

User must be able to search with a list of accession numbers (e.g., comma-separated values).
User must be able to filter by providing a list of genes and setting a tolerance for missing genes. The system should filter results based on the genes present in the samples of each dataset, not on the genes that can be measured on a platform.
Filter by whether a dataset is associated with a publication or not.
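
A minimal sketch of how these three system actions could fit together, assuming each dataset exposes the set of genes actually present in its samples and a publication flag. The `genes` and `has_publication` keys, the function names, and the uppercase normalization of accessions are all hypothetical and not part of any existing refine.bio API.

```python
from typing import Dict, Iterable, List, Set


def parse_accession_list(raw: str) -> List[str]:
    """Split a comma-separated string of accessions, dropping blanks and duplicates."""
    seen: Set[str] = set()
    accessions: List[str] = []
    for token in raw.split(","):
        accession = token.strip().upper()  # assumption: accessions are case-insensitive
        if accession and accession not in seen:
            seen.add(accession)
            accessions.append(accession)
    return accessions


def filter_datasets(
    datasets: Dict[str, dict],
    wanted_genes: Iterable[str],
    missing_tolerance: float = 0.10,
    require_publication: bool = True,
) -> List[str]:
    """Keep datasets that miss at most `missing_tolerance` of the wanted genes."""
    wanted = set(wanted_genes)
    max_missing = missing_tolerance * len(wanted)
    kept: List[str] = []
    for accession, info in datasets.items():
        # Coverage is based on genes present in the dataset's samples,
        # not on what the platform could measure in principle.
        missing = len(wanted - info["genes"])
        if missing > max_missing:
            continue
        if require_publication and not info["has_publication"]:
            continue
        kept.append(accession)
    return kept
```

With the default tolerance of 0.10, a dataset is kept when it covers at least 90% of the gene list, which matches the criterion in this use case.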

Two

Applicable to: Data Whiz

I’m a computational biologist that wants to predict cell line passage number and cell line identity from gene expression data. I have generated my own gold standards from the following breast cancer cell lines: MCF-7, SkBr3, BT-483 and BT-474. I only want gene expression data that are reported to be samples from these cell lines for validation. I'm particularly interested in data sets where passage number is reported in the sample metadata. I would also love to provide some summary statistics about usage of different breast cancer cell lines in publicly available transcriptomics data (e.g., 45% of samples are from MCF-7) in the introduction of my paper if possible.

System Actions

User must be able to do a full-text search of the abstract and experiment name with key terms like a cell-line name, terms like “breast cancer cell-line”, or a list of cell-lines.
User must be able to select certain search results to download later (shopping cart) based on their expert knowledge.

Three

Applicable to: Data Whiz

I’m a developmental biologist but I’m comfortable working with single cell RNA-seq (scRNA-seq), bulk RNA-seq and microarray data. I work with C. elegans. I’ve done a single cell RNA-seq experiment in a particular line/mutant that has an expansion in one cell lineage and those "extra" cells also demonstrate a change in expression patterns as compared to the wild type lineage (in adult worms).

I want to use bulk (whole worm) RNA-seq data from different labs that sample different stages of development for validation and exploratory analyses. I know that this data is out there based on my review of the literature.

If I can find microarray data sets of relevant mutants (relevance is determined based on my expert knowledge), that would really help take my research to the next level (assume I have an awesome method for cell type deconvolution, but it’s only appropriate for use with data from one microarray platform).

Also, I have a grant submission due soon and to demonstrate relevance to human health with some preliminary data, I want to look at whether these changes in expression patterns are present in whole tissue gastrointestinal tract biopsies in disease.

System Actions

User must be able to filter search results by data type (i.e., RNA-seq, microarray), source organ (e.g., cell-lines, blood), organism, disease, platform, and source of the data (i.e., submitter institution).

Seven

Applicable to: Data Whiz

I’m a bioinformatician that has recently finished up a postdoc and is trying to get out a final publication. I used two publicly available cohorts as validation in my manuscript. One of the reviewers wants me to analyze an additional four data sets that I was unaware of upon submission. Unfortunately, since I’ve left my institution, I no longer have access to the computing resources I need for reprocessing the raw data. I would like all six cohorts to be processed the same way to feel confident in my results and some of the necessary details are missing from the accessions in ArrayExpress.

System Actions

User must be able to search with a list of experiment accession numbers, either from GEO or SRA or ArrayExpress.
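
As a rough illustration of accepting accessions from all three sources in one input, the sketch below sorts each accession by its prefix. The patterns are simplifying assumptions (GEO series as `GSE` plus digits, SRA studies as `SRP`/`ERP`/`DRP` plus digits, ArrayExpress as `E-XXXX-n`), and the function name is hypothetical.

```python
import re
from typing import Dict, List

# Assumed accession shapes; real inputs may need looser or additional patterns.
SOURCE_PATTERNS = {
    "GEO": re.compile(r"^GSE\d+$"),
    "SRA": re.compile(r"^[SED]RP\d+$"),
    "ArrayExpress": re.compile(r"^E-[A-Z]{4}-\d+$"),
}


def classify_accessions(raw: str) -> Dict[str, List[str]]:
    """Group accessions from a comma- or whitespace-separated string by source."""
    grouped: Dict[str, List[str]] = {name: [] for name in SOURCE_PATTERNS}
    grouped["unknown"] = []
    for token in re.split(r"[,\s]+", raw.strip()):
        if not token:
            continue
        accession = token.upper()
        for source, pattern in SOURCE_PATTERNS.items():
            if pattern.match(accession):
                grouped[source].append(accession)
                break
        else:
            # Nothing matched: keep it so the UI can report it back to the user.
            grouped["unknown"].append(accession)
    return grouped
```

Keeping unrecognized entries in an `unknown` bucket, rather than dropping them, would let the search page tell the user which accessions could not be interpreted.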

Seventeen

Applicable to: Data Whiz

I'm an ML researcher that works on methods for time series data. I want to know if I can automatically identify time series gene expression data from large compendia. (Inspired by some early Greene Lab projects :) ) I have a handful of data sets that I know are time series as a starting point, but not enough to split data into training or testing.

System Actions

Search with the term “time-series”.

Eighteen

Applicable to: Data Whiz

I'm an ML researcher that tends to use methods that are quite sensitive to duplicated data. I want to automatically detect duplicate samples in transcriptomic data without using any additional information (e.g., cancer type). I realize that sometimes what I consider to be duplicates (e.g., samples from the same individual, same tissue) might sometimes be run on different platforms or technologies. I want to find examples of GEO SuperSeries from diverse conditions where the same sample has been run on multiple microarray platforms to craft my duplicate detection method. It will save me time if the SuperSeries are uniformly processed in some way.

System Actions

User must be able to filter search results by whether they are superseries or not.
Search results must indicate if a series is part of a superseries. If all the series in a superseries meet the search criteria, only display the superseries.
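
The collapsing rule could look roughly like the sketch below. It assumes a mapping from each superseries to its subseries is available, which is a hypothetical input here; the accessions in any example call are made up, and ranking of results is left out.

```python
from typing import Dict, List, Set, Tuple


def collapse_superseries(
    matching: Set[str], members: Dict[str, Set[str]]
) -> List[Tuple[str, bool]]:
    """Return (accession, is_superseries) pairs for display.

    `matching` holds the accessions that met the search criteria and
    `members` maps a superseries accession to its subseries accessions.
    """
    collapsed: Set[str] = set()
    hidden: Set[str] = set()
    for superseries, subseries in members.items():
        # Collapse only when every subseries of the superseries matched.
        if subseries and subseries <= matching:
            collapsed.add(superseries)
            hidden |= subseries
    visible = (matching - hidden) | collapsed
    return [(accession, accession in members) for accession in sorted(visible)]
```

A superseries with only some matching subseries is left alone, so those subseries still appear as individual results.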

Nineteen

Applicable to: Data Whiz

I'm a computational biologist that has built a model that accurately predicts the proportion of cells in each phase of the cell cycle using gold standard data (e.g., flow cytometry and RNA-seq data) my lab has generated. Our lab studies angiogenesis, so I want to apply my model to publicly available VEGF time course data. The only data that I can find is microarray data from multiple platforms and I want all of the data, including the RNA-seq data I've generated myself, to be more comparable.

System Actions

Search with the term “VEGF”.

Twenty-three

Applicable to: Physician

I am a physician that has a sample from a patient with an unknown condition. I would like to identify samples that are most similar to my one sample. My sample has been normalized and processed by a collaborator, but I still have access to the raw files. I have a table of gene expression values that are mapped to HGNC Gene Symbols. I can use this normalized data to get an idea of what samples are similar, but it would be best if I had the ability to upload my raw files and have them reprocessed in the same manner as the rest of the compendium.

System Actions

User must be able to use different gene identifiers and still search through the data refinery, i.e., the system should be able to map different gene identifiers to ENSG IDs.
Only compare to human samples
Return list of similar samples
If multiple (compression) models (model mill) exist for the human compendium, let me pick which model to use with a tooltip with some guidance; also a reasonable default exists
Ranked list of samples (by some quantitative measure depending on model) that can be expanded to reveal abstract, title, etc. of experiment that sample originated from.
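
For the identifier-mapping action above, a minimal sketch might look like the following. The lookup table is a hand-written toy with a few illustrative entries (symbols and Entrez IDs); a real system would load the mapping from an annotation resource, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Toy table for illustration only; not a complete or authoritative mapping.
TOY_ID_TABLE: Dict[str, str] = {
    "TP53": "ENSG00000141510",
    "7157": "ENSG00000141510",
    "BRCA1": "ENSG00000012048",
    "672": "ENSG00000012048",
}


def map_to_ensg(
    identifiers: Iterable[str], table: Dict[str, str] = TOY_ID_TABLE
) -> Tuple[Dict[str, str], List[str]]:
    """Map user-supplied gene identifiers to ENSG IDs.

    Returns the successful mappings and the identifiers that could not be mapped.
    """
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for identifier in identifiers:
        key = identifier.strip().upper()
        if key.startswith("ENSG"):
            mapped[identifier] = key  # already an Ensembl gene ID
        elif key in table:
            mapped[identifier] = table[key]
        else:
            unmapped.append(identifier)
    return mapped, unmapped
```

Returning the unmapped identifiers separately lets the interface warn the user instead of silently ignoring part of their gene table.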

Thirty-five

I am a medulloblastoma researcher that is interested in doing a meta-analysis of all available medulloblastoma data. I would like to run CoGAPS and relate the patterns I find back to histology. I’ve explored the data somewhat using the medulloblastoma data scope on R2 and I’ve noticed the data come from different platforms. It would significantly speed up my research if both the gene expression data and the histology labels were normalized in some way.

System Actions

Search with accession numbers.
Download the data.

Thirty-seven

I am a researcher studying the role of high fat diets. I just read a manuscript ( Kwon EY, Shin SK, Cho YY, Jung UJ et al. Time-course microarrays reveal early activation of the immune transcriptome and adipokine dysregulation leads to fibrosis in visceral adipose depots during diet-induced obesity. BMC Genomics 2012 Sep 4;13:450. PMID: 22947075 ). I would like to download all of the data associated with the manuscript to perform my own secondary analysis to confirm the researchers' findings in data-refinery processed data.

System Actions

Search with publication information.
Download the data.

Solution or next step

@jaclyn-taroni, @Miserlou, and I will have a whiteboarding session to outline workflows and layout.

dvenprasad commented 6 years ago

Stuff that came out of today's whiteboarding session:

Be able to add a subset of samples to a dataset.

Filters:

Auto-apply Filters

Search result- requirements

Adding an experiment to the dataset should be the primary CTA (call to action), with priority over adding a subset of samples.
Give users enough information to build datasets from the search results page in most cases.
Search results are oriented around experiment information because, at this point, we have better data at the experiment level than at the sample level.

Things not discussed:

Behavior expectations for

Sub-series Result
Super-series Result

[Notes for me] Define behaviors for

aggregatable search Result
Non-aggregatable search Result
Search result - added to dataset
~~Search result - already hit download now in the same session~~ (not providing an option of download now)

Miserlou commented 6 years ago

Documenting from discussion: In the current plan for actually building a dataset, samples and experiments have unique associations, so a sample must be added to a dataset multiple times if it is going to be used in multiple experiments. Duplicates will be removed in the species aggregation but included in each experiment with an association. There is not yet an experiment-experiment relation other than in the description text provided to us.
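
To make the association rule concrete, here is a small sketch that models a dataset as a mapping from experiment accession to sample accessions. The structure and function names are assumptions for illustration, not the actual data model.

```python
from typing import Dict, List


def add_sample(dataset: Dict[str, List[str]], experiment: str, sample: str) -> None:
    """Associate a sample with one experiment; repeat the call per experiment."""
    samples = dataset.setdefault(experiment, [])
    if sample not in samples:
        samples.append(sample)


def aggregate_by_species(
    dataset: Dict[str, List[str]], species_of: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group samples by species, keeping each sample only once per species."""
    by_species: Dict[str, List[str]] = {}
    for samples in dataset.values():
        for sample in samples:
            bucket = by_species.setdefault(species_of[sample], [])
            if sample not in bucket:  # drop duplicates across experiments
                bucket.append(sample)
    return by_species
```

A sample shared by two experiments appears under both experiment keys, while the species-level view lists it once.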

dvenprasad commented 6 years ago

Summary of Low-Fi Feedback:

Remove Technology and Agg/Non-Agg badges
Remove Submitter's Institution from Filters
Modify label on Add all to Dataset button to "Add page to dataset"
Not handling super-series on UI for this iteration
Remove applied filters component -> move it to post-Keytar Kurt
Add a Clear All button for filters.

Low-Fi Screens:

search-results-nonagg-added
search-result-super-series
search-results-sub-series-multivalues
search-result-accession-pmid

ramenhog commented 6 years ago

@dvenprasad and I chatted about the responsive design of this page and came to the conclusion to hide the description/details for each experiment on the search page on mobile, at least for the Keytar Kurt iteration. It's just too much information to be viewed effectively on a mobile device. Users will have to make decisions based on the "factoid bar" and click to the individual experiment page if they want to see more details.

dvenprasad commented 6 years ago

Summary of High-Fi Feedback:

Add a variation of the result card -> a subset of samples from an experiment has been added
Different iconography for platform depending on the technology
No collapse on clicking 'Add to dataset'
Divide the factoid bar into 3 columns so that there is more width to accommodate multiple values for organism or platform, with less chance of wrapping.

search_results

dvenprasad commented 6 years ago

Invision Prototypes: Desktop: https://invis.io/ZVI3AS2THFN Mobile: https://invis.io/X6I3BB4NAYT -> Search result variations on mobile

The badges are in the MidFi Project's Assets section.

Guidelines and states for checkboxes and numbered badges are in the master.sketch file in https://github.com/AlexsLemonade/refinebio-design

@ramenhog Let me know if there's anything else you need.
