ambrosejcarr commented 4 years ago

Appetite: 10

Requirements:

[x] Completed design that enables stories in chanzuckerberg/single-cell#15 (refine & update stories if needed)
[x] First two implementation stages are outlined as sub-epics
[ ] Reviewed and approved by product

Existing issues:

chanzuckerberg/cellxgene#1538
chanzuckerberg/cellxgene#1539
chanzuckerberg/cellxgene#1540
chanzuckerberg/cellxgene#1541

Design work:

Google Drive Folder

colinmegill commented 4 years ago

Draft Design Doc for Gene Sets

This issue is indebted to work done by @sidneybell, @liaprins and @bkmartin, as well as user feedback from @angela and @bruceAranow

Problem & Background

Presently, in cellxgene, scientists have few affordances for managing genes. The application, to date, has been primarily focused on exploring cells. This has also determined the application’s internal data model.

This feature represents, at a high level, the addition of what we are calling ‘gene sets’ to the right sidebar. Gene sets are a list of lists: each gene set has a user-generated name and contains a list of genes.
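As a rough illustration of the "list of lists" shape, a minimal Python sketch might look like this. The `GeneSet` class and `sets_containing` helper are hypothetical names, not part of cellxgene, and the gene symbols are only sample data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneSet:
    """One named, ordered list of gene symbols (hypothetical structure)."""
    name: str
    genes: List[str] = field(default_factory=list)


def sets_containing(gene_sets: List[GeneSet], gene: str) -> List[str]:
    """Return the names of every set that includes the given gene."""
    return [gs.name for gs in gene_sets if gene in gs.genes]


# The workspace is simply a list of gene sets; a gene may appear in several.
workspace = [
    GeneSet("immune_b", ["CD79A", "CD19", "MS4A1"]),
    GeneSet("immune_nk", ["NKG7", "GNLY"]),
    GeneSet("my_markers", ["CD19", "NKG7"]),
]
```

Keeping each set as an ordered list leaves room for reordering genes, and the lookup helper shows how a client could find every set a gene belongs to.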

Users must be able to act on both gene sets (adding a set, deleting a set, coloring by the set as a whole, duplicating sets) and individual genes within sets (adding, removing, and reordering genes).

Users can presently add genes to the right sidebar, but they are not persisted — on hard reset of the browser, the genes disappear and must be re-added. This solution implies a persistent workspace for genes, which are saved locally as CSV, or to some database solution in the cloud.

Gene sets will provide a more powerful affordance for building up a workspace, but will also generate untenably long lists given the heights of the histograms in the present gene UI. Thus individual genes within a gene set must have a ‘collapsed’ state, with a mini histogram (as appears on categories when coloring by a gene) or some metric which describes whether or not the gene is expressed at all by the current world of cells. The collapsed state will also include the option to color by individual genes.

Users will have CSVs of genes that they work with, or could generate them from notebooks. Cellxgene should support import of some kind to ease this path in a hosted environment. Locally, the CSV can be both edited directly from cellxgene and loaded or referenced in a notebook, so this is less of a problem.

User must be able to

Gene set + import + CRUD + export lifecycle

import existing gene sets
- from uns in h5ad
- from csv on the client
create a new gene set, and give it a name, and persist it
gene sets have a sensible data structure that aligns with scientist expectations regarding genes being in multiple datasets
- https://github.com/chanzuckerberg/cellxgene/issues/1069
edit a gene set
notify users of genes that failed to import
- https://www.figma.com/proto/MWv3Bf7X9vcTYQFJHH9n7d/Gene-set-featurelettes?node-id=7%3A8630&viewport=1490%2C930%2C1&scaling=min-zoom
- https://github.com/chanzuckerberg/cellxgene/issues/1396
delete a gene set
- https://www.figma.com/proto/MWv3Bf7X9vcTYQFJHH9n7d/Gene-set-featurelettes?node-id=7%3A8228&viewport=811%2C829%2C0.5&scaling=min-zoom
create one or more gene sets from a CSV
- https://github.com/chanzuckerberg/cellxgene/issues/1541
create a gene set from differential expression
- https://github.com/chanzuckerberg/cellxgene/issues/989
color by gene set
- https://github.com/chanzuckerberg/cellxgene/issues/571
see summary statistics for entire gene set
download and export a gene set (format TBD)
- https://github.com/chanzuckerberg/cellxgene/issues/1058

Gene CRUD lifecycle

easily scan a large list of densely packed collapsed genes with mini histograms
- https://www.figma.com/proto/MWv3Bf7X9vcTYQFJHH9n7d/Gene-set-featurelettes?node-id=75%3A1152&viewport=2136%2C718%2C0.5&scaling=min-zoom
expand a gene, see large histogram with axes, and brush to select range; this must intelligently handle a gene existing in multiple gene sets
- https://github.com/chanzuckerberg/cellxgene/issues/1585
- https://www.figma.com/proto/MWv3Bf7X9vcTYQFJHH9n7d/Gene-set-featurelettes?node-id=75%3A33&viewport=2136%2C718%2C0.5&scaling=min-zoom
add, remove, and reorder an individual gene within a geneset
[x] color by gene
[x] plot x y scatterplot by gene

Sharing

share gene sets with other users privately in read only format
- https://github.com/chanzuckerberg/cellxgene/issues/875
share gene sets with other users privately and collaboratively edit them
share gene sets publicly in read only format
- https://github.com/chanzuckerberg/cellxgene/issues/852

ambrosejcarr commented 4 years ago

Proposed staging

1. Publication use case: Gene sets can be added during launch/ingest

Users are notified when gene(s) do not import, including suggestions for common import failures
Users can color cells by average expression of a gene set
Users can plot a gene set by x or y.
Users can brush over the gene set to select a range of expression values
Users can expand a gene set to see and interact with its members
Users can scan a large list of densely packed genes
Users can expand a gene, see large histogram with axes, and brush to select range
Users can access gene sets they created in previous sessions (gene sets persist)
REQ: Gene lists are non-redundant, non-mutually exclusive, and contain only features from var.
REQ: Must handle genes appearing in two sets and being brushed in different ranges which yield a zero intersection.
REQ: Gene lists can have variable length.
REQ: List names are strings that follow the same conventions as category names.
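A minimal sketch of how the REQ lines above could be enforced at ingest, assuming the dataset's `var` index is available as a set of strings. The function name `clean_gene_list` and the sample symbols are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def clean_gene_list(genes: Iterable[str], var_names: Set[str]) -> Tuple[List[str], List[str]]:
    """Split a candidate list into (accepted, rejected) genes.

    Accepted genes are de-duplicated in first-seen order and must exist in
    var_names; everything else is returned so the user can be notified.
    """
    accepted: List[str] = []
    rejected: List[str] = []
    seen: Set[str] = set()
    for gene in genes:
        if gene in seen:
            continue  # drop repeats so the list stays non-redundant
        seen.add(gene)
        if gene in var_names:
            accepted.append(gene)
        else:
            rejected.append(gene)
    return accepted, rejected


var_names = {"CD19", "CD79A", "NKG7"}  # stand-in for the dataset's var index
accepted, rejected = clean_gene_list(["CD19", "CD19", "FOO1", "NKG7"], var_names)
```

The rejected list is what would feed the "genes that failed to import" notification. Nothing here restricts list length or stops the same gene appearing in several sets.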

2. Workspace use case: Users can create a new gene set and give it a name. Users can add, remove, and reorder genes within gene sets

Users can delete/remove a gene set

3. Users can import and export gene sets using the client

Users can download a gene set in an appropriate file format
Users can upload a gene set in csv format, and errors are handled in an informative way (see msigdb.org)
Users attempting to import genes or gene sets that are invalid receive interpretable error messages
SPIKE: does supporting compressed csv improve responsiveness of import?
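To make the import error handling concrete, here is a hedged sketch of a CSV parser that collects readable errors instead of failing on the first bad row. The file layout (one gene set per row: set name first, then gene symbols) is purely an assumption, since the format is still TBD, and `parse_gene_set_csv` is a hypothetical name.

```python
import csv
import io
from typing import Dict, List, Set, Tuple


def parse_gene_set_csv(text: str, var_names: Set[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Parse CSV text into {set_name: genes} plus a list of readable errors."""
    gene_sets: Dict[str, List[str]] = {}
    errors: List[str] = []
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue  # skip blank rows
        name, genes = cells[0], cells[1:]
        if name in gene_sets:
            errors.append(f"row {row_number}: duplicate gene set name '{name}'")
            continue
        kept: List[str] = []
        for gene in genes:
            if gene not in var_names:
                errors.append(f"row {row_number}: gene '{gene}' not found in dataset")
            elif gene not in kept:
                kept.append(gene)  # keep first occurrence only
        gene_sets[name] = kept
    return gene_sets, errors


sample = "immune_b,CD19,CD79A,FOO1\nimmune_nk,NKG7\nimmune_b,CD19\n"
parsed, problems = parse_gene_set_csv(sample, {"CD19", "CD79A", "NKG7"})
```

Valid genes are still imported when a row contains some bad ones, so a user uploading hundreds of genes gets a partial result plus a list of what to fix.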

4. Differential expression creates a gene set

When differential expression is calculated, a new geneset is created with a logical name derived from the categories that differential expression was calculated on.
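One possible shape for this, as a sketch only: the naming pattern, the `top_n` cutoff, and the `(gene, score)` input format are all assumptions, and `gene_set_from_de` is a hypothetical helper.

```python
from typing import List, Tuple


def gene_set_from_de(
    category: str,
    label_a: str,
    label_b: str,
    scored_genes: List[Tuple[str, float]],
    top_n: int = 10,
) -> Tuple[str, List[str]]:
    """Name a gene set after the compared groups and keep the top-scoring genes."""
    name = f"{category}_{label_a}_vs_{label_b}"
    ranked = sorted(scored_genes, key=lambda pair: pair[1], reverse=True)
    return name, [gene for gene, _score in ranked[:top_n]]


de_name, de_genes = gene_set_from_de(
    "cell_type", "B cell", "NK cell",
    [("CD19", 4.2), ("ACTB", 0.1), ("CD79A", 3.9)],
    top_n=2,
)
```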

5. Users can share gene sets with others

… publicly in read only format
… privately in read only format
… privately, and collaboratively edit them.

ambrosejcarr commented 4 years ago

Use cases.

from clusters

I'm on cellxgene, have my dataset. It's been clustered. I don't know what I'm looking at. I want to understand what cells are present and what stimuli they are responding to.
I know the tissue, can look up gene sets for gene programs, tissue types, stimuli.
I want to import and try a bunch, then pare down to just the sets that meaningfully mark my data.
I want to be able to inspect the sets to know which genes contribute the most to the signal, and remove genes that aren't relevant.

manually building

I am doing DE, looking at google, gene cards, playing with markers I know. I add a gene or two. I want to delete ones that turn out to be insufficiently specific. This is hard work, I need to spread it over multiple sessions. I want my sets to persist. I want to share them when I'm done.

from differential expression

I run differential expression. I want to create a gene set from the differential expression results.
(Prediction use case) What do I name my set? I need to know what gene programs correlate with the genes that are identified so I can understand why these genes are coming up.

for publication

I've fully analyzed my data. I have a bunch of gene programs that I've identified that mark cells or stimuli.
I want scientists looking at my data to be able to analyze data using these sets. Users should be able to paint by the sets. They should be able to see how individual genes in the set contribute to the total.
(heatmap use case) I should be able to see how genes in the set covary across cells.

colinmegill commented 4 years ago

Publication use case

colinmegill commented 4 years ago

Workspace use case

ambrosejcarr commented 4 years ago

What does the "create new gene set" flow look like?

I think you mentioned before that the + and ... might be hidden until mouseover -- I think I prefer that; this looks a bit busy relative to the left sidebar.

ambrosejcarr commented 4 years ago

Comment from Jonah:

My curiosity is about how/if we can precompute some standard gene sets (whether they be cells or pathways) that can be quickly referenced or imported. For example, many people want to see where myc signaling is active or Wnt ligands target genes etc. Heard him say that sharing is a future feature and this admittedly falls in the intermediate area.

signechambers1 commented 4 years ago

Great demo during sprint review @colinmegill! A few questions for you or @ambrosejcarr:

If a user wants to add a single gene (eg the current workflow in the right sidebar), is the expected workflow now to create a gene set with one gene in it?
You mentioned a user can reorder genes within a gene set, can the user reorder the genesets? This might be more of a heatmap use case (thinking back to our convo with the Krasnow lab yesterday) and something we would tackle then.
If we expect ~100 lists of ~100 genes each, I think a search function for genesets and genes would be helpful, what do you think?
Downloading a gene set seems like a less common use case but I could imagine it (ie sharing a geneset with collaborators, importing into other tools, etc.). Has this come up in discussions at all?

ambrosejcarr commented 4 years ago

Great questions @signechambers1. For this one:

Downloading a gene set seems like a less common use case but I could imagine it (ie sharing a geneset with collaborators, importing into other tools, etc.). Has this come up in discussions at all?

Your intuition is good. The explanation for why it's missing is procedural. I only requested Colin mock up the first two use cases so we don't get too far ahead of the implementation team. If you check the second comment in this issue, Import & export are the third, followed by linking to differential expression, and finally on-platform sharing.

ambrosejcarr commented 4 years ago

@colinmegill I believe this issue closes #1538, #1539, #1541, #571, and #852. Do you agree?

I think #923 relates to both the question Signe and I asked about what happens to "add genes" and how a user creates a new gene set, and can be closed when those questions are answered. I know you cited that the work was optional for the publication use case, but it probably needs to be implemented for the workspace use case, right?

How do you propose to address multi-brushing of gene sets (#1584)? Track which genes have been brushed and disable re-brushing with some kind of visual cue to let the user know? I think the publication use case will need to address this issue.
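For reference, a sketch of an alternative to disabling re-brushing: intersect the ranges brushed on the same gene and flag the case where nothing is left, so the UI can warn the user. The `Brush` tuple layout and `combine_brushes` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Brush = Tuple[str, float, float]  # (gene, lower bound, upper bound)


def combine_brushes(brushes: List[Brush]) -> Dict[str, Optional[Tuple[float, float]]]:
    """Intersect all brushed ranges per gene; None marks an empty intersection."""
    combined: Dict[str, Optional[Tuple[float, float]]] = {}
    for gene, low, high in brushes:
        if gene not in combined:
            combined[gene] = (low, high)
            continue
        current = combined[gene]
        if current is None:
            continue  # already empty; further brushes cannot widen it
        new_low, new_high = max(current[0], low), min(current[1], high)
        combined[gene] = (new_low, new_high) if new_low <= new_high else None
    return combined


result = combine_brushes([
    ("CD19", 0.0, 2.0), ("CD19", 3.0, 5.0),   # same gene brushed in two sets, no overlap
    ("NKG7", 1.0, 4.0), ("NKG7", 2.0, 6.0),   # overlapping brushes
])
```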

Do you intend for users to be able to re-order genes within gene sets, and the gene sets themselves, in one of these use cases (for example the workspace use case), or reserve it for later? Where should I put that requirement? When that requirement is accounted for, I believe #1069 can be closed.

colinmegill commented 4 years ago

@signechambers1 great questions!

Yes, that's correct. All genes have to be in a gene set to make the sidebar skim-able and collapse-able. Feedback from users was that we should optimize for many large sets, which does slightly de-optimize for 'quick look', though it's not much slower.
Hadn't considered re-ordering genesets! I assume that'd be useful. I expect users will want to break them into sections and give them headings, as well. In the case of Tabula Muris and the mocks above:

It would mean a heading and sets something like:

Fat immune_nk immune_b mesenchymal_progenitor

Heart and Aorta ...

- I think of this less as search than as filter, but yes, I agree.
- I think this is a later convenience feature (alphabetical sort lets the brain filter quickly).
- This probably is a fuzzy search on set name and membership
- (ie., I can type atrpd and get sets with Aorta or APOD) that would reduce the gene sets that are visible — for an example, see Spotify's search within playlist, though their filter is not as forgiving to errors as I'd like for this given how complicated the names are.
- Would probably be an ever-present, though thin, search bar below the Create new gene set button but above the first set, Spotify also a nice model for that
Yes, scientists want both ingress and egress via csv drop in and see set / csv download.
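A rough sketch of the filter idea using Python's standard `difflib`, matching on set name and membership. The 0.6 cutoff and the `filter_gene_sets` name are assumptions. This version only tolerates near-miss typos on short names, so it is stricter than the "atrpd" example above.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def filter_gene_sets(query: str, gene_sets: Dict[str, List[str]], cutoff: float = 0.6) -> List[str]:
    """Return names of sets whose name or any member gene loosely matches query."""
    needle = query.lower()

    def matches(candidate: str) -> bool:
        text = candidate.lower()
        if needle in text:
            return True  # plain substring hit
        # otherwise fall back to a similarity ratio between 0 and 1
        return SequenceMatcher(None, needle, text).ratio() >= cutoff

    return [
        name
        for name, genes in gene_sets.items()
        if matches(name) or any(matches(gene) for gene in genes)
    ]


example_sets = {
    "Aorta": ["PECAM1", "CDH5"],
    "immune_b": ["CD79A", "CD19"],
    "mesenchymal_progenitor": ["APOD", "PDGFRA"],
}
```

With this sample data, `filter_gene_sets("apod", example_sets)` keeps only the set containing APOD, and a transposed query such as `"aorat"` still finds `Aorta`.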

colinmegill commented 4 years ago

A note: I don't think the data structure should encode the display, so I would still propose we persist gene sets as lists of lists even if we have a heading. The client side could sort them by heading or alphabetically by looking at a heading attribute on the geneset, but I would rather that attribute exist on the geneset.
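Roughly what I mean, as a sketch: storage stays a flat list, and the heading is just an optional attribute the client groups on at render time. `GeneSetRecord` and `group_for_display` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GeneSetRecord:
    name: str
    genes: List[str]
    heading: Optional[str] = None  # display hint only; storage stays flat


def group_for_display(gene_sets: List[GeneSetRecord]) -> Dict[str, List[str]]:
    """Group set names under their heading, both sorted alphabetically."""
    grouped: Dict[str, List[str]] = {}
    for gs in sorted(gene_sets, key=lambda s: (s.heading or "", s.name)):
        # sets without a heading are collected under the empty string
        grouped.setdefault(gs.heading or "", []).append(gs.name)
    return grouped


stored = [
    GeneSetRecord("immune_nk", ["NKG7"], heading="Fat"),
    GeneSetRecord("endothelial", ["PECAM1"], heading="Heart and Aorta"),
    GeneSetRecord("immune_b", ["CD19"], heading="Fat"),
    GeneSetRecord("scratch", ["APOD"]),
]
display = group_for_display(stored)
```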

colinmegill commented 4 years ago

@ambrosejcarr re: what issues are closed, yes, all of those, except #1541 and #1584, which will need to be addressed separately.

colinmegill commented 4 years ago

Creating a gene set:

Adding a gene will also occur in a modal, triggered by the plus button on a gene set:

It will need to encompass adding genes singularly and in bulk, and nice error handling when adding 100's of genes with multiple types of errors.

colinmegill commented 4 years ago

Figma: https://www.figma.com/file/mvGRRSx0fcgBKkLAm3cMJk/Gene-Sets-v0.1?node-id=0%3A1

signechambers1 commented 4 years ago

Closing out design portion from Sprint 1, Colin has linked final figma design above.

Decisions made: The following features are out of scope for initial staging:

Search/filter functionality for genes and gene sets (Spotify search bar is an example)
Re-ordering gene sets
Providing a way for a user to enter info and view info about why a gene is included in a gene set
Including standard / precomputed gene sets

Open product questions to answer in a future sprint:

Do we calculate the expression rate of a gene set based on average count (original idea) or provide other options? (Valentine Svensson example - he could include his own distribution as part of the dataset)
How are we persisting gene sets? (completely in browser, as part of a cookie, or requires login)
What is the workflow for uploading gene sets in bulk in hosted cellxgene?
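For the first question, the "average" option (the original idea) could be sketched as a per-cell mean over the genes in the set. This assumes expression is available as a mapping from gene to per-cell values; `mean_expression_per_cell` is a hypothetical helper and does not cover the other options being discussed.

```python
from typing import Dict, List, Sequence


def mean_expression_per_cell(expression: Dict[str, Sequence[float]], genes: List[str]) -> List[float]:
    """Average the expression of the given genes for each cell."""
    if not genes:
        raise ValueError("gene set is empty")
    columns = [expression[gene] for gene in genes]
    # zip(*columns) walks the cells, yielding one value per gene each time
    return [sum(values) / len(genes) for values in zip(*columns)]


expression = {
    "CD19": [0.0, 2.0, 4.0],   # three cells
    "CD79A": [2.0, 2.0, 0.0],
}
set_score = mean_expression_per_cell(expression, ["CD19", "CD79A"])
```

The resulting per-cell score is what a "color by gene set" action would map onto the color scale.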

chanzuckerberg / cellxgene

Gene sets design & implementation staging #1732
