chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data
https://chanzuckerberg.github.io/cellxgene/
MIT License

Gene sets design & implementation staging #1732

Closed ambrosejcarr closed 4 years ago

ambrosejcarr commented 4 years ago

Appetite: 10

Requirements:

Existing issues:

Design work:

colinmegill commented 4 years ago

Draft Design Doc for Gene Sets

This issue is indebted to work done by @sidneybell, @liaprins and @bkmartin, as well as user feedback from @angela and @bruceAranow

Problem & Background

Presently, in cellxgene, scientists have few affordances for managing genes. The application, to date, has been primarily focused on exploring cells. This has also determined the application’s internal data model.

This feature represents, at a high level, the addition of what we are calling ‘gene sets’ to the right sidebar. Gene sets are a list of lists: each gene set has a user-generated name and contains a list of genes.

Users must be able to act on both gene sets (adding a set, deleting a set, duplicating a set, coloring by the set as a whole) and individual genes within sets (adding, removing and reordering genes).
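As a rough illustration of the operations listed above, here is a minimal sketch of a "list of lists" model. The class names `GeneSet` and `Workspace`, the method names, and the example gene symbols are all hypothetical and are not taken from the cellxgene codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class GeneSet:
    """One named, ordered list of gene symbols (hypothetical model)."""
    name: str
    genes: List[str] = field(default_factory=list)

    def add_gene(self, gene: str) -> None:
        # Skip duplicates so a gene appears at most once per set.
        if gene not in self.genes:
            self.genes.append(gene)

    def remove_gene(self, gene: str) -> None:
        # Raises ValueError if the gene is not in the set.
        self.genes.remove(gene)

    def move_gene(self, gene: str, new_index: int) -> None:
        # Reorder by removing the gene and re-inserting it at new_index.
        self.genes.remove(gene)
        self.genes.insert(new_index, gene)


class Workspace:
    """A collection of gene sets keyed by name (hypothetical)."""

    def __init__(self) -> None:
        self.sets: Dict[str, GeneSet] = {}

    def add_set(self, name: str, genes: Iterable[str] = ()) -> GeneSet:
        if name in self.sets:
            raise ValueError(f"gene set {name!r} already exists")
        self.sets[name] = GeneSet(name, list(genes))
        return self.sets[name]

    def delete_set(self, name: str) -> None:
        del self.sets[name]

    def duplicate_set(self, name: str, new_name: str) -> GeneSet:
        # add_set copies the list, so edits to the duplicate
        # leave the original set untouched.
        return self.add_set(new_name, self.sets[name].genes)


# Example usage with illustrative gene symbols.
ws = Workspace()
ws.add_set("immune_b", ["CD79A", "MS4A1"])
ws.sets["immune_b"].add_gene("CD19")
ws.sets["immune_b"].move_gene("CD19", 0)
dup = ws.duplicate_set("immune_b", "immune_b_copy")
dup.remove_gene("MS4A1")
```

Coloring by a set is left out here because it depends on how set-level expression is computed, which this issue leaves open.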

Users can presently add genes to the right sidebar, but they are not persisted: on a hard refresh of the browser, the genes disappear and must be re-added. This solution implies a persistent workspace for genes, saved either locally as CSV or to some database solution in the cloud.

Gene sets will provide a more powerful affordance for building up a workspace, but will also generate untenably long lists given the heights of the histograms in the present gene UI. Thus individual genes within a gene set must have a ‘collapsed’ state, with a mini histogram (as appears on categories when coloring by a gene) or some metric which describes whether or not the gene is expressed at all by the current world of cells. The collapsed state will also include the option to color by individual genes.

Users will have CSVs of genes that they work with, or could generate them from notebooks. Cellxgene should support import of some kind to ease this path in a hosted environment. Locally, the CSV can be both edited directly from cellxgene and loaded up / referenced in a notebook, so this is less of a problem.
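To make the CSV round trip concrete, here is a minimal sketch of import and export. The two-column layout (`gene_set_name`, `gene_symbol`, one row per gene) and the function names are assumptions for illustration; this issue does not fix a file format.

```python
import csv
import io
from typing import Dict, List

# Assumed layout: one row per (set, gene) pair. Not a format decided in this issue.
FIELDS = ["gene_set_name", "gene_symbol"]


def export_gene_sets(gene_sets: Dict[str, List[str]]) -> str:
    """Serialize {set name: [genes]} to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for name, genes in gene_sets.items():
        for gene in genes:
            writer.writerow({"gene_set_name": name, "gene_symbol": gene})
    return buf.getvalue()


def import_gene_sets(text: str) -> Dict[str, List[str]]:
    """Parse CSV text back into {set name: [genes]}, preserving row order."""
    gene_sets: Dict[str, List[str]] = {}
    for row in csv.DictReader(io.StringIO(text)):
        name = row["gene_set_name"].strip()
        gene_sets.setdefault(name, []).append(row["gene_symbol"].strip())
    return gene_sets


# Example usage with illustrative gene symbols.
sets = {"immune_b": ["CD79A", "MS4A1"], "immune_nk": ["NKG7"]}
csv_text = export_gene_sets(sets)
restored = import_gene_sets(csv_text)
```

One limitation of this layout is that a gene set with no genes produces no rows and so does not survive export; a real format would need a way to represent empty sets.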

User must be able to

Gene set + import + CRUD + export lifecycle

Gene CRUD lifecycle

Sharing

ambrosejcarr commented 4 years ago

Proposed staging

1. Publication use case: Gene sets can be added during launch/ingest

2. Workspace use case: Users can create a new gene set and give it a name. Users can add, remove, and reorder genes within gene sets

3. Users can import and export gene sets using the client

4. Differential expression creates a gene set

5. Users can share gene sets with others

ambrosejcarr commented 4 years ago

Use cases.

from clusters

manually building

from differential expression

for publication

colinmegill commented 4 years ago

Publication use case

image

colinmegill commented 4 years ago

Workspace use case

Screen Shot 2020-08-26 at 6 33 07 PM

ambrosejcarr commented 4 years ago

What does the "create new gene set" flow look like?

I think you mentioned before that the + and ... might be hidden until mouseover -- I think I prefer that; this looks a bit busy relative to the left sidebar.

ambrosejcarr commented 4 years ago

Comment from Jonah:

My curiosity is about how/if we can precompute some standard gene sets (whether they be cells or pathways) that can be quickly referenced or imported. For example, many people want to see where myc signaling is active or Wnt ligands target genes etc. Heard him say that sharing is a future feature and this admittedly falls in the intermediate area.

signechambers1 commented 4 years ago

Great demo during sprint review @colinmegill! A few questions for you or @ambrosejcarr:

  1. If a user wants to add a single gene (eg the current workflow in the right sidebar), is the expected workflow now to create a gene set with one gene in it?

  2. You mentioned a user can reorder genes within a gene set, can the user reorder the genesets? This might be more of a heatmap use case (thinking back to our convo with the Krasnow lab yesterday) and something we would tackle then.

  3. If we expect ~100 lists of ~100 genes each, I think a search function for genesets and genes would be helpful, what do you think?

  4. Downloading a gene set seems like a less common use case but I could imagine it (ie sharing a geneset with collaborators, importing into other tools, etc.). Has this come up in discussions at all?

ambrosejcarr commented 4 years ago

Great questions @signechambers1. For this one:

  4. Downloading a gene set seems like a less common use case but I could imagine it (ie sharing a geneset with collaborators, importing into other tools, etc.). Has this come up in discussions at all?

Your intuition is good. The explanation for why it's missing is procedural. I only requested that Colin mock up the first two use cases so we don't get too far ahead of the implementation team. If you check the second comment in this issue, import & export come third, followed by linking to differential expression, and finally on-platform sharing.

ambrosejcarr commented 4 years ago

@colinmegill I believe this issue closes #1538, #1539, #1541, #571, and #852. Do you agree?

I think #923 relates to both of the questions Signe and I asked, about what happens to "add genes" and how a user creates a new gene set, and can be closed when those questions are answered. I know you cited that the work was optional for the publication use case, but it probably needs to be implemented for the workspace use case, right?

How do you propose to address multi-brushing of gene sets (#1584)? Track which genes have been brushed and disable re-brushing with some kind of visual cue to let the user know? I think the publication use case will need to address this issue.

Do you intend for users to be able to re-order genes within gene sets, and gene sets within the workspace, as part of one of these use cases, or should that be reserved for later? Where should I put that requirement? When that requirement is accounted for, I believe #1069 can be closed.

colinmegill commented 4 years ago

@signechambers1 great questions!

  1. Yes, that's correct. All genes have to be in a gene set to make the sidebar skim-able and collapse-able. Feedback from users was that we should optimize for many large sets, which does slightly de-optimize for 'quick look', though it's not much slower.

  2. Hadn't considered re-ordering genesets! I assume that'd be useful. I expect users will want to break them into sections and give them headings, as well. In the case of Tabula Muris and the mocks above:

Screen Shot 2020-08-31 at 11 13 31 AM

It would mean a heading and sets something like:

Fat: immune_nk, immune_b, mesenchymal_progenitor

Heart and Aorta: ...

  3. I think of this less as search than as filter, but yes, I agree.
    • I think this is a later convenience feature (alphabetical sort lets the brain filter quickly).
    • This probably is a fuzzy search on set name and membership (i.e., I can type atrpd and get sets with Aorta or APOD) that would reduce the gene sets that are visible. For an example, see Spotify's search within playlist, though their filter is not as forgiving of errors as I'd like for this, given how complicated the names are.
    • It would probably be an ever-present, though thin, search bar below the Create new gene set button but above the first set. Spotify is also a nice model for that.
  4. Yes, scientists want both ingress and egress via CSV drop-in and set / CSV download.
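The fuzzy filter on set name and membership described in the reply above could be sketched roughly as follows. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever matcher the client would actually use; the function names, the 0.7 threshold, and the example data are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def fuzzy_hit(query: str, candidate: str, threshold: float = 0.7) -> bool:
    """True if query is a substring of candidate or is similar enough to it."""
    q, c = query.lower(), candidate.lower()
    if q in c:
        return True
    # ratio() is 2 * matches / total length, between 0.0 and 1.0.
    return SequenceMatcher(None, q, c).ratio() >= threshold


def filter_gene_sets(
    query: str, gene_sets: Dict[str, List[str]], threshold: float = 0.7
) -> List[str]:
    """Return names of sets whose name words or member genes match the query."""
    visible = []
    for name, genes in gene_sets.items():
        candidates = name.split() + list(genes)
        if any(fuzzy_hit(query, c, threshold) for c in candidates):
            visible.append(name)
    return visible


# Example usage: a transposition typo still finds the set by name,
# and a gene symbol finds the set by membership.
example_sets = {"Aorta": ["APOD"], "Fat": ["LEP"]}
by_name = filter_gene_sets("aotra", example_sets)
by_gene = filter_gene_sets("apod", example_sets)
```

How forgiving the filter feels depends heavily on the threshold and the matching algorithm, so both would need tuning against real set and gene names.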

colinmegill commented 4 years ago

A note: I don't think the data structure should encode the display, so I would still propose we persist gene sets as lists of lists even if we have a heading. The client side could sort them by heading or alphabetically by looking at a heading attribute on the geneset, but I would rather that attribute exist on the geneset.
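A minimal sketch of that idea, assuming each persisted gene set is a flat record with an optional `heading` attribute (the field names and the example sets are hypothetical): the stored structure stays a flat list, and grouping happens only at display time.

```python
from itertools import groupby
from typing import Dict, List

# Persisted form: a flat list of gene sets. The heading is just an attribute.
stored_sets = [
    {"name": "immune_nk", "heading": "Fat", "genes": []},
    {"name": "example_set", "heading": "Aorta", "genes": []},
    {"name": "immune_b", "heading": "Fat", "genes": []},
]


def group_for_display(gene_sets: List[dict]) -> Dict[str, List[str]]:
    """Group set names under their heading, sorted by heading then name."""
    def heading_of(gene_set: dict) -> str:
        return gene_set.get("heading", "")

    ordered = sorted(gene_sets, key=lambda s: (heading_of(s), s["name"]))
    # groupby needs its input sorted by the grouping key, which it is here.
    return {
        heading: [s["name"] for s in group]
        for heading, group in groupby(ordered, key=heading_of)
    }


grouped = group_for_display(stored_sets)
```

Because the grouping is derived, switching the sidebar to a purely alphabetical view would only change the sort key, not the stored data.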

colinmegill commented 4 years ago

@ambrosejcarr re: what issues are closed, yes, all of those, except #1541 and #1584, which will need to be addressed separately.

colinmegill commented 4 years ago

Creating a gene set:

Screen Shot 2020-08-31 at 1 03 30 PM

Adding a gene will also occur in a modal, triggered by the plus button on a gene set:

Screen Shot 2020-08-31 at 5 05 42 PM

It will need to encompass adding genes singly and in bulk, with nice error handling when adding hundreds of genes with multiple types of errors.
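One way to handle bulk input is to sort every pasted token into a bucket, so the modal can report all problems at once instead of failing on the first. The sketch below assumes three error types (repeated in the input, already in the set, not present in the dataset); the function name, bucket names, and example symbols are hypothetical.

```python
import re
from typing import Dict, Iterable, List


def parse_bulk_genes(
    raw: str, known_genes: Iterable[str], already_in_set: Iterable[str]
) -> Dict[str, List[str]]:
    """Split pasted text on commas/whitespace and classify each gene symbol."""
    known = set(known_genes)
    existing = set(already_in_set)
    tokens = [t for t in re.split(r"[,\s]+", raw.strip()) if t]

    result: Dict[str, List[str]] = {
        "added": [],           # valid and new to the set
        "duplicate": [],       # repeated within the pasted input
        "already_in_set": [],  # present in the set before this paste
        "unknown": [],         # not found in the dataset
    }
    seen = set()
    for token in tokens:
        if token in seen:
            result["duplicate"].append(token)
        elif token in existing:
            result["already_in_set"].append(token)
        elif token not in known:
            result["unknown"].append(token)
        else:
            result["added"].append(token)
        seen.add(token)
    return result


# Example usage with illustrative gene symbols.
report = parse_bulk_genes(
    "CD19, MS4A1\nCD19 NOTAGENE CD79A",
    known_genes=["CD19", "MS4A1", "CD79A"],
    already_in_set=["MS4A1"],
)
```

Returning a report instead of raising lets the UI add the valid genes and still list every rejected one with its reason.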

colinmegill commented 4 years ago

Figma: https://www.figma.com/file/mvGRRSx0fcgBKkLAm3cMJk/Gene-Sets-v0.1?node-id=0%3A1

signechambers1 commented 4 years ago

Closing out the design portion from Sprint 1; Colin has linked the final Figma design above.

Decisions made: The following features are out of scope for initial staging:

  1. Search/filter functionality for genes and gene sets (Spotify search bar is an example)
  2. Re-ordering gene sets
  3. Providing a way for a user to enter and view info about why a gene is included in a gene set
  4. Including standard / precomputed gene sets

Open product questions to answer in a future sprint:

  1. Do we calculate the expression rate of a gene set based on average count (original idea) or provide other options? (Valentine Svensson example - he could include his own distribution as part of the dataset)
  2. How are we persisting gene sets? (completely in browser, as part of a cookie, or requiring login)
  3. What is the workflow for uploading gene sets in bulk in hosted cellxgene?
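For the first open question, one possible reading of "average count" is a per-cell mean over the member genes of a set. The sketch below illustrates only that reading and does not settle the question; the function name, the dict-of-lists layout, and the numbers are made up for illustration.

```python
from statistics import mean
from typing import Dict, List, Sequence


def gene_set_score(
    expression: Dict[str, Sequence[float]], gene_set: List[str]
) -> List[float]:
    """Per-cell mean expression over the genes of a set.

    expression maps gene symbol -> per-cell counts (same cell order for all genes).
    Genes missing from the dataset are ignored.
    """
    present = [g for g in gene_set if g in expression]
    if not present:
        raise ValueError("none of the genes in the set are in the dataset")
    n_cells = len(expression[present[0]])
    return [mean(expression[g][i] for g in present) for i in range(n_cells)]


# Example usage: two genes measured in three cells (made-up counts).
counts = {"GENE_A": [0, 2, 4], "GENE_B": [2, 2, 0]}
scores = gene_set_score(counts, ["GENE_A", "GENE_B", "GENE_MISSING"])
```

A plain mean lets highly expressed genes dominate the score, which is one reason other options (or a user-supplied distribution) may be worth offering.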