bkmartinjr commented 5 years ago

This is a proposed use case for public comment. We are considering adding this in a future release. If you have comments, please leave them, or just give us a thumbs up/down reaction. Thanks!

Overview

Several collaborators have started using cellxgene to publish read-only datasets on the web, as companions to papers or projects. Examples:

Kidney cell atlas (kidneycellatlas.org, github)
Tabula Muris Senis (tabula-muris-senis.ds.czbiohub.org)
Atlas of Anopheles hemocytes (https://hemocytes.cellgeni.sanger.ac.uk/)

These deployments of cellxgene all share common characteristics:

Docker or Kubernetes deployment of the Python CLI with no modification. Typically a separate container instance for each data segment/partition, with a long-running EC2 instance hosting the data+app.
Wrapper web site that links out to each separate instance of cellxgene.

This usage model is not officially supported by cellxgene, and has a number of practical challenges:

Challenging to deploy - it requires that the team include a software engineer who has the requisite skills to wrangle cloud infra & docker/kubernetes (note: this is somewhat fixed by the Heroku template, but at the cost of significantly increased expense).
Expensive - it requires long-running EC2 instances (or worse, even larger Heroku costs), which is difficult to justify for very long-lived publication data hosting.
Non-scalable - these solutions do not autoscale in any practical way, and the back-end is primarily constructed for desktop (single-user) deployments.
Fragile - the back-end is built upon a testing/development web server (Flask/Werkzeug dev server), which strongly recommends against deployment as a general purpose web server (link).

This usage has a number of characteristics we can utilize to overcome the challenges:

Data is read-only -- there is no expectation that the user will want to re-annotate the data or perform the iterative workflow typical of desktop cellxgene.
Data is curated -- analysis in the paper has most likely already highlighted interesting marker genes and other metadata, and may have subsetting into smaller data sets. This reduces the value of ad hoc analysis features such as differential expression. Related: issue #852.
Desktop escape hatch -- where desired, the data provider can also provide scientists with an H5AD (or full dataset), enabling download and further analysis using tools of choice (ScanPy, the full cellxgene, etc). The publication web site does not need to be a full-featured analysis workbench.

Concept Sketch

Core idea: publish datasets with a serverless version of cellxgene, ie, a version of the app that works with no cellxgene back-end, and is dedicated to a single dataset:

Usage: a new CLI sub-command, cellxgene publish, converts an H5AD file into a collection of static HTML, JS and binary files. These files contain a static version of the cellxgene application and the data, and can be directly displayed by a browser (without a dedicated server, but still requiring a web server to statically serve the assets).
Deployment of dataset can be accomplished by serving the static files generated by cellxgene publish. Example deployments might include:
- Serving directly from the file system via a local web server or equivalent (browser security restrictions (CORS, etc) do not allow direct serving from a directory via file:// URLs, so this won't solve for the "distribute via dropbox/NAS/gdrive" use cases)
- Storing the files in an S3 bucket, and serving the static website from S3 link.
- Storing on a static hosting service like http://surge.sh or https://www.netlify.com/ or, for tiny data sets, GitHub Pages or equivalent.
There are feature trade-offs -- any features requiring back-end server support will be unavailable/disabled. Currently these are:
- On-demand differential expression of user cell selections
- Features which modify the data, ie, manual annotations
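
For the "local web server" deployment option above, any static file server should be sufficient. A minimal sketch using only the Python standard library follows; the function name serve_static and the directory name in the usage comment are assumptions for illustration, not part of cellxgene:

```python
import functools
import http.server
import threading


def serve_static(directory, port=8000):
    """Serve `directory` over HTTP on localhost and return the running server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Run in a daemon thread so the caller keeps control of the process.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Usage (hypothetical output directory name):
#   server = serve_static("published_dataset", port=8000)
#   ... browse http://127.0.0.1:8000/ ...
#   server.shutdown()
```

Passing port=0 lets the operating system pick a free port, which can then be read from server.server_address[1].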

Benefits:

Trivial deployment - create an S3 bucket, or put on any static hosting provider (including internal web servers).
Cost efficient - no long-running EC2 instances or Heroku is required. S3 or equivalent storage of data sets is very cheap.
Scalable - static hosting scales, and if you need plaid mode, you can use a CDN (the files are static).
Robust - static hosting just works

Additional, longer-term requirements: there are several additional feature requirements that will need to be included to make this solution robustly address known publication use cases. These features can be implemented over time, and are most likely useful in other use cases / user modes:

Color palette - publications surrounding the data set will have a well defined color palette (chosen by the paper authors) for graph visualizations (typically, cluster colors, and other color/metadata assignment). It is desirable that the web visualization match the paper publication palette, making it easy to compare the two. This needs to be a configuration option for the cellxgene publish command, configuring the static web files to use paper-publisher-specified palettes.
Pre-load genes - paper authors will likely want to highlight a specific set of genes, so these genes should be automatically loaded and displayed at start time (ie, don't force the user to type in the gene names).
Cluster vs. all else comparisons - gene rankings between predefined cell sets, eg, comparing a given cluster to all else, are commonly used to summarize differential expression. Example: sc.pl.rank_genes_groups(). These will be useful as a replacement for fully general differential expression, as common comparisons can be precomputed (eg, louvain cluster 0 vs. all else).
Link to source data - there should be a config option to embed a download link in the info menu. This allows a publisher to optionally embed a link to the original source data in their static cellxgene instance.
Script injection - the current index.html script injection CLI parameter should work with this new mode, supporting the analytics use cases (and others).
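
To illustrate the "cluster vs. all else" idea above, here is a deliberately simplified sketch that ranks genes by the difference in mean expression between a cluster and every other cell. This is a stand-in for illustration only: it is not the statistical testing that sc.pl.rank_genes_groups() relies on, and the function name and input layout are assumptions.

```python
from statistics import mean


def rank_genes_vs_rest(expression, clusters, gene_names, top_n=3):
    """For each cluster, rank genes by (mean in cluster - mean in all other cells).

    expression: list of per-cell rows, each a list of per-gene values.
    clusters:   list of cluster labels, one per cell.
    """
    rankings = {}
    for label in sorted(set(clusters)):
        inside = [row for row, c in zip(expression, clusters) if c == label]
        outside = [row for row, c in zip(expression, clusters) if c != label]
        scores = []
        for g, name in enumerate(gene_names):
            in_mean = mean(row[g] for row in inside)
            out_mean = mean(row[g] for row in outside) if outside else 0.0
            scores.append((in_mean - out_mean, name))
        # Largest difference first; ties broken by gene name for determinism.
        scores.sort(key=lambda s: (-s[0], s[1]))
        rankings[label] = [name for _, name in scores[:top_n]]
    return rankings
```

Because the cell sets are fixed at publish time, a result like this could be computed once and written out as a static file alongside the rest of the published data.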

Implementation Sketch

Briefly, the publish command would:

Convert all data to flatbuffer-formatted segments, similar to the slices our REST API already uses:
- varAnnotations
- obsAnnotations
- X (expression matrix) - likely chunked to keep file size and file count in reasonable balance, eg, 10-100 columns per fbs.
- layouts
Generate static index.html and bundle.js with the configuration bound. Likely to use a combination of the current jinja templates and JSON payloads matching the current config and schema routes.
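
The column chunking of X described above could look roughly like the following sketch. The chunk size of 50, the file naming scheme, and the function names are all assumptions for illustration; the actual serialization of each chunk to a flatbuffer is not shown.

```python
def column_chunks(n_columns, chunk_size=50):
    """Yield (start, stop) column ranges that together cover [0, n_columns)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_columns, chunk_size):
        yield start, min(start + chunk_size, n_columns)


def chunk_file_name(start, stop):
    # Hypothetical naming scheme for one chunk of X.
    return "X_%d_%d.fbs" % (start, stop)


def chunk_for_column(col, n_columns, chunk_size=50):
    """Return the name of the chunk file that holds column `col`."""
    if not 0 <= col < n_columns:
        raise IndexError("column out of range")
    start = (col // chunk_size) * chunk_size
    return chunk_file_name(start, min(start + chunk_size, n_columns))
```

With a deterministic scheme like this, the front-end can work out which static file to fetch for a given gene column without asking a server.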

Infra changes required:

Large X expression data matrices typically originate from a sparse H5AD format (to save disk space). To enable this solution for very large, sparse datasets, we will need to enhance the FBS matrix format with sparse array capability (compressed column format at a minimum).
The front-end IO layer (actions/index.html) will need to be reworked to support both the REST API and the direct FBS access.
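
For readers unfamiliar with the compressed column format mentioned above, here is a small pure-Python sketch of the layout. It only illustrates the three arrays involved (values, row indices, column pointers); it is not the FBS encoding, and the function names are assumptions.

```python
def dense_to_csc(rows):
    """Convert a dense row-major matrix (list of lists) to CSC arrays.

    Returns (data, indices, indptr): the non-zero values in column order,
    their row indices, and for each column j the slice
    indptr[j]:indptr[j + 1] into the first two arrays.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    data, indices, indptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            value = rows[i][j]
            if value != 0:
                data.append(value)
                indices.append(i)
        indptr.append(len(data))
    return data, indices, indptr


def csc_column(data, indices, indptr, j, n_rows):
    """Rebuild dense column j from the CSC arrays."""
    column = [0] * n_rows
    for k in range(indptr[j], indptr[j + 1]):
        column[indices[k]] = data[k]
    return column
```

A column-oriented layout suits this use case because the app typically needs all cells for one gene (one column of X) at a time.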

Issues to consider: this model requires that the data be served via HTTP or equivalent, and will not work via file:// URIs (ie, direct file access) due to CORS restrictions. The practical implication is that this will not allow distribution via Dropbox or equivalent, as those do not serve the data using HTTP. Distribution via S3 sites or other static hosting solutions will work fine.

Miscellaneous

Related ideas/comments:

Other sub-command names that have been suggested, as a replacement for "publish": freeze, bundle, package, encapsulate, distill, pickle, serialize, flatten
There are a variety of hosting models being discussed, which could be built on top of this "create publishable data set" tool, for example:
- Query a public data set (eg, HCA) and save a static publish data set
- Create one's own publishable files, and upload to an S3 bucket
- Etc.
Other companion toolchains would be useful for our users, and could be built on top of "publish" (but are separate projects). Examples:
- Streamline the host & serve step, maybe by wrapping the S3 route, etc, into a single command. E.g., run cellxgene package and then cellxgene publish mylabwebsite.com and not deal with S3 yourself. Assumes AWS account creds, etc.
Mobile/tablet support. For a public dataset, scientists want to be able to show a colleague while chatting at a conference, for example.

dburkhardt commented 5 years ago

A major use case for us is being able to provide a browser for our less-computationally inclined collaborators. Typically we identify some sets of clusters, create an embedding, and they want to examine expression of marker genes and identify potentially interesting groups of cells.

However, some of these collaborators are not familiar with the command line. Having a way for them to interact with their data without needing to learn the command line is ideal.

My concern about the static hosting strategy outlined here is the loss of the ability to create manual annotations and identify differentially expressed genes. They like using Loupe because it's easy to open and interact with straight away. I know implementing cellxgene as an electron app is not very lightweight, but I'm concerned about losing the data exploration tools.

bkmartinjr commented 5 years ago

@dburkhardt - thanks for the comments!

I know implementing cellxgene as an electron app is not very lightweight, but I'm concerned about losing the data exploration tools.

We have a (non-electron) solution for this use case in the works. Essentially a "native" app that can run on Win/Mac and maybe Linux if there is demand. It is still pre-release, and not yet committed, but if you are interested in trialing, please reach out to @csweaver

olgabot commented 5 years ago

In terms of naming, the hardest problem, the terms freeze, package or bundle sound good to me. pickle, serialize, flatten are pretty software engineering specific and I think may be alienating to biologist users. At least, when I use those terms around them, they usually laugh (especially with "pickle") and get confused. package or bundle make conceptual sense I think.

cornhundred commented 5 years ago

I'm not sure whether this is applicable to cellxgene, since I think it does not run in Jupyter yet (https://github.com/chanzuckerberg/cellxgene/blob/master/ROADMAP.md#python-api), but for our Clustergrammer2 project (example-notebooks and dashboards) we're hosting interactive single-cell visualizations as:

static HTML files through NBViewer: https://nbviewer.jupyter.org/github/ismms-himc/clustergrammer2-notebooks/blob/master/notebooks/3.0_2700_PBMC_scRNA-seq.ipynb
Runnable Jupyter notebooks through MyBinder: https://mybinder.org/v2/gh/ismms-himc/clustergrammer2-notebooks/master?filepath=notebooks%2F3.0_2700_PBMC_scRNA-seq.ipynb
Voila Dashboard (from a Jupyter notebook): https://github.com/ismms-himc/codex_dashboard and https://voila-gallery.org/services/gallery/

MyBinder provides free compute (with limited RAM). GitHub provides free hosting (limited dataset size of <100MB per file). NBViewer provides free rendering of notebooks on GitHub.

Let us know if this is useful :)

mxposed commented 5 years ago

This is a great sketch, thank you for working on this. Is there something to be done to help implement this?

colinmegill commented 5 years ago

Hi @mxposed! Thanks for your offer to contribute. You can join the open slack for cellxgene here: https://join-cellxgene-users.herokuapp.com/ to collaborate

joshua-gould commented 5 years ago

It would also be good to think about enabling users to add authentication and authorization hooks. For example, I can put my static files in a bucket and make them readable by a list of authorized users. The cellxgene interface would ideally provide a button to log in/log out with a provider of choice (e.g. Google).

msmicker commented 4 years ago

Would there be any benefit to having a capability added on the server backend to feed the dataset based on a url parameter rather than server startup? We have easily integrated another tool with this capability. e.g. on host a base directory might be specified upon server startup. h5ad paths could be provided as relative references to that base path by url parameter.

e.g. https://host/cellxgene?dataset=sample-dataset/file.h5ad or if there is a conversion (upon upload of datasets) to html/static files, then it could be reference to a directory holding files matching the proper spec.
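
A rough sketch of how such a url parameter might be resolved against the base directory is below. The function name is hypothetical and this is not existing cellxgene behavior; the main point is that the relative reference has to be checked so it cannot escape the base path.

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def resolve_dataset(url, base_dir):
    """Map the `dataset` query parameter to a path inside base_dir, or None."""
    values = parse_qs(urlparse(url).query).get("dataset")
    if not values:
        return None
    base = Path(base_dir).resolve()
    candidate = (base / values[0]).resolve()
    # Reject anything that is not strictly inside the base directory,
    # e.g. "../secret.h5ad" or an absolute path.
    if base not in candidate.parents:
        return None
    return candidate
```

Whether the resolved path points at an h5ad file or at a directory of converted static files would be up to the server.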

chris-rands commented 4 years ago

This seems very useful- are there plans to implement the cellxgene publish functionality in a particular timeframe or is this still in the discussion phase? Thank you!

bkmartinjr commented 4 years ago

We are still in discussion phases. The team is primarily focusing on finishing annotations at the moment. We plan to revise & re-publish our entire roadmap by end of year.

signechambers1 commented 4 years ago

Closing this request for comment; parts of this implementation will be covered by "Publishers want to publish a private collection".

chanzuckerberg / cellxgene

Data publication use case (request for comment) #875
