chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data
https://chanzuckerberg.github.io/cellxgene/
MIT License

Data publication use case (request for comment) #875

Closed bkmartinjr closed 4 years ago

bkmartinjr commented 5 years ago

This is a proposed use case for public comment. We are considering adding this in a future release. If you have comments, please leave them, or just give us a thumbs up/down reaction. Thanks!

Overview

Several collaborators have started using cellxgene to publish read-only datasets on the web, as companions to papers or projects. Examples:

These deployments of cellxgene all share common characteristics:

This usage model is not officially supported by cellxgene, and has a number of practical challenges:

  1. Challenging to deploy - it requires that the team include a software engineer who has the requisite skills to wrangle cloud infra & docker/kubernetes (note: this is somewhat fixed by the Heroku template, but at the cost of significantly increased expense).
  2. Expensive - it requires long-running EC2 instances (or worse, even larger Heroku costs), which is difficult to justify for very long-lived publication data hosting.
  3. Non-scalable - these solutions do not autoscale in any practical way, and the back-end is primarily constructed for desktop (single-user) deployments.
  4. Fragile - the back-end is built upon a testing/development web server (the Flask/Werkzeug dev server), whose own documentation strongly recommends against using it as a general-purpose web server (link).

This usage has a number of characteristics we can utilize to overcome the challenges:

Concept Sketch

Core idea: publish datasets with a serverless version of cellxgene, i.e., a version of the app that works with no cellxgene back-end and is dedicated to a single dataset:

Benefits:

Additional, longer-term requirements: there are several additional feature requirements that will need to be included to make this solution robustly address known publication use cases. These features can be implemented over time, and are most likely useful in other use cases / user modes:

Implementation Sketch

Briefly, the publish command would:

Infra changes required:

Issues to consider: this model requires that the data be served via HTTP or equivalent, and will not work via the file:// URI (i.e., direct file access) because browsers apply CORS/same-origin restrictions to pages loaded from local files. The practical implication is that this will not allow distribution via Dropbox or equivalent, as those do not serve the data using HTTP. Distribution via S3 sites or other static hosting solutions will work fine.
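
For local testing, any static HTTP server is enough to avoid the file:// limitation. The following is a minimal sketch using only the Python standard library, assuming the published output is a directory of static files. The names `CORSRequestHandler` and `serve_static` and the wildcard CORS header are illustrative assumptions, not part of cellxgene:

```python
import functools
import http.server
import threading


class CORSRequestHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that also sends a permissive CORS header."""

    def end_headers(self):
        # Assumption: a wildcard origin is acceptable for public, read-only data.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def serve_static(directory, port=8000):
    """Serve `directory` over HTTP on localhost in a background thread.

    Returns the server object; call .shutdown() to stop it.
    """
    handler = functools.partial(CORSRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_static("path/to/published/output")` and opening `http://127.0.0.1:8000/` in a browser would load the files over HTTP rather than file://. This is only suitable for local checks; a public deployment would use S3 or another static host as described above.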

Miscellaneous

Related ideas/comments:

dburkhardt commented 5 years ago

A major use case for us is being able to provide a browser for our less-computationally inclined collaborators. Typically we identify some sets of clusters, create an embedding, and they want to examine expression of marker genes and identify potentially interesting groups of cells.

However, some of these collaborators are not familiar with the command line. Having a way for them to interact with their data without needing to learn the command line is ideal.

My concern about this static hosting strategy outlined is the loss of abilities to create manual annotations and identify differentially expressed genes. They like using Loupe because it's easy to open and interact with straight away. I know implementing cellxgene as an electron app is not very lightweight, but I'm concerned about losing the data exploration tools.

bkmartinjr commented 5 years ago

@dburkhardt - thanks for the comments!

> I know implementing cellxgene as an electron app is not very lightweight, but I'm concerned about losing the data exploration tools.

We have a (non-electron) solution for this use case in the works: essentially a "native" app that can run on Windows and Mac, and maybe Linux if there is demand. It is still pre-release and not yet committed, but if you are interested in trialing it, please reach out to @csweaver.

olgabot commented 5 years ago

In terms of naming (the hardest problem), the terms freeze, package, or bundle sound good to me. "pickle", "serialize", and "flatten" are pretty specific to software engineering, and I think they may be alienating to biologist users. At least, when I use those terms around them, they usually laugh (especially with "pickle") and get confused. "package" or "bundle" make conceptual sense, I think.

cornhundred commented 5 years ago

I'm not sure whether this is applicable to Cellxgene, since I think it does not run in Jupyter yet (https://github.com/chanzuckerberg/cellxgene/blob/master/ROADMAP.md#python-api), but for our Clustergrammer2 project (example-notebooks and dashboards) we're hosting interactive single-cell visualizations as:

MyBinder provides free compute (with limited RAM). GitHub provides free hosting (limited dataset size of <100MB per file). NBViewer provides free rendering of notebooks on GitHub.

Let us know if this is useful :)

mxposed commented 5 years ago

This is a great sketch, thank you for working on this. Is there something to be done to help implement this?

colinmegill commented 5 years ago

Hi @mxposed! Thanks for your offer to contribute. You can join the open Slack for cellxgene here to collaborate: https://join-cellxgene-users.herokuapp.com/

joshua-gould commented 5 years ago

It would also be good to think about enabling users to add authentication and authorization hooks. For example, I can put my static files in a bucket and make them readable by a list of authorized users. The cellxgene interface would ideally provide a button to log in to and out of a provider of choice (e.g. Google).

msmicker commented 4 years ago

Would there be any benefit to adding a capability on the server back-end to select the dataset based on a URL parameter rather than at server startup? We have easily integrated another tool with this capability. For example, a base directory on the host might be specified upon server startup, and h5ad paths could then be provided by URL parameter as references relative to that base path.

e.g. https://host/cellxgene?dataset=sample-dataset/file.h5ad. Or, if there is a conversion (upon upload of datasets) to html/static files, then the parameter could be a reference to a directory holding files matching the proper spec.
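
To make the idea concrete, here is a rough sketch of how a `dataset` URL parameter could be mapped to a file under a base directory. It is not cellxgene code: the function name `resolve_dataset` is hypothetical, and the path-traversal check is an added assumption about what a server would want, since the parameter is user-supplied:

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def resolve_dataset(base_dir, url):
    """Map the `dataset` query parameter of `url` to a path under `base_dir`.

    Raises ValueError if the parameter is missing, is not an .h5ad path,
    or would resolve outside `base_dir`. Does not check that the file exists.
    """
    values = parse_qs(urlparse(url).query).get("dataset")
    if not values:
        raise ValueError("missing 'dataset' query parameter")

    base = Path(base_dir).resolve()
    candidate = (base / values[0]).resolve()

    # Reject absolute paths and '..' segments that leave the base directory.
    if base not in candidate.parents:
        raise ValueError("dataset path is outside the base directory")
    if candidate.suffix != ".h5ad":
        raise ValueError("dataset must be an .h5ad file")
    return candidate
```

With a base directory of `/data`, the example URL above would resolve to `/data/sample-dataset/file.h5ad`, while `?dataset=../secret.h5ad` would be rejected.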

chris-rands commented 4 years ago

This seems very useful. Are there plans to implement the cellxgene publish functionality in a particular timeframe, or is this still in the discussion phase? Thank you!

bkmartinjr commented 4 years ago

We are still in discussion phases. The team is primarily focusing on finishing annotations at the moment. We plan to revise & re-publish our entire roadmap by end of year.

signechambers1 commented 4 years ago

Closing this request for comment; parts of this implementation will be covered by "Publishers want to publish a private collection".