DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Evaluate cellxgene locally #1590

Closed hannes-ucsc closed 4 years ago

hannes-ucsc commented 4 years ago

We might be asked to host CZI's cellxgene for the HCA, on the HCA AWS accounts.

The following questions should be answered by installing and running cellxgene locally:

1) How do we get an input file to run it on?

2) How long does it take to initialize the server process, potentially as a function of the input size?

3) What dependencies does it have?

4) Does one instance support multiple input datasets?

5) Can one run multiple instances on one input dataset? The documentation seems to indicate that one cannot.

theathorn commented 4 years ago

A good example: https://www.gutcellatlas.org/tcr-bcr

nadove-ucsc commented 4 years ago
  1. There are three example files available for download from https://chanzuckerberg.github.io/cellxgene/posts/demo-data
nadove-ucsc commented 4 years ago
  2. Average of 10 separate startups with pre-downloaded files:
filename                 filesize (bytes)  startup (hh:mm:ss)
pbmc3k.h5ad                      22189122     00:00:01.393100
output_.seurat.h5ad              45788728     00:00:02.525500
tabula-muris.h5ad              1178565335     00:00:01.811200
tabula-muris-senis.h5ad        3844684983     00:02:12.281500

cellxgene also has a "backed" mode that accelerates file loading and saves memory but slows down analysis once the server is up. I did not test the slowdown, but the startup speedup was remarkable, especially on the largest file:

filename                 filesize (bytes)  startup (hh:mm:ss)
pbmc3k.h5ad                      22189122     00:00:00.935800
output_.seurat.h5ad              45788728     00:00:00.939000
tabula-muris.h5ad              1178565335     00:00:01.930800
tabula-muris-senis.h5ad        3844684983     00:00:02.810500
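
For a direct comparison of the two tables, here is a minimal sketch that computes the per-file startup speedup of backed mode. The times are the averaged values reported above, converted to seconds; the variable and function names are illustrative only.

```python
# Averaged startup times in seconds, copied from the two tables above.
startup_default = {
    "pbmc3k.h5ad": 1.3931,
    "output_.seurat.h5ad": 2.5255,
    "tabula-muris.h5ad": 1.8112,
    "tabula-muris-senis.h5ad": 132.2815,  # 00:02:12.2815
}
startup_backed = {
    "pbmc3k.h5ad": 0.9358,
    "output_.seurat.h5ad": 0.9390,
    "tabula-muris.h5ad": 1.9308,
    "tabula-muris-senis.h5ad": 2.8105,
}


def backed_speedup(default, backed):
    """Return {filename: default_time / backed_time}."""
    return {name: default[name] / backed[name] for name in default}


speedups = backed_speedup(startup_default, startup_backed)
for name, factor in speedups.items():
    print(f"{name:<26} {factor:6.2f}x")
```

On these numbers the largest file starts roughly 47 times faster in backed mode, while tabula-muris.h5ad is marginally slower.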
nadove-ucsc commented 4 years ago
  3. In summary:

Complete dependency tree:

name                                             summary
-----------------------------------------------  -----------------------------------------------------------------------
cellxgene                                        Web application for exploration of large scale scRNA-seq datasets
├── Flask-Caching>=1.4.0                         Adds caching support to your Flask application
│   └── Flask                                    A simple framework for building complex web applications.
│       ├── Jinja2>=2.10.1                       A very fast and expressive template engine.
│       │   └── MarkupSafe>=0.23                 Safely add untrusted strings to HTML/XML markup.
│       ├── Werkzeug>=0.15                       The comprehensive WSGI web application library.
│       ├── click>=5.1                           Composable command line interface toolkit
│       └── itsdangerous>=0.24                   Various helpers to pass data to untrusted environments and back.
├── Flask-Compress>=1.4.0                        Compress responses in your Flask app with gzip.
│   └── Flask                                    A simple framework for building complex web applications.
│       ├── Jinja2>=2.10.1                       A very fast and expressive template engine.
│       │   └── MarkupSafe>=0.23                 Safely add untrusted strings to HTML/XML markup.
│       ├── Werkzeug>=0.15                       The comprehensive WSGI web application library.
│       ├── click>=5.1                           Composable command line interface toolkit
│       └── itsdangerous>=0.24                   Various helpers to pass data to untrusted environments and back.
├── Flask-Cors>=3.0.6                            A Flask extension adding a decorator for CORS support
│   ├── Flask>=0.9                               A simple framework for building complex web applications.
│   │   ├── Jinja2>=2.10.1                       A very fast and expressive template engine.
│   │   │   └── MarkupSafe>=0.23                 Safely add untrusted strings to HTML/XML markup.
│   │   ├── Werkzeug>=0.15                       The comprehensive WSGI web application library.
│   │   ├── click>=5.1                           Composable command line interface toolkit
│   │   └── itsdangerous>=0.24                   Various helpers to pass data to untrusted environments and back.
│   └── Six                                      Python 2 and 3 compatibility utilities
├── Flask-RESTful>=0.3.6                         Simple framework for creating REST APIs
│   ├── Flask>=0.8                               A simple framework for building complex web applications.
│   │   ├── Jinja2>=2.10.1                       A very fast and expressive template engine.
│   │   │   └── MarkupSafe>=0.23                 Safely add untrusted strings to HTML/XML markup.
│   │   ├── Werkzeug>=0.15                       The comprehensive WSGI web application library.
│   │   ├── click>=5.1                           Composable command line interface toolkit
│   │   └── itsdangerous>=0.24                   Various helpers to pass data to untrusted environments and back.
│   ├── aniso8601>=0.82                          A library for parsing ISO 8601 strings.
│   ├── pytz                                     World timezone definitions, modern and historical
│   └── six>=1.3.0                               Python 2 and 3 compatibility utilities
├── Flask>=1.0.2                                 A simple framework for building complex web applications.
│   ├── Jinja2>=2.10.1                           A very fast and expressive template engine.
│   │   └── MarkupSafe>=0.23                     Safely add untrusted strings to HTML/XML markup.
│   ├── Werkzeug>=0.15                           The comprehensive WSGI web application library.
│   ├── click>=5.1                               Composable command line interface toolkit
│   └── itsdangerous>=0.24                       Various helpers to pass data to untrusted environments and back.
├── anndata==0.6.22post1                         Annotated Data.
│   ├── h5py                                     Read and write HDF5 files from Python
│   │   ├── numpy>=1.7                           NumPy is the fundamental package for array computing with Python.
│   │   └── six                                  Python 2 and 3 compatibility utilities
│   ├── natsort                                  Simple yet flexible natural sorting in Python.
│   ├── numpy~=1.14                              NumPy is the fundamental package for array computing with Python.
│   ├── pandas>=0.23.0                           Powerful data structures for data analysis, time series, and statistics
│   │   ├── numpy>=1.13.3                        NumPy is the fundamental package for array computing with Python.
│   │   ├── python-dateutil>=2.6.1               Extensions to the standard Python datetime module
│   │   │   └── six>=1.5                         Python 2 and 3 compatibility utilities
│   │   └── pytz>=2017.2                         World timezone definitions, modern and historical
│   └── scipy~=1.0                               SciPy: Scientific Library for Python
│       └── numpy>=1.13.3                        NumPy is the fundamental package for array computing with Python.
├── click>=6.7                                   Composable command line interface toolkit
├── fastobo>=0.6.1                               Faultless AST for Open Biomedical Ontologies in Python.
├── flatbuffers>=1.10.0                          The FlatBuffers serialization format for Python
├── fsspec>=0.4.4                                File-system specification
├── h5py==2.9.0                                  Read and write HDF5 files from Python
│   ├── numpy>=1.7                               NumPy is the fundamental package for array computing with Python.
│   └── six                                      Python 2 and 3 compatibility utilities
├── numpy>=1.15.2                                NumPy is the fundamental package for array computing with Python.
├── pandas>=0.24.2                               Powerful data structures for data analysis, time series, and statistics
│   ├── numpy>=1.13.3                            NumPy is the fundamental package for array computing with Python.
│   ├── python-dateutil>=2.6.1                   Extensions to the standard Python datetime module
│   │   └── six>=1.5                             Python 2 and 3 compatibility utilities
│   └── pytz>=2017.2                             World timezone definitions, modern and historical
├── requests>=2.22.0                             Python HTTP for Humans.
│   ├── certifi>=2017.4.17                       Python package for providing Mozilla's CA Bundle.
│   ├── chardet<4,>=3.0.2                        Universal encoding detector for Python 2 and 3
│   ├── idna<3,>=2.5                             Internationalized Domain Names in Applications (IDNA)
│   └── urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1  HTTP library with thread-safe connection pooling, file post, and more.
├── scipy>=1.3.0                                 SciPy: Scientific Library for Python
│   └── numpy>=1.13.3                            NumPy is the fundamental package for array computing with Python.
└── tables==3.5.1                                Hierarchical datasets for Python
    ├── mock>=2.0                                Rolling backport of unittest.mock for all Pythons
    ├── numexpr>=2.6.2                           Fast numerical expression evaluator for NumPy
    │   └── numpy>=1.7                           NumPy is the fundamental package for array computing with Python.
    ├── numpy>=1.9.3                             NumPy is the fundamental package for array computing with Python.
    └── six>=1.9.0                               Python 2 and 3 compatibility utilities
nadove-ucsc commented 4 years ago
  4. It appears not. Launching fails unless exactly one file is provided on the command line. The command line help and online documentation offer no alternative approaches.
nadove-ucsc commented 4 years ago
  5. It appears we can. I believe this is a typo/grammar flaw in the docs, and "one instance per dataset" is supposed to read "one dataset per instance" (as per point 4).

I was able to launch cellxgene in two parallel bash sessions in the same directory on the same data file. They started on different ports on localhost (127.0.0.1:5005 and 127.0.0.1:5006), and could be manipulated and terminated independently.
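
The same experiment could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, and it assumes the `--port` and `--backed` options of `cellxgene launch`, which should be checked against `cellxgene launch --help` for the installed version.

```python
import socket
import subprocess


def free_port():
    """Ask the OS for a currently unused TCP port on localhost.

    Note: the port is released before cellxgene binds it, so another
    process could in principle grab it in between.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def launch_command(data_file, port, backed=False):
    """Build the argv for one cellxgene instance (assumed CLI options)."""
    cmd = ["cellxgene", "launch", data_file, "--port", str(port)]
    if backed:
        cmd.append("--backed")
    return cmd


def launch_instances(data_file, count=2):
    """Start `count` independent servers on the same input file."""
    return [
        subprocess.Popen(launch_command(data_file, free_port()))
        for _ in range(count)
    ]


# Example (requires cellxgene on PATH and a local data file):
#   procs = launch_instances("pbmc3k.h5ad")
#   ...
#   for p in procs:
#       p.terminate()
```

Each process owns its port, so terminating one leaves the other running, which matches what I saw with the two bash sessions.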

nadove-ucsc commented 4 years ago

New questions:

  1. Why is loading tabula-muris.h5ad so fast?
  2. What is the performance impact of starting in backed mode?
  3. Does hosting on Heroku use backed mode?
  4. For @hannes-ucsc: besides Galaxy and Novartis, are there any other hosting efforts? (Sanger?)
nadove-ucsc commented 4 years ago
  1. Still uncertain. I discovered that the tabula-muris file is extremely repetitive, with only about 1% of the matrix entries representing unique values. However, when I copied the file and overwrote the matrix with random values, loading slowed down by only 30%. The file size also doubled for some reason, meaning that the discrepancy between file size and load time was actually more extreme for the random matrix.
nadove-ucsc commented 4 years ago
  3. It does not. See https://github.com/chanzuckerberg/cellxgene/blob/heroku/heroku.yml
nadove-ucsc commented 4 years ago
  2. I ran differential gene expression on cohorts of size 85,986 and 65,695. The non-backed instance took 3:38 and the backed instance took 4:21 (roughly a 20% slowdown). I couldn't find any other operations that took long enough for the time to be measurable, even with the largest dataset.
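
For reference, the slowdown figure follows from the two wall-clock times above. A short sketch of the arithmetic (the helper name is illustrative):

```python
def to_seconds(mmss):
    """Convert an 'm:ss' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


non_backed = to_seconds("3:38")  # 218 s
backed = to_seconds("4:21")      # 261 s

# Relative increase in run time when using backed mode.
slowdown_pct = (backed - non_backed) / non_backed * 100
print(f"{slowdown_pct:.1f}% slower in backed mode")
```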
nadove-ucsc commented 4 years ago

From standup: measure memory usage

nadove-ucsc commented 4 years ago

non-backed:
  peak usage while loading data: 13.8G
  peak usage during differential expression: 13.5G
  resting usage: 2917M

backed:
  peak usage during startup: 538M
  peak usage during differential expression: 13.2G
  resting usage: 203M
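
To put those figures side by side, here is a small sketch that computes the non-backed to backed ratio for each phase. It assumes that "G" and "M" are binary units (1G = 1024M); the phase labels and variable names are illustrative.

```python
# Figures copied from the measurements above, converted to MiB.
# Assumption: "G" and "M" are binary units (1G = 1024M).
GIB = 1024

usage_mib = {
    "startup peak": {"non-backed": 13.8 * GIB, "backed": 538},
    "differential expression peak": {"non-backed": 13.5 * GIB, "backed": 13.2 * GIB},
    "resting": {"non-backed": 2917, "backed": 203},
}

# Ratio > 1 means the non-backed instance uses more memory in that phase.
ratios = {phase: m["non-backed"] / m["backed"] for phase, m in usage_mib.items()}
for phase, ratio in ratios.items():
    print(f"{phase:<30} non-backed / backed = {ratio:5.1f}")
```

Under that assumption, backed mode needs roughly 26 times less memory at startup and 14 times less at rest, but the peak during differential expression is nearly the same in both modes.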