Goal
To enable users to readily start atlas-level analysis of Census for any or all cells of an organism.
User stories
Census currently provides access to single-cell data from hundreds of different datasets; however, cells from one dataset live in a different numerical space than cells from any other dataset.
Therefore, while users can access all of these data, they cannot readily start an analysis that answers scientific questions about cell biology.
Integration aligns all cells into a shared numerical space, enabling a multitude of user stories. Below is a selection of the most relevant stories that this project will fulfill.
I want to perform differential gene expression between any two groups of cells in Census.
I want to perform cell cluster analysis to understand underlying causes for similarities and differences across cell groups.
This is usually done directly with the embeddings. An example of this type of analysis is here.
I want to project my dataset onto the Census embeddings for inference workflows.
Embedding generation of non-Census data, cell type prediction, tissue prediction, cluster analysis.
I want to obtain an integrated expression matrix for a slice of Census data for downstream analysis.
I want to visualize cell clusters in a scatter plot for any slice of Census data.
Gene expression exploration on clusters, cluster analysis, etc.
Approach
We will accomplish this goal by providing scVI-based integrated embeddings in the Census SOMA data, along with the trained model.
As detailed in this document, for the first iteration of the project we need, at a high level, a manually triggered workflow that performs the following tasks:
An scVI model is trained on the latest LTS build.
[Stretch] An scVI model is fine-tuned or retrained on all weekly data.
The model is used to generate embeddings.
The embeddings are incorporated into, and published with, the Census SOMA object.
The model is accessible for download and re-use.
KRs
The LTS Census build, projected to release in November, has scVI latent spaces in obsm for both the mouse and human SOMA.Experiment.
The model associated with the Census build is available for download.
Notebooks that demonstrate user stories:
Accessing embeddings and making a UMAP plot on a slice of data, differential gene expression, getting a normalized gene expression matrix for a slice of Census, non-Census data projection, cell type prediction.
Assumptions and Risks
Assumptions
scVI produces biologically relevant models for Census data (discovery work ongoing).
scVI code for modeling is ready for production use with minimal re-engineering.
scVI training and tuning can be done within CZI’s computing constraints.
Risks
The models may not produce true biological integration, and thus may not be usable in the first iteration. Ongoing discovery work is addressing this.
The engineering work to accommodate or improve scVI is too large to be done by CZI or within one quarter.
Compute resources at CZI are not sufficient to train the model or perform the hyperparameter search.
Plan
Important notes about the plan
All the steps below are performed using the AnnData workflow, as it is the least risky path despite its compute intensity.
As a reach goal we’ll assess the possibility of using the Census PyTorch loaders.
All of the processes below should be CLI-driven, stand-alone Python scripts that execute the workflows. Usually the only input should be a YAML config that sets the parameters (e.g., the Census S3 URL, plus the other parameters of the workflow).
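A minimal sketch of that script shape, assuming hypothetical config keys such as `census_s3_url` and `n_latent` (the real schema would be defined per workflow):

```python
import argparse
import yaml

def load_config(path: str) -> dict:
    # The YAML file is the single input; the keys shown here are hypothetical.
    with open(path) as f:
        config = yaml.safe_load(f)
    config.setdefault("n_latent", 10)  # example default for an optional parameter
    return config

def main() -> None:
    parser = argparse.ArgumentParser(description="Run one Census scVI workflow step.")
    parser.add_argument("config", help="Path to the YAML parameter file")
    args = parser.parse_args()
    params = load_config(args.config)
    print(f"Running against Census at {params['census_s3_url']}")

# Entry-point guard omitted here; each script would be invoked as, e.g.:
#   python train_scvi.py params.yaml
```

Keeping all parameters in the YAML file (rather than individual CLI flags) makes the same script reusable across LTS and weekly builds by swapping config files.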
To create a process to train an scVI model across all unique human and mouse cells.
Pre-process data (e.g. select primary data, filter cells/genes, subset to highly variable genes).
Train a full scVI model and save for later use.
To create a process to generate and save embeddings from a trained model across all human and mouse cells
Do a forward pass of all cells through a pre-trained scVI model and extract embeddings (i.e., latent spaces).
Add the matrix of embeddings to the Census data under obsm["scvi"] (requires a schema change). Add model provenance, e.g., a scvi_summary data frame with relevant information about the run.
Perform lightweight QA to check for integrity of embeddings.
To create a process to save and expose the trained/fine-tuned model for API access
The model used for embedding generation is stored somewhere accessible by users via the API
The torch model should be saved via torch.save and be loadable via torch.load, mimicking scVI's save and load.
A notebook should demonstrate how to load and use the model, in particular how to obtain latent spaces for a user's own data.
[STRETCH] To create a process to find best hyper-parameters for an scVI model across all unique human and mouse cells.
Run the hyperparameter search and create a report. Automatically select the best combination of parameters.
[STRETCH] To create a process to fine-tune an scVI model across new unique human and mouse cells not previously seen by the model.
Pre-process newly added data (e.g. select primary data, filter cells/genes, subset to highly variable genes).
Fine-tune scVI model and save it.
Note: The need for fine-tuning should depend upon:
Whether the cost/time of doing a full training weekly is prohibitive.
Whether a fine-tuned model will improve materially with new weekly data.
[STRETCH] To assess/prototype usage of the Census PyTorch loaders for use in the training pipeline.
Provide a delineation of the work necessary to translate the AnnData workflow into the PyTorch workflow.