Goal
To enable users to readily start atlas-level analysis of Census for any or all cells of an organism.
User stories
Census currently provides access to single-cell data from hundreds of different datasets; however, cells from one dataset live in a different numerical space than cells from any other dataset.
Therefore, while users can access all of these data, they cannot readily start an analysis that answers scientific questions about cell biology.
Integration aligns all cells into a shared numerical space, enabling a multitude of user stories. Below is a selection of the most relevant stories that this project will fulfill.
I want to perform differential gene expression between any two groups of cells in Census.
I want to perform cell cluster analysis to understand underlying causes for similarities and differences across cell groups.
This is usually done directly with the embeddings. An example of this type of analysis is here.
I want to project my dataset onto the Census embeddings for inference workflows.
Embedding generation of non-Census data, cell type prediction, tissue prediction, cluster analysis.
I want to obtain an integrated expression matrix for a slice of Census data for downstream analysis.
I want to visualize cell clusters in a scatter plot for any slice of Census data.
Gene expression exploration on clusters, cluster analysis, etc.
Approach
We will accomplish this goal by providing scVI-based integrated embeddings in the Census SOMA data, along with the trained model.
As detailed in this document, for the first iteration of the project we need, at a high level, a manually triggered workflow that performs the following tasks:
An scVI model is trained on the latest LTS build.
[Stretch] An scVI model is fine-tuned or retrained on all weekly data.
The model is used to generate embeddings.
The embeddings are incorporated into, and published with, the Census SOMA object.
The model is accessible for download and re-use.
KRs
The LTS Census build, projected to release in November, has scVI latent spaces in obsm for both the mouse and human SOMA.Experiment.
The model associated with the Census build is available for download.
Notebooks that demonstrate user stories:
Accessing embeddings and making a UMAP plot on a slice of data, differential gene expression, getting a normalized gene expression matrix for a slice of Census, non-Census data projection, cell type prediction.
Assumptions and Risks
Assumptions
scVI produces biologically relevant models for Census data (discovery work ongoing).
scVI code for modeling is ready for production use with minimal re-engineering.
scVI training and tuning can be done within CZI’s computing constraints.
Risks
The models may not produce true biological integration, and thus may not be usable in the first iteration. Ongoing discovery work is addressing this.
The engineering work to accommodate or improve scVI is too large to be done by CZI or within one quarter.
Compute resources at CZI are not sufficient to train the model or perform the hyperparameter search.
Plan
Important notes about the plan
All the steps below are performed using the AnnData workflow, as it is the least risky path despite its compute intensity.
As a reach goal we’ll assess the possibility of using the Census PyTorch loaders.
All of the processes below should be CLI-driven, stand-alone Python scripts that execute the workflows. Usually the only input should be a YAML config that sets the parameters (e.g., the Census S3 URL, plus the other parameters of the workflow).
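A minimal sketch of that script shape, assuming hypothetical config keys such as `census_s3_url` and `n_latent` (the real schema would be defined per workflow):

```python
import argparse
import yaml

def load_config(path: str) -> dict:
    # The YAML file is the single input; the keys shown here are hypothetical.
    with open(path) as f:
        config = yaml.safe_load(f)
    config.setdefault("n_latent", 10)  # example default for an optional parameter
    return config

def main() -> None:
    parser = argparse.ArgumentParser(description="Run one Census scVI workflow step.")
    parser.add_argument("config", help="Path to the YAML parameter file")
    args = parser.parse_args()
    params = load_config(args.config)
    print(f"Running against Census at {params['census_s3_url']}")

# Entry-point guard omitted here; each script would be invoked as, e.g.:
#   python train_scvi.py params.yaml
```

Keeping all parameters in the YAML file (rather than individual CLI flags) makes the same script reusable across LTS and weekly builds by swapping config files.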
To create a process to train an scVI model across all unique human and mouse cells.
Pre-process data (e.g. select primary data, filter cells/genes, subset to highly variable genes).
Train a full scVI model and save for later use.
To create a process to generate and save embeddings from a trained model across all human and mouse cells
Do a forward pass of all cells through a pre-trained scVI model and extract embeddings (i.e., latent spaces).
Add the matrix of embeddings to the Census data under obsm["scvi"] (requires a schema change). Add model provenance, e.g., a scvi_summary data frame with relevant information about the run.
Perform lightweight QA to check for integrity of embeddings.
To create a process to save and expose the trained/fine-tuned model for API access
The model used for embedding generation is stored somewhere accessible by users via the API
The torch model should be saved via torch.save and be loadable via torch.load, mimicking scVI's save and load.
A notebook should demonstrate how to load and use the model, in particular how to obtain latent spaces for a user's own data.
[STRETCH] To create a process to find best hyper-parameters for an scVI model across all unique human and mouse cells.
Run the hyperparameter search and create a report. Automatically select the best combination of parameters.
[STRETCH] To create a process to fine-tune an scVI model across new unique human and mouse cells not previously seen by the model.
Pre-process newly added data (e.g. select primary data, filter cells/genes, subset to highly variable genes).
Fine-tune scVI model and save it.
Note: The need for fine-tuning should depend upon:
Whether the cost/time of doing a full training weekly is prohibitive.
Whether a fine-tuned model will improve materially with new weekly data.
[STRETCH] To assess/prototype usage of the Census PyTorch loaders for use in the training pipeline.
Provide a delineation of the work necessary to translate the AnnData workflow into the PyTorch workflow.