Goal

To enable users to readily start atlas-level analysis of Census for any or all cells of an organism.

User stories

Census currently provides access to single-cell data from hundreds of datasets, but cells from one dataset are in a different numerical space from cells in any other dataset.

Therefore, while users can access all of these data, they cannot readily start their analysis to answer scientific questions about cell biology.

Integration aligns the numerical space of all cells, enabling a multitude of user stories. Below is a selection of the most relevant stories that this project will fulfill.

I want to perform cell cluster analysis to understand underlying causes for similarities and differences across cell groups.
I want to project my dataset onto the Census embeddings for inference workflows.
I want to visualize cell clusters in a scatter plot for any slice of Census data.

Approach

We will accomplish the goal by providing Geneformer-based integrated embeddings in the Census SOMA data along with the fine-tuned Geneformer model.

As detailed in this document, the first iteration of the project requires, at a high level, a manually triggered workflow that performs the following tasks:

A Geneformer model is fine-tuned for cell classification on a cell metadata variable (e.g. cell type). This is to be done on the latest LTS build.
The fine-tuned model and/or pretrained model is used to generate embeddings.
The embeddings are incorporated into, and published with, the Census SOMA object.
The model is accessible for download and re-use.

KRs

The LTS Census build, projected to release in November, has Geneformer latent spaces in obsm for the human SOMA.Experiment.
The fine-tuned Geneformer model associated with the Census build is available for download.
Notebooks that demonstrate user stories:
- Accessing embeddings and making a UMAP plot on a slice of data, differential gene expression, getting a normalized gene expression matrix for a slice of Census, non-Census data projection, cell type prediction.

Assumptions and Risks

Assumptions

Geneformer produces biologically relevant models for Census data (discovery work ongoing).
Geneformer code for modeling is ready for production use with minimal re-engineering.
Geneformer training and tuning can be done within CZI’s computing constraints.

Risks

The models may not produce true biological integration, and thus they may not be usable in the first iteration. Associated discovery work is described elsewhere.
Compute resources at CZI may not be sufficient to fine-tune the model within a reasonable timeframe (<1 week per run).

Plan

Important notes about the plan

All the steps below are performed using a Hugging Face Dataset workflow.
All of the processes below should be CLI-driven, stand-alone Python scripts that execute the workflows. Usually the only input for these should be a YAML config that sets the params (e.g., Census S3 URL, plus the other params from the workflows).
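As an illustration of the CLI-plus-YAML convention described above, here is a minimal sketch of how one such script might read and validate its config. Every field name (`census_uri`, `output_dir`, `label_column`, `batch_size`) is a hypothetical placeholder, since the plan only says that the config carries the Census S3 URL plus workflow-specific params. The sketch assumes PyYAML for parsing.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WorkflowConfig:
    # Hypothetical fields: the plan does not define the config schema.
    census_uri: str
    output_dir: str
    label_column: str = "cell_type"
    batch_size: int = 32


def config_from_mapping(raw: dict) -> WorkflowConfig:
    """Validate a parsed YAML mapping and turn it into a typed config."""
    known = {f.name for f in fields(WorkflowConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    try:
        return WorkflowConfig(**raw)
    except TypeError as exc:  # raised when a required key is missing
        raise ValueError(f"invalid config: {exc}") from exc


def load_config(path: str) -> WorkflowConfig:
    """Read the YAML file at `path` (assumes PyYAML is installed)."""
    import yaml  # imported lazily so validation has no hard dependency

    with open(path, encoding="utf-8") as handle:
        return config_from_mapping(yaml.safe_load(handle) or {})


def main(argv=None) -> int:
    """Entry point: the YAML config path is the only command-line input."""
    parser = argparse.ArgumentParser(description="Run one workflow step.")
    parser.add_argument("config", help="path to the YAML config file")
    args = parser.parse_args(argv)
    config = load_config(args.config)
    print(f"Loaded config: {config}")
    return 0
```

A stand-alone script would call `main()` from its `__main__` guard. Rejecting unknown keys up front is a design choice that catches typos in the config before a long-running job starts.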

To create a process to fine-tune a Geneformer cell classifier model across all unique human cells

Pre-process data. Select primary data and produce a Geneformer-ready Hugging Face Dataset.
Fine-tune the Geneformer model and save it for later use.
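To make "Geneformer-ready" concrete: Geneformer represents each cell as a list of gene tokens ordered by normalized expression (a rank value encoding). The sketch below is a simplified pure-Python illustration of that idea, not the project's pre-processing code. The real workflow would presumably rely on the tokenizer shipped with Geneformer, and the function name, argument names, and tie-breaking rule here are assumptions.

```python
def rank_value_encode(counts, norm_factors, token_ids, max_len=2048):
    """Return token ids ordered by descending normalized expression.

    counts:       mapping gene id -> raw count for one cell
    norm_factors: mapping gene id -> positive per-gene normalization factor
    token_ids:    mapping gene id -> integer token
    """
    total = sum(counts.values())
    if total == 0:
        return []
    scored = []
    for gene, count in counts.items():
        # Skip unexpressed genes and genes missing from either lookup table.
        if count <= 0 or gene not in token_ids or gene not in norm_factors:
            continue
        # Scale by the cell's total counts, then by the per-gene factor.
        scored.append((count / total / norm_factors[gene], gene))
    # Highest score first; ties are broken by gene id so output is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [token_ids[gene] for _, gene in scored[:max_len]]


# Toy example with made-up gene ids and factors.
tokens = rank_value_encode(
    counts={"A": 10, "B": 5, "C": 0, "D": 5},
    norm_factors={"A": 4.0, "B": 1.0, "D": 5.0},
    token_ids={"A": 1, "B": 2, "C": 3, "D": 4},
)
print(tokens)  # [2, 1, 4]
```

Each encoded cell, together with its label (e.g. cell type), would then become one row of the Dataset used for fine-tuning.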

To create a process to generate and save embeddings from a fine-tuned and/or pretrained Geneformer model across all human cells

Do a forward pass of all cells through the fine-tuned Geneformer model and extract embeddings (i.e. latent spaces)
Add matrices of embeddings to Census data under obsm["geneformer_zero"] and obsm["geneformer_one"] (requires schema change). Add model provenance, e.g. a geneformer_summary data frame with relevant information about the run.
Perform lightweight QA to check for integrity of embeddings.
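The lightweight QA step could look something like the sketch below. The specific checks (row count, dimensionality, finite values, no all-zero rows) are assumptions about what "integrity" might mean here, and `check_embeddings` is a hypothetical helper, not part of Census or Geneformer.

```python
import math


def check_embeddings(embeddings, n_cells, n_dims):
    """Return a list of human-readable problems; an empty list means all checks passed.

    embeddings: sequence of rows, one row of floats per cell
    n_cells:    expected number of rows (cells passed through the model)
    n_dims:     expected embedding dimensionality
    """
    problems = []
    if len(embeddings) != n_cells:
        problems.append(f"expected {n_cells} rows, found {len(embeddings)}")
    for i, row in enumerate(embeddings):
        if len(row) != n_dims:
            problems.append(f"row {i}: expected {n_dims} values, found {len(row)}")
        elif not all(math.isfinite(v) for v in row):
            problems.append(f"row {i}: contains NaN or infinite values")
        elif not any(v != 0.0 for v in row):
            problems.append(f"row {i}: all zeros")
    return problems


# Toy example: one clean matrix and one with three different defects.
good = [[0.1, 0.2], [0.3, -0.4]]
bad = [[0.1, float("nan")], [0.0, 0.0], [1.0]]
print(check_embeddings(good, n_cells=2, n_dims=2))
for problem in check_embeddings(bad, n_cells=3, n_dims=2):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a single QA run report everything wrong with a build.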

To create a process to save and expose the fine-tuned Geneformer model for API access

The fine-tuned model used for embedding generation is stored in a location that users can access via the API.
The torch model should be saved via torch.save and be available for loading via torch.load.
A notebook should demonstrate how to load and use the model, in particular how to obtain latent spaces on a user’s data and do cell type inference.

[STRETCH] To create a process to find best hyper-parameters for Geneformer fine-tuning across all unique human cells

Run the hyperparameter search and create a report. Automatically select the best combination of parameters.
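A rough sketch of the search, report, and automatic selection steps follows, using an exhaustive grid in pure Python. A real run would likely use a dedicated tuning library and a much more expensive `evaluate` function (one fine-tuning run per combination). The parameter names and the scoring function below are placeholders.

```python
from itertools import product


def grid(search_space):
    """Yield every combination of the given parameter values as a dict."""
    names = sorted(search_space)
    for values in product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))


def run_search(search_space, evaluate):
    """Evaluate every combination and return (best_params, rows sorted by score)."""
    rows = [(params, evaluate(params)) for params in grid(search_space)]
    if not rows:
        raise ValueError("search space is empty")
    rows.sort(key=lambda row: row[1], reverse=True)  # higher score is better
    return rows[0][0], rows


def format_report(rows):
    """Render the sorted trials as a tab-separated text report."""
    lines = ["score\tparams"]
    lines += [f"{score:.4f}\t{params}" for params, score in rows]
    return "\n".join(lines)


def fake_evaluate(params):
    # Placeholder score; a real evaluation would fine-tune and validate a model.
    return params["epochs"] - abs(params["learning_rate"] - 5e-5) * 1000


space = {"learning_rate": [1e-5, 5e-5], "epochs": [1, 2]}
best, rows = run_search(space, fake_evaluate)
print(format_report(rows))
print("best:", best)
```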

Geneformer embeddings and models are incorporated into Census #585