chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.

Geneformer embeddings and models are incorporated into Census #585

Closed by pablo-gar 10 months ago

pablo-gar commented 1 year ago

Goal

To enable users to readily start atlas-level analysis of Census for any or all cells of an organism.

User stories

Census currently allows access to single-cell data from hundreds of different datasets, but cells from one dataset are in a different numerical space than cells from any other dataset.

Therefore, while users can access all of these data, they cannot readily start their analysis to answer scientific questions about cell biology.

Integration aligns the numerical space of all cells, enabling a multitude of user stories. Below is a selection of the most relevant stories that we will fulfill with this project.
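As a toy illustration of the problem described above, the sketch below simulates two datasets that share two cell types but differ by a dataset-specific offset. All values are synthetic, and per-dataset mean-centering is used only as a simple stand-in for integration; it is not the Geneformer approach proposed in this issue.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES = 20


def simulate_dataset(offset, n_per_type=50):
    """Return (matrix, labels) for two synthetic cell types plus a dataset offset."""
    type_means = np.zeros((2, N_GENES))
    type_means[1, :5] = 3.0  # cell type 1 differs from type 0 in five genes
    labels = np.repeat([0, 1], n_per_type)
    x = type_means[labels] + rng.normal(scale=0.5, size=(labels.size, N_GENES))
    return x + offset, labels


def same_type_neighbor_fraction(query, query_labels, ref, ref_labels):
    """Fraction of query cells whose nearest reference cell has the same type."""
    dists = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return float((ref_labels[nearest] == query_labels).mean())


# Dataset B carries a technical shift on the same genes that separate cell types.
shift = np.zeros(N_GENES)
shift[:5] = 3.0
a, a_labels = simulate_dataset(offset=np.zeros(N_GENES))
b, b_labels = simulate_dataset(offset=shift)

# Before alignment: cross-dataset neighbors are often the wrong cell type.
raw_fraction = same_type_neighbor_fraction(b, b_labels, a, a_labels)

# Stand-in "integration": center each dataset on its own mean.
a_centered = a - a.mean(axis=0)
b_centered = b - b.mean(axis=0)
aligned_fraction = same_type_neighbor_fraction(b_centered, b_labels, a_centered, a_labels)

print(f"same-type nearest neighbors, raw: {raw_fraction:.2f}")
print(f"same-type nearest neighbors, aligned: {aligned_fraction:.2f}")
```

Once the two datasets share a numerical space, nearest-neighbor relationships across datasets start to reflect cell type rather than dataset of origin, which is what the user stories rely on.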

Approach

We will accomplish this goal by providing Geneformer-based integrated embeddings in the Census SOMA data, along with the fine-tuned Geneformer model.

As detailed in this document, for the first iteration of the project we need to create a manually triggered workflow that carries out the following high-level tasks:

KRs

  1. The LTS Census build, projected to release in November, has Geneformer latent spaces in obsm for the human SOMA.Experiment.
  2. The fine-tuned Geneformer model associated with the Census build is available for download.
  3. Notebooks that demonstrate user stories:
    • Accessing embeddings and making a UMAP plot on a slice of data; differential gene expression; getting a normalized gene expression matrix for a slice of Census; projection of non-Census data; cell type prediction.
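To make the cell type prediction story more concrete, here is a minimal sketch of label transfer by k-nearest-neighbor voting in an embedding space. The embeddings, their dimensionality, and the labels are all synthetic placeholders; in practice the reference would be embeddings read from Census and the query would be projected non-Census cells. The function name `knn_predict` is hypothetical and not part of any Census API.

```python
import numpy as np


def knn_predict(ref_emb, ref_labels, query_emb, k=5):
    """Label each query cell by majority vote among its k nearest reference cells."""
    # Cosine similarity between L2-normalized embeddings.
    ref_n = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    qry_n = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = qry_n @ ref_n.T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    predictions = []
    for neighbors in top_k:
        values, counts = np.unique(ref_labels[neighbors], return_counts=True)
        predictions.append(values[counts.argmax()])
    return np.array(predictions)


# Synthetic 16-dimensional embeddings for two placeholder cell types.
rng = np.random.default_rng(1)
centers = np.zeros((2, 16))
centers[0, :8] = 3.0
centers[1, 8:] = 3.0
names = np.array(["T cell", "B cell"])

ref_idx = np.repeat([0, 1], 100)
ref_emb = centers[ref_idx] + rng.normal(size=(ref_idx.size, 16))
ref_labels = names[ref_idx]

query_idx = np.repeat([0, 1], 20)
query_emb = centers[query_idx] + rng.normal(size=(query_idx.size, 16))

predicted = knn_predict(ref_emb, ref_labels, query_emb, k=5)
accuracy = float((predicted == names[query_idx]).mean())
print(f"accuracy on synthetic query cells: {accuracy:.2f}")
```

This only works if reference and query cells live in the same integrated space, which is why the embeddings and the model need to be released together.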

Assumptions and Risks

Assumptions

Geneformer produces biologically relevant models for Census data (discovery work ongoing).

Geneformer code for modeling is ready for production use with minimal re-engineering.

Geneformer training and tuning can be done within CZI’s computing constraints.

Risks

The models may not produce true biological integration, and thus they may not be usable in the first iteration. Associated discovery work is described elsewhere.

Compute resources at CZI may not be sufficient to fine-tune the model within a reasonable timeframe (<1 week per run).

Plan

Important notes about the plan

To create a process to fine-tune a Geneformer cell classifier model across all unique human cells

To create a process to generate and save embeddings from a fine-tuned and/or pretrained Geneformer model across all human cells
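One possible shape for the embedding-generation step is a batched loop that writes directly into an on-disk array, so that memory use stays bounded regardless of the number of cells. The sketch below assumes this design; `embed_batch` is a hypothetical stand-in (a fixed linear projection) for a real model forward pass, and the `.npy` output format is chosen only for illustration.

```python
import os
import tempfile

import numpy as np


def embed_batch(batch, projection):
    """Hypothetical stand-in for a model forward pass: a fixed linear projection."""
    return batch @ projection


def embed_and_save(expression, projection, out_path, batch_size=256):
    """Embed cells batch by batch and write the result to a .npy file on disk."""
    n_cells = expression.shape[0]
    out = np.lib.format.open_memmap(
        out_path,
        mode="w+",
        dtype=np.float32,
        shape=(n_cells, projection.shape[1]),
    )
    for start in range(0, n_cells, batch_size):
        stop = min(start + batch_size, n_cells)
        out[start:stop] = embed_batch(expression[start:stop], projection)
    out.flush()
    del out  # close the memory map
    return out_path


# Synthetic input: 1,000 cells by 50 genes, embedded into 8 dimensions.
rng = np.random.default_rng(2)
expression = rng.random((1000, 50), dtype=np.float32)
projection = rng.normal(size=(50, 8)).astype(np.float32)

out_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
embed_and_save(expression, projection, out_path)

saved = np.load(out_path, mmap_mode="r")
print(saved.shape)
```

Row order in the output matches row order in the input, so the saved array can later be joined back to cell identifiers by position.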

To create a process to save and expose the fine-tuned Geneformer model for API access

[STRETCH] To create a process to find best hyper-parameters for Geneformer fine-tuning across all unique human cells
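For the stretch goal, the simplest search strategy would be an exhaustive grid over a few candidate values. The sketch below assumes that approach; the parameter names (`learning_rate`, `num_epochs`), the candidate values, and `placeholder_score` are all hypothetical. A real run would replace `placeholder_score` with a function that fine-tunes the model and returns a validation metric.

```python
import itertools
import math


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    keys = sorted(param_grid)
    best_params, best_score = None, -math.inf
    for values in itertools.product(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def placeholder_score(params):
    """Hypothetical objective; it does not reflect any real training result."""
    lr_penalty = abs(math.log10(params["learning_rate"]) + 4)
    epoch_penalty = 0.1 * abs(params["num_epochs"] - 2)
    return -(lr_penalty + epoch_penalty)


param_grid = {
    "learning_rate": [1e-5, 1e-4, 1e-3],
    "num_epochs": [1, 2, 3],
}
best_params, best_score = grid_search(param_grid, placeholder_score)
print(best_params)
```

Because each evaluation is a full fine-tuning run, the size of the grid is bounded in practice by the per-run compute budget noted under Risks.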

pablo-gar commented 10 months ago

completed