chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

A stand-alone process that injects embeddings and their metadata into a Census object. #854

Closed pablo-gar closed 10 months ago

pablo-gar commented 11 months ago

Requirements

The process:

To be defined by @ebezzi; please coordinate with @bkmartinjr.

For the embedding matrix:

Embedding metadata

The embedding metadata should be stored as a SOMADataFrame with two columns:

| Column | Encoding | Description |
|---|---|---|
| label | string | Human-readable label of the metadata variable |
| value | string | Value associated with the metadata variable |

This SOMADataFrame MUST have the following rows:

  1. Name of the model used:
    1. label: "model"
    2. value: the name of the model
  2. obsm ID:
    1. label: "obsm_id"
    2. value: the ID in obsm
  3. Organisms:
    1. label: "organisms"
    2. value: a comma-separated list of organisms for which the embeddings are available.
  4. Trained model location:
    1. label: "model_location"
    2. value: a URI or HTTP link to the location of the trained model
  5. Link to model documentation:
    1. label: "doccumentation_link"
    2. value: a link to a documentation page
  6. Training parameters:
    1. label: "training_details"
    2. value: a long-text description of the training and embedding generation details of the model.

An example of this SOMADataFrame is shown below:

| label | value |
|---|---|
| model | scVI |
| obsm_id | scvi |
| organisms | homo_sapiens, mus_musculus |
| model_location | s3://cellxgene-census-public-us-west-2/cell-census/2023-10-23/models/scvi/ |
| doccumentation_link | https://scvi-tools.org/ |
| training_details | scVI was trained with 8000 highly-variable genes with batch defined as [...] and the following parameters: n_hidden=512, n_latent=200, [...]. Then embeddings were obtained for all cells by [...] |

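
For illustration only, the required rows could be checked before they are written to the Census object. The sketch below is plain Python and does not touch the SOMA API. The names `REQUIRED_LABELS`, `validate_embedding_metadata` and `parse_organisms` are hypothetical and not part of cellxgene-census. Rejecting duplicate labels is an assumption; the requirements above only say which rows must exist. The labels are spelled exactly as they are written in this issue.

```python
# Labels that the requirements say MUST be present, spelled as written in the issue.
REQUIRED_LABELS = (
    "model",
    "obsm_id",
    "organisms",
    "model_location",
    "doccumentation_link",
    "training_details",
)


def validate_embedding_metadata(rows):
    """Check rows shaped like {"label": str, "value": str}.

    Returns a dict mapping label -> value, or raises ValueError.
    """
    metadata = {}
    for row in rows:
        # The spec defines exactly two string columns: label and value.
        if set(row) != {"label", "value"}:
            raise ValueError(f"row must have exactly 'label' and 'value': {row!r}")
        label, value = row["label"], row["value"]
        if not isinstance(label, str) or not isinstance(value, str):
            raise ValueError(f"label and value must both be strings: {row!r}")
        # Assumption: each label appears at most once.
        if label in metadata:
            raise ValueError(f"duplicate label: {label!r}")
        metadata[label] = value

    missing = [name for name in REQUIRED_LABELS if name not in metadata]
    if missing:
        raise ValueError(f"missing required labels: {missing}")
    return metadata


def parse_organisms(value):
    """Split the comma-separated 'organisms' value into a list of names."""
    return [name.strip() for name in value.split(",") if name.strip()]


# The example rows from this issue; elided parts ("[...]") are kept as-is.
example_rows = [
    {"label": "model", "value": "scVI"},
    {"label": "obsm_id", "value": "scvi"},
    {"label": "organisms", "value": "homo_sapiens, mus_musculus"},
    {
        "label": "model_location",
        "value": "s3://cellxgene-census-public-us-west-2/cell-census/2023-10-23/models/scvi/",
    },
    {"label": "doccumentation_link", "value": "https://scvi-tools.org/"},
    {
        "label": "training_details",
        "value": "scVI was trained with 8000 highly-variable genes with batch "
        "defined as [...] and the following parameters: n_hidden=512, "
        "n_latent=200, [...]. Then embeddings were obtained for all cells by [...]",
    },
]

metadata = validate_embedding_metadata(example_rows)
organisms = parse_organisms(metadata["organisms"])
```

A check like this could run as the first step of the stand-alone process, so that an incomplete metadata table fails early instead of being injected alongside the embedding matrix.
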
ebezzi commented 10 months ago

Closing this ticket as this was achieved as part of the census-models release, even if only as a one-time operation. In the future we'll revisit this with more automation.