Requirements
The process:
Should take metadata, in a format to be defined by @ebezzi, and inject it into a SOMADataFrame at census["census_info"][ID], where ID is a string identifier for the embeddings. Suggested IDs for prospective embeddings are "scvi" and "geneformer_cell_subclass". For the required values, see the "Embedding metadata" section below.
Should take embeddings, in a format to be defined by @ebezzi (recommendation: follow Bruce's community-contributed formats), and inject them into a SOMA SparseNDArray at census["census_data"][organism].ms["RNA"].obsm[ID], where organism is either "homo_sapiens" or "mus_musculus" and ID is as defined above.
Should be able to deposit the trained model and its assets into
Should validate that the embeddings were properly injected. Please coordinate with @bkmartinjr to devise a good validation strategy (e.g. no rows were scrambled, the shape is maintained, and data entropy, while it may be reduced, keeps its order).
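The kind of check described in the last requirement could look roughly like the sketch below. It assumes the source embeddings are available as a mapping from cell ID (e.g. soma_joinid) to vector, and that the injected sparse array can be read back as (row, column, value) triplets. The function name validate_injection, the tolerance, and the toy data are hypothetical and only for illustration; the real strategy is still to be agreed with @bkmartinjr.

```python
import math
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[int, int, float]  # (cell ID, embedding dimension, value)


def validate_injection(
    source: Dict[int, List[float]],
    stored: Iterable[Triplet],
    n_features: int,
    rel_tol: float = 1e-3,  # hypothetical tolerance for reduced precision
) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems: List[str] = []

    # Rebuild dense rows from the triplets; absent coordinates read as 0.0.
    rebuilt: Dict[int, List[float]] = {}
    for row, col, value in stored:
        if not 0 <= col < n_features:
            problems.append(f"column {col} out of range for row {row}")
            continue
        rebuilt.setdefault(row, [0.0] * n_features)[col] = value

    # Shape check: the stored array must not contain rows the source lacks.
    extra = sorted(set(rebuilt) - set(source))
    if extra:
        problems.append(f"unexpected rows: {extra}")

    for row, expected in source.items():
        if len(expected) != n_features:
            problems.append(f"source row {row} has {len(expected)} features")
            continue
        actual = rebuilt.get(row, [0.0] * n_features)
        # Scrambling check: each cell must carry its own vector, up to the
        # tolerance allowed for reduced numerical precision.
        if not all(
            math.isclose(a, e, rel_tol=rel_tol, abs_tol=rel_tol)
            for a, e in zip(actual, expected)
        ):
            problems.append(f"row {row} does not match its source vector")
    return problems


# Toy data: two cells, two embedding dimensions.
source = {10: [0.5, -1.25], 11: [2.0, 0.0]}
good = [(10, 0, 0.5), (10, 1, -1.25), (11, 0, 2.0)]
swapped = [(11, 0, 0.5), (11, 1, -1.25), (10, 0, 2.0)]  # rows exchanged

print(validate_injection(source, good, n_features=2))
print(validate_injection(source, swapped, n_features=2))
```

Comparing cell by cell catches swapped rows even when the overall shape is unchanged, which a shape-only check would miss. An entropy-based check is not covered here.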
To be defined by @ebezzi; please coordinate with @bkmartinjr. For the embedding matrix:
TileDB schema: follow similar decisions as for the community-contributed embeddings.
Numerical precision: follow similar decisions as for the community-contributed embeddings.
Embedding metadata
The embedding metadata should be stored as a SOMADataFrame with two columns:
| Column | Encoding | Description |
| --- | --- | --- |
| label | string | Human-readable label of the metadata variable |
| value | string | Value associated with the metadata variable |
This SOMADataFrame MUST have the following rows:
Name of the model used:
  label: "model"
  value: the name of the model
obsm ID:
  label: "obsm_id"
  value: the ID in obsm
Organisms:
  label: "organisms"
  value: a comma-separated list of the organisms for which the embeddings are available
Trained model location:
  label: "model_location"
  value: a URI or HTTP link to where the trained model is located
Link to model documentation:
  label: "doccumentation_link"
  value: a link to a documentation page
Training parameters:
  label: "training_details"
  value: a long-text description of the training and embedding generation details of the model, for example:
    scVI was trained with 8000 highly-variable genes with batch defined as [...] and the following parameters: n_hidden=512, n_latent=200, [...]. Then embeddings were obtained for all cells by [...]
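The required rows could be assembled and checked before injection along the lines of the sketch below. It uses plain Python dictionaries as a stand-in for the two-column table. The helper name build_metadata_rows, the example URLs, and the placeholder text are hypothetical. The labels are copied exactly as they are spelled in the list above.

```python
from typing import Dict, List

# Required labels, spelled exactly as in the requirements above.
REQUIRED_LABELS = (
    "model",
    "obsm_id",
    "organisms",
    "model_location",
    "doccumentation_link",
    "training_details",
)
KNOWN_ORGANISMS = {"homo_sapiens", "mus_musculus"}


def build_metadata_rows(values: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn a label -> value mapping into two-column rows, one row per label."""
    missing = [label for label in REQUIRED_LABELS if label not in values]
    if missing:
        raise ValueError(f"missing required labels: {missing}")
    for label, value in values.items():
        # Both columns are string-encoded, so reject anything else early.
        if not isinstance(value, str):
            raise TypeError(f"value for {label!r} must be a string")
    organisms = [o.strip() for o in values["organisms"].split(",")]
    unknown = [o for o in organisms if o not in KNOWN_ORGANISMS]
    if unknown:
        raise ValueError(f"unknown organisms: {unknown}")
    return [{"label": label, "value": value} for label, value in values.items()]


# Placeholder values for illustration only; none of these are real locations.
example = {
    "model": "scvi",
    "obsm_id": "scvi",
    "organisms": "homo_sapiens, mus_musculus",
    "model_location": "https://example.org/models/scvi",
    "doccumentation_link": "https://example.org/docs/scvi",
    "training_details": "placeholder description of training and embedding generation",
}

rows = build_metadata_rows(example)
for row in rows:
    print(row["label"], "->", row["value"])
```

The resulting list of rows maps directly onto the label and value columns, so it could be converted to whatever tabular type the final injection format ends up requiring.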
Closing this ticket as this was achieved as part of the census-models release, even if only as a one-time operation. In the future we'll revisit this with more automation.