A stand-alone process that injects embeddings and their metadata into a Census object.

Requirements

The process:

Should take metadata in a format to be defined by @ebezzi and inject it into a SOMA DataFrame at census["census_info"][ID], where ID is a string identifier for the embeddings. Suggested IDs for prospective embeddings are "scvi" and "geneformer_cell_subclass". For required values, see the section "Embedding Metadata requirements".
Should take embeddings in a format to be defined by @ebezzi (recommendation: follow Bruce's community-contributed formats) and inject them into a SOMA SparseNDArray at census["census_data"][organism].ms["RNA"].obsm[ID], where organism is either "homo_sapiens" or "mus_musculus" and ID is as defined above.
Should be able to deposit the trained model and its assets into
Should validate that the embeddings were properly injected. Please coordinate with @bkmartinjr to devise a good validation strategy (e.g. no rows were scrambled, the shape is maintained, and the order is maintained even though data entropy is reduced).
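
One way to approach the validation described above is to read the stored array back as coordinate triples and compare it with the source matrix. The sketch below is a minimal, NumPy-only illustration and not an agreed strategy: the function name `validate_embeddings`, its arguments, and the assumption that source rows are keyed by `soma_joinid` are all hypothetical, and the triples stand in for whatever a read of the SparseNDArray returns.

```python
import numpy as np


def validate_embeddings(source, joinids, dim0, dim1, data, rtol=0.0, atol=0.0):
    """Compare a source embedding matrix with COO triples read back from storage.

    source  : (n_cells, n_features) matrix that was meant to be injected
    joinids : hypothetical soma_joinid of the cell in each row of `source`
    dim0, dim1, data : coordinates and values read back from the stored array
    Returns a list of problems; an empty list means every check passed.
    """
    source = np.asarray(source, dtype=np.float32)
    joinids = np.asarray(joinids, dtype=np.int64)
    dim0 = np.asarray(dim0, dtype=np.int64)
    dim1 = np.asarray(dim1, dtype=np.int64)
    data = np.asarray(data, dtype=np.float32)
    n_cells, n_features = source.shape

    if joinids.shape != (n_cells,) or np.unique(joinids).size != n_cells:
        return ["joinids must contain one unique ID per source row"]

    problems = []
    # Shape check: every stored coordinate must map to a known cell and feature.
    order = np.argsort(joinids)
    sorted_ids = joinids[order]
    pos = np.minimum(np.searchsorted(sorted_ids, dim0), n_cells - 1)
    if not (sorted_ids[pos] == dim0).all():
        problems.append("stored rows reference joinids absent from the source")
    if not ((dim1 >= 0) & (dim1 < n_features)).all():
        problems.append("stored columns fall outside the embedding width")
    if problems:
        return problems

    # Rebuild a dense matrix in source row order; absent coordinates stay 0.
    stored = np.zeros_like(source)
    stored[order[pos], dim1] = data

    # Scramble check: each stored row must match the source row with the same joinid.
    row_ok = np.isclose(stored, source, rtol=rtol, atol=atol).all(axis=1)
    bad_rows = np.flatnonzero(~row_ok)
    if bad_rows.size:
        problems.append(
            f"{bad_rows.size} row(s) differ from the source, "
            f"first at joinid {joinids[bad_rows[0]]}"
        )
    return problems


# Toy data: 4 cells, 3 embedding dimensions, non-contiguous joinids.
rng = np.random.default_rng(0)
source = rng.normal(size=(4, 3)).astype(np.float32)
joinids = np.array([10, 3, 7, 5])
dim0 = np.repeat(joinids, 3)
dim1 = np.tile(np.arange(3), 4)

ok = validate_embeddings(source, joinids, dim0, dim1, source.ravel())
scrambled = validate_embeddings(source, joinids, dim0, dim1, source[[1, 0, 2, 3]].ravel())
```

With exact tolerances the check only passes for a lossless round trip; if precision is deliberately reduced before storage, `rtol` or `atol` would need to be set to match that choice.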

To be defined by @ebezzi; please coordinate with @bkmartinjr. For the embedding matrix:

TileDB schema: follow decisions similar to those made for community-contributed embeddings.
Numerical precision: follow decisions similar to those made for community-contributed embeddings.
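
The precision decision is deferred to the community-contributed conventions, which this document does not spell out. As an illustration only, one technique sometimes used to make float32 data compress better is to zero the low mantissa bits; because truncation toward zero is monotonic, the relative order of values is never inverted, although ties can appear. The helper name `truncate_mantissa` and the choice of 7 kept bits are assumptions for this sketch, not values taken from the community-contributed pipeline.

```python
import numpy as np


def truncate_mantissa(x, keep_bits):
    """Zero the low (23 - keep_bits) mantissa bits of float32 values.

    This truncates toward zero, so for a <= b the results also satisfy
    truncate(a) <= truncate(b): ordering is preserved up to ties.
    """
    if not 0 <= keep_bits <= 23:
        raise ValueError("float32 has 23 explicit mantissa bits")
    x = np.ascontiguousarray(x, dtype=np.float32)
    drop = 23 - keep_bits
    # Mask keeps the sign, the exponent, and the top `keep_bits` mantissa bits.
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    return (x.view(np.uint32) & mask).view(np.float32)


# Hypothetical example: keep 7 mantissa bits on an increasing sequence.
values = np.linspace(-3, 3, 1001, dtype=np.float32)
reduced = truncate_mantissa(values, keep_bits=7)
```

Whatever precision is finally chosen, the same transformation would have to be applied to the source matrix before it is compared with the stored one during validation.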

Embedding Metadata requirements

The embedding metadata should be stored as a SOMADataFrame with two columns:

Column	Encoding	Description
label	string	Human-readable label of the metadata variable
value	string	Value associated with the metadata variable

This SOMADataFrame MUST have the following rows:

Name of the model used:
1. label: "model"
2. value: the name of the model
obsm ID:
1. label: "obsm_id"
2. value: the ID in obsm
Organisms:
1. label: "organisms"
2. value: a comma-separated list of the organisms for which the embeddings are available.
Trained model location:
1. label: "model_location"
2. value: a URI or HTTP link to where the trained model is located
Link to model documentation:
1. label: "documentation_link"
2. value: a link to a documentation page
Training parameters:
1. label: "training_details"
2. value: a long-text description of the training and embedding generation details of the model.

An example of this SOMADataFrame is shown below:

label	value
model	scVI
obsm_id	scvi
organisms	homo_sapiens, mus_musculus
model_location	s3://cellxgene-census-public-us-west-2/cell-census/2023-10-23/models/scvi/
documentation_link	https://scvi-tools.org/
training_details	scVI was trained with 8000 highly-variable genes with batch defined as [...] and the following parameters: n_hidden=512, n_latent=200, [...]. Then embeddings were obtained for all cells by [...]
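
The required rows can be checked before anything is written. The sketch below uses only the standard library and is an assumption about how such a check might look, not part of the agreed design: `build_metadata_rows`, `REQUIRED_LABELS`, and `KNOWN_ORGANISMS` are hypothetical names, and the elided `[...]` in the training details is kept exactly as it appears in the example.

```python
from __future__ import annotations

# Labels every metadata table must contain (hypothetical constant name).
REQUIRED_LABELS = (
    "model",
    "obsm_id",
    "organisms",
    "model_location",
    "documentation_link",
    "training_details",
)
# Organisms named in the requirements for the obsm location.
KNOWN_ORGANISMS = {"homo_sapiens", "mus_musculus"}


def build_metadata_rows(values: dict[str, str]) -> list[dict[str, str]]:
    """Validate metadata and return it as label/value rows, one dict per row."""
    missing = [label for label in REQUIRED_LABELS if not values.get(label)]
    if missing:
        raise ValueError(f"missing required labels: {', '.join(missing)}")

    # "organisms" is a comma-separated list; tolerate spaces after commas.
    organisms = [name.strip() for name in values["organisms"].split(",")]
    unknown = sorted(set(organisms) - KNOWN_ORGANISMS)
    if unknown:
        raise ValueError(f"unknown organisms: {', '.join(unknown)}")

    return [{"label": label, "value": str(value)} for label, value in values.items()]


rows = build_metadata_rows(
    {
        "model": "scVI",
        "obsm_id": "scvi",
        "organisms": "homo_sapiens, mus_musculus",
        "model_location": "s3://cellxgene-census-public-us-west-2/cell-census/2023-10-23/models/scvi/",
        "documentation_link": "https://scvi-tools.org/",
        "training_details": "scVI was trained with 8000 highly-variable genes [...]",
    }
)
```

The resulting list of label/value rows maps directly onto the two string columns of the table above; how it is converted and written to the SOMADataFrame is left to the format decisions still to be made.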