Closed dosumis closed 4 months ago
Most annotation transfer tools transfer to single cells, giving a confidence score per cell.
I think they all transfer to individual cells, with predictions (i.e. cell type labels) and prediction scores per cell.
Each "prediction" is at the "cell annotation set" (labelset) level though.
obs['CellTypist']: "gamma delta T-Cell"
obs['CellTypist--conf-score']: 0.955991
obs['CellTypist--cell_ontology_term']: "gamma-delta T Cell"
obs['CellTypist--cell_ontology_id']: "CL:0000798"
uns['cell_annotation_schema']:
{ "labelsets": [ { "name" : "CellTypist", "type": "automated", "algorithm": "CellTypist", "algorithm_url": "https://celltypist.org", "model": "Immune_All_Low.pkl", "majority_voting": True } ] }
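The suggested layout could hypothetically be populated like this, using a pandas DataFrame as a stand-in for `adata.obs` and a plain dict for `adata.uns` (only the first row's score comes from the example above; the other cells and values are made up for illustration):

```python
import json
import pandas as pd

# Stand-in for adata.obs: per-cell predictions at the "CellTypist" labelset level.
obs = pd.DataFrame(index=["cell_1", "cell_2", "cell_3"])
obs["CellTypist"] = ["gamma delta T-Cell"] * 3
obs["CellTypist--conf-score"] = [0.955991, 0.87, 0.91]
obs["CellTypist--cell_ontology_term"] = ["gamma-delta T Cell"] * 3
obs["CellTypist--cell_ontology_id"] = ["CL:0000798"] * 3

# Stand-in for adata.uns: labelset-level metadata stored once, not per cell.
uns = {
    "cell_annotation_schema": {
        "labelsets": [
            {
                "name": "CellTypist",
                "type": "automated",
                "algorithm": "CellTypist",
                "algorithm_url": "https://celltypist.org",
                "model": "Immune_All_Low.pkl",
                "majority_voting": True,
            }
        ]
    }
}

# The "--"-suffixed columns are the tool-specific extensions outside the core schema.
extra_keys = [c for c in obs.columns if "--" in c]
print(extra_keys)
print(json.dumps(uns["cell_annotation_schema"]["labelsets"][0]))
```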
I'm fine with this, given the assumption that CellTypist
is the name of the cell annotation set (labelset).
Each "prediction" is at the "cell annotation set" (labelset) level though.
I'm not sure I follow. The predictions define new cell sets (labelset + value).
The main issue is that individual cells may have multiple, potentially mutually incompatible predictions. My understanding is that these are used collectively (with scores) to judge the final annotation of cell sets in BICAN, built up out of a fixed set of clusters. The final annotation and its associated ontology term may again differ from the predictions. Surfacing all of these as obs is potentially very confusing to downstream users, even though they are valuable as evidence for the final annotation.
Test implementations:
Discussion from sprint call.
Issue - not all tools have sufficiently rich output.
Needs a more complete functional specification for what tools need to provide.
Lydia:
Kyle:
The predictions define new cell sets (labelset + value).
Agreed
The main issue is that individual cells may have multiple, potentially mutually incompatible predictions. My understanding is that these are used collectively (with scores) to judge the final annotation of cell sets in BICAN, built up out of a fixed set of clusters. The final annotation and its associated ontology term may again differ from the predictions. Surfacing all of these as obs is potentially very confusing to downstream users, even though they are valuable as evidence for the final annotation.
It sounds like it's then a new "cell annotation set/labelset". Probably only the final evidence will be useful for users.
Surfacing all of these as obs is potentially very confusing to downstream users
If it's not in the AnnData file (or there's not tooling to port it into the AnnData file), I promise you no computational biologist will use it.
It sounds to me like much of this is an intermediate result which doesn't need to be placed into the "final" AnnData.
More mockups - using the results of an annotation transfer with MapMyCells - storing confidence alongside cell_ids in place of storing a bare list of cell_ids. Bloat is a bit of an issue. OTOH - JSON compresses very well.
{
"dataset": "cxg_dataset:bf6a5c78-5a2e-4e34-93f3-7be5d127d879",
"labelsets": [
{
"name": "MapMyCells_10x_mouse_subclass",
"description": "Annotation transfer from 10X whole mouse brain CCN20230722 using Map My Cells.",
"annotation_method": "algorithmic",
"automated_annotation": {
"algorithm_name": "Map My Cells - hierarchical mapping",
"source_taxonomy": "CCN20230722"
}
}
],
"annotations": [
{
"labelset": "MapMyCells_10x_mouse_subclass",
"cell_label": "004 L6 IT CTX Glut",
"hash_accession": "234d8d6eb87f", // using blake2b on cell IDs
"annotation_transfer": {
"source_accession": "CS20230722_SUBC_004"
},
"cells": [ // Rather than store cell_ids we could store cell_id dicts for metadata that only makes sense at cell level.
{
"cell_id": "20171204_sample_4",
"confidence": 1.0
},
{
"cell_id": "20171207_sample_7",
"confidence": 1.0
},
{
"cell_id": "20180102_sample_2",
"confidence": 1.0
}
...
]
}
]
}
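A minimal sketch of the `hash_accession` idea, assuming blake2b is run over the sorted cell IDs with a 6-byte (12 hex character) digest - both assumptions, since the mockup doesn't specify digest size or ordering:

```python
import hashlib

# Per-cell dicts as in the mockup: cell_id paired with a per-cell confidence.
cells = [
    {"cell_id": "20171204_sample_4", "confidence": 1.0},
    {"cell_id": "20171207_sample_7", "confidence": 1.0},
    {"cell_id": "20180102_sample_2", "confidence": 1.0},
]

def hash_accession(cell_dicts, digest_size=6):
    """Short, order-independent blake2b digest over a cell set's cell IDs.

    Sorting first makes the accession stable under reordering of the list;
    digest_size=6 yields a 12-character hex string like the mockup's.
    """
    h = hashlib.blake2b(digest_size=digest_size)
    for cid in sorted(c["cell_id"] for c in cell_dicts):
        h.update(cid.encode("utf-8"))
    return h.hexdigest()

accession = hash_accession(cells)
print(accession)  # 12 hex characters
```

One design consequence: because the accession depends only on cell membership, it survives re-serialization and list reordering but changes whenever the cell set itself changes.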
Most annotation transfer tools transfer to single cells, giving a confidence score per cell. Supporting this requires working at the single cell (obs) level.
Examples:
CellTypist adds the following obs:
https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial.ipynb
CellTypist terms are also mapped to CL - some exactly:
e.g. https://www.celltypist.org/encyclopedia/Immune/v2/?celltype=gamma-delta%20T%20cells
Suggested schema compliance:
obs['CellTypist']: "gamma delta T-Cell"
obs['CellTypist--conf-score']: 0.955991
obs['CellTypist--cell_ontology_term']: "gamma-delta T Cell"
obs['CellTypist--cell_ontology_id']: "CL:0000798"
uns['cell_annotation_schema']: { "labelsets": [ { "name" : "CellTypist", "type": "automated", "algorithm": "CellTypist", "algorithm_url": "https://celltypist.org", "model": "Immune_All_Low.pkl", "majority_voting": True } ] }
Note there are 2 keys from outside the schema, used to encode tool-specific information.
Challenge - can we keep the JSON Schema validation model in this case? The only way to express this in JSON Schema would be as an object directly linking each cell ID with a conf-score key/value pair.
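One possible way to keep JSON Schema validation here (an assumption, not an adopted solution) is to model `cells` as an array of `{cell_id, confidence}` objects rather than an object keyed by cell ID. The fragment below sketches such a subschema together with a minimal hand-rolled check, standing in for a full validator such as the `jsonschema` package:

```python
# Candidate subschema for the "cells" field of an annotation (draft sketch).
cells_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["cell_id"],
        "properties": {
            "cell_id": {"type": "string"},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "additionalProperties": False,
    },
}

def valid_cells(cells):
    """Minimal check mirroring cells_schema (not a full JSON Schema validator)."""
    if not isinstance(cells, list):
        return False
    for c in cells:
        if not isinstance(c, dict) or "cell_id" not in c:
            return False  # bare cell_id strings are rejected
        if not isinstance(c["cell_id"], str):
            return False
        if "confidence" in c:
            conf = c["confidence"]
            if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
                return False
        if set(c) - {"cell_id", "confidence"}:
            return False  # additionalProperties: false
    return True

print(valid_cells([{"cell_id": "20171204_sample_4", "confidence": 1.0}]))  # True
print(valid_cells(["20171204_sample_4"]))  # False - bare IDs carry no confidence
```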