cellannotation / cell-annotation-schema

General, open-standard schema for cell annotations

Support annotation transfer provenance on the single cell level #61

Closed: dosumis closed this issue 4 months ago

dosumis commented 10 months ago

Most annotation transfer tools transfer to single cells, giving a confidence score per cell. Supporting this requires working at the single cell (obs) level.

Examples:

CellTypist adds the following obs:

[image; source: https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial.ipynb]

CellTypist terms are also mapped to Cell Ontology (CL) terms, some of them exactly, e.g.:

[image; source: https://www.celltypist.org/encyclopedia/Immune/v2/?celltype=gamma-delta%20T%20cells]

Suggested schema compliance:

obs['CellTypist']: "gamma delta T-Cell"
obs['CellTypist--conf-score']: 0.955991
obs['CellTypist--cell_ontology_term']: "gamma-delta T Cell"
obs['CellTypist--cell_ontology_id']: "CL:0000798"

uns['cell_annotation_schema']:
{
  "labelsets": [
    {
      "name": "CellTypist",
      "type": "automated",
      "algorithm": "CellTypist",
      "algorithm_url": "https://celltypist.org",
      "model": "Immune_All_Low.pkl",
      "majority_voting": True
    }
  ]
}

Note that two of these keys come from outside the schema and are used to encode tool-specific information.
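To make the suggested obs column naming concrete, here is a minimal sketch that turns per-cell predictions into columns following the `<labelset>--<field>` pattern shown above. The helper name `prediction_to_obs_columns`, the input dict layout, and the cell ID are hypothetical and only for illustration; nothing here is part of CellTypist or of the schema.

```python
# Hypothetical helper: the "<labelset>--<field>" column names follow the
# suggestion in this issue; the input layout is an assumption.

def prediction_to_obs_columns(labelset, predictions):
    """Turn per-cell predictions into obs-style columns.

    `predictions` maps cell_id -> dict with the keys 'label', 'conf_score',
    'cell_ontology_term' and 'cell_ontology_id'.
    Returns {column_name: {cell_id: value}}.
    """
    field_to_column = {
        "label": labelset,
        "conf_score": f"{labelset}--conf-score",
        "cell_ontology_term": f"{labelset}--cell_ontology_term",
        "cell_ontology_id": f"{labelset}--cell_ontology_id",
    }
    columns = {column: {} for column in field_to_column.values()}
    for cell_id, prediction in predictions.items():
        for field, column in field_to_column.items():
            columns[column][cell_id] = prediction[field]
    return columns


predictions = {
    "cell_0001": {  # illustrative cell ID, not from a real dataset
        "label": "gamma delta T-Cell",
        "conf_score": 0.955991,
        "cell_ontology_term": "gamma-delta T Cell",
        "cell_ontology_id": "CL:0000798",
    },
}
obs_columns = prediction_to_obs_columns("CellTypist", predictions)
for name, values in obs_columns.items():
    print(name, values)
```

With AnnData, each of these columns could presumably be wrapped in a pandas Series indexed by cell ID and assigned to `adata.obs`; that step is left out so the sketch stays dependency-free.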

Challenge: can we keep the JSON Schema validation model in this case? The only way to express this in JSON Schema would be as an object that directly links each cell ID to its conf-score key-value pair.
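As a rough illustration of what such an object could look like, the sketch below writes a JSON Schema fragment as a Python dict and checks a few entries against it by hand. The property names `cell_id` and `confidence`, and the assumption that scores lie between 0 and 1, are illustrative choices rather than anything the schema currently defines. The checker only mirrors this one fragment; it is not a general JSON Schema validator.

```python
import json

# Assumed shape of one per-cell entry (property names are hypothetical).
CELL_ENTRY_SCHEMA = {
    "type": "object",
    "properties": {
        "cell_id": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["cell_id", "confidence"],
    "additionalProperties": False,
}
CELLS_SCHEMA = {"type": "array", "items": CELL_ENTRY_SCHEMA}


def is_valid_cell_entry(entry):
    """Hand-rolled check that mirrors CELL_ENTRY_SCHEMA only."""
    if not isinstance(entry, dict):
        return False
    if set(entry) != {"cell_id", "confidence"}:
        return False
    if not isinstance(entry["cell_id"], str):
        return False
    confidence = entry["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return 0 <= confidence <= 1


cells = [
    {"cell_id": "cell_0001", "confidence": 0.955991},
    {"cell_id": "cell_0002", "confidence": 1.0},
]
all_valid = all(is_valid_cell_entry(cell) for cell in cells)
print(all_valid)
print(json.dumps(CELLS_SCHEMA, indent=2))
```

The trade-off is that validation then runs over one object per cell, which is where the size concern comes from.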

evanbiederstedt commented 10 months ago

Most annotation transfer tools transfer to single cells, giving a confidence score per cell.

I think they all transfer to individual cells, with predictions (i.e. cell type labels) and prediction scores per cell.

Each "prediction" is at the "cell annotation set" (labelset) level though.

obs['CellTypist']: "gamma delta T-Cell"
obs['CellTypist--conf-score']: 0.955991
obs['CellTypist--cell_ontology_term']: "gamma-delta T Cell"
obs['CellTypist--cell_ontology_id']: "CL:0000798"

uns['cell_annotation_schema']:
{
  "labelsets": [
    {
      "name": "CellTypist",
      "type": "automated",
      "algorithm": "CellTypist",
      "algorithm_url": "https://celltypist.org",
      "model": "Immune_All_Low.pkl",
      "majority_voting": True
    }
  ]
}

I'm fine with this, given the assumption that CellTypist is the name of the cell annotation set (labelset).

dosumis commented 10 months ago

Each "prediction" is at the "cell annotation set" (labelset) level though.

I'm not sure I follow. The predictions define new cell sets (labelset + value).

The main issue is that individual cells may have multiple, potentially mutually incompatible predictions. It is my understanding that these are used collectively (with scores) to judge final annotation of cell sets in BICAN - built up out of a fixed set of clusters. Final annotation and associated ontology term may again be different from predictions. Surfacing all of these as obs is potentially very confusing to downstream users - even though they are valuable as evidence for final annotation.
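To illustrate how conflicting per-cell predictions might be pooled as evidence at the cluster level, here is a minimal sketch. How BICAN actually combines predictions is not described in this thread; the summed-confidence ranking below is purely an assumption, and the tool names, labels, and cell IDs are made up.

```python
from collections import defaultdict


def summarize_cluster_evidence(cell_to_cluster, predictions):
    """Pool per-cell predictions into ranked evidence per cluster.

    `predictions` is an iterable of (cell_id, tool, label, score) tuples.
    Returns {cluster: [((tool, label), summed_score), ...]}, with the
    highest summed score first.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for cell_id, tool, label, score in predictions:
        cluster = cell_to_cluster[cell_id]
        totals[cluster][(tool, label)] += score
    return {
        cluster: sorted(scores.items(), key=lambda item: item[1], reverse=True)
        for cluster, scores in totals.items()
    }


# Two hypothetical tools that disagree on one cell of the same cluster.
cell_to_cluster = {"c1": "cluster_1", "c2": "cluster_1"}
predictions = [
    ("c1", "tool_a", "label_A", 0.9),
    ("c2", "tool_a", "label_A", 0.8),
    ("c1", "tool_b", "label_B", 0.6),
    ("c2", "tool_b", "label_A", 0.7),
]
evidence = summarize_cluster_evidence(cell_to_cluster, predictions)
for cluster, ranked in evidence.items():
    print(cluster, ranked)
```

A summary like this keeps the raw predictions available as evidence without surfacing every one of them as a separate obs column.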

dosumis commented 10 months ago

Test implementations:

dosumis commented 10 months ago

Discussion from sprint call.

Issue - not all tools have sufficiently rich output.

Needs a more complete functional specification for what tools need to provide.

Lydia:

Kyle:

evanbiederstedt commented 10 months ago

The predictions define new cell sets (labelset + value).

Agreed

The main issue is that individual cells may have multiple, potentially mutually incompatible predictions. It is my understanding that these are used collectively (with scores) to judge final annotation of cell sets in BICAN - built up out of a fixed set of clusters. Final annotation and associated ontology term may again be different from predictions. Surfacing all of these as obs is potentially very confusing to downstream users - even though they are valuable as evidence for final annotation.

It sounds like it's then a new "cell annotation set/labelset". Probably only the final evidence will be useful for users.

Surfacing all of these as obs is potentially very confusing to downstream users

If it's not in the AnnData file (or there's no tooling to port it into the AnnData file), I promise you no computational biologist will use it.

It sounds to me like much of this is an intermediate result which doesn't need to be placed into the "final" AnnData.

dosumis commented 10 months ago

More mockups, using the results of an annotation transfer with MapMyCells: confidence is stored alongside each cell_id instead of storing a plain list of cell_ids. Bloat is a bit of an issue. On the other hand, JSON compresses very well.

{  
  "dataset": "cxg_dataset:bf6a5c78-5a2e-4e34-93f3-7be5d127d879",   
  "labelsets": [
    {
      "name": "MapMyCells_10x_mouse_subclass",  
      "description": "Annotation transfer from 10X whole mouse brain CCN20230722 using Map My Cells.",  
      "annotation_method": "algorithmic",  
      "automated_annotation": {  
        "algorithm_name": "Map My Cells - hierarchical mapping", 
        "source_taxonomy": "CCN20230722"  
      }  
    }  
  ],  
  "annotations": [  
    {  
      "labelset": "map_my_cells_CCN20230722",  
      "cell_label": "004 L6 IT CTX Glut",  
      "hash_accession": "234d8d6eb87f", // using blake_2b on cell IDs
      "annotation_transfer": {  
         "source_accession": "CS20230722_SUBC_004"  
      },  
      "cells": [ // Rather than store cell_ids we could store cell_id dicts for metadata that only makes sense at cell level.  
        {  
          "cell_id": "20171204_sample_4",  
          "confidence": 1.0  
        },  
        {  
          "cell_id": "20171207_sample_7",  
          "confidence": 1.0  
        },  
        {  
          "cell_id": "20180102_sample_2",  
          "confidence": 1.0  
        }
        // ... (remaining cells elided)
      ]  
    }
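The mockup above can be approximated in code to get a feel for the hashing and for the bloat question. This is a sketch under stated assumptions: the thread only says the accession uses blake2b on cell IDs, so joining the sorted IDs with newlines and using a 6-byte digest (to get 12 hex characters, like the example value) are guesses, and the helper names and synthetic cell IDs are hypothetical.

```python
import gzip
import hashlib
import json


def hash_cell_ids(cell_ids, digest_size=6):
    """Hash a set of cell IDs with blake2b.

    Assumption: IDs are sorted and newline-joined before hashing, so the
    result does not depend on input order. digest_size=6 yields 12 hex chars.
    """
    joined = "\n".join(sorted(cell_ids)).encode("utf-8")
    return hashlib.blake2b(joined, digest_size=digest_size).hexdigest()


def build_annotation(labelset, cell_label, confidences):
    """Build one annotation record with per-cell confidence dicts."""
    cells = [
        {"cell_id": cell_id, "confidence": confidence}
        for cell_id, confidence in sorted(confidences.items())
    ]
    return {
        "labelset": labelset,
        "cell_label": cell_label,
        "hash_accession": hash_cell_ids(confidences),
        "cells": cells,
    }


# Synthetic cell IDs and scores, only for measuring size.
confidences = {f"sample_{i:05d}": 1.0 for i in range(10_000)}
annotation = build_annotation(
    "map_my_cells_CCN20230722", "004 L6 IT CTX Glut", confidences
)

raw = json.dumps(annotation).encode("utf-8")
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
```

Synthetic IDs with a constant score compress far better than real data would, so the ratio printed here is an optimistic bound rather than a measurement.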