Support annotation transfer provenance on the single cell level

cellannotation / cell-annotation-schema

General, open-standard schema for cell annotations

Support annotation transfer provenance on the single cell level #61

Closed dosumis closed 4 months ago

dosumis commented 10 months ago

Most annotation transfer tools transfer to single cells, giving a confidence score per cell. Supporting this requires working at the single cell (obs) level.

Examples:

CellTypist adds the following obs:

https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial.ipynb

CellTypist terms are also mapped to CL - some exactly:

e.g. https://www.celltypist.org/encyclopedia/Immune/v2/?celltype=gamma-delta%20T%20cells

Suggested schema compliance:

obs['CellTypist']: "gamma delta T-Cell"
obs['CellTypist--conf-score']: 0.955991
obs['CellTypist--cell_ontology_term']: "gamma-delta T Cell"
obs['CellTypist--cell_ontology_id']: "CL:0000798"

uns['cell_annotation_schema']:
{
  "labelsets": [
    {
      "name": "CellTypist",
      "type": "automated",
      "algorithm": "CellTypist",
      "algorithm_url": "https://celltypist.org",
      "model": "Immune_All_Low.pkl",
      "majority_voting": True
    }
  ]
}

Note that there are two keys from outside the schema, used to encode tool-specific information.
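A minimal sketch of the suggested naming pattern, using plain Python dicts as stand-ins for AnnData `obs` and `uns` (an assumption, to keep the example dependency-free). The helper names `celltypist_style_obs` and `labelset_obs_keys` are hypothetical and not part of the schema or of CellTypist.

```python
def celltypist_style_obs(labelset, label, conf_score, cl_term, cl_id):
    """Return obs-style columns for one cell, keyed as '<labelset>--<suffix>'."""
    return {
        labelset: label,
        f"{labelset}--conf-score": conf_score,
        f"{labelset}--cell_ontology_term": cl_term,
        f"{labelset}--cell_ontology_id": cl_id,
    }


def labelset_obs_keys(uns, obs_columns):
    """Group obs column names under the labelset (declared in uns) they belong to."""
    grouped = {}
    for labelset in uns["cell_annotation_schema"]["labelsets"]:
        name = labelset["name"]
        grouped[name] = [
            col for col in obs_columns
            if col == name or col.startswith(name + "--")
        ]
    return grouped


# Values copied from the example in this issue.
row = celltypist_style_obs(
    "CellTypist", "gamma delta T-Cell", 0.955991,
    "gamma-delta T Cell", "CL:0000798",
)
uns = {
    "cell_annotation_schema": {
        "labelsets": [
            {
                "name": "CellTypist",
                "type": "automated",
                "algorithm": "CellTypist",
                "algorithm_url": "https://celltypist.org",
                "model": "Immune_All_Low.pkl",   # tool-specific key
                "majority_voting": True,          # tool-specific key
            }
        ]
    }
}
grouped = labelset_obs_keys(uns, list(row))
```

The `--` separator lets a consumer recover which obs columns belong to which labelset by name alone, which is the only link between `obs` and `uns` in this layout.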

Challenge: can we keep the JSON Schema validation model in this case? The only way to express this in JSON Schema would be as an object that directly links each cell ID with a conf-score key-value pair.
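As a rough illustration of what such an object could look like, here is a hypothetical JSON Schema fragment written as a Python dict, together with a small hand-rolled check. The key names `cell_id` and `confidence`, the 0 to 1 range, and the function `check_cell_entry` are assumptions for illustration; the check is a stand-in covering only the keywords used here, not a full JSON Schema validator.

```python
# Hypothetical schema fragment: one object per cell, pairing its ID with a score.
CELL_ENTRY_SCHEMA = {
    "type": "object",
    "properties": {
        "cell_id": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["cell_id", "confidence"],
    "additionalProperties": False,
}


def check_cell_entry(entry, schema=CELL_ENTRY_SCHEMA):
    """Return a list of problems found in one cell entry (empty list means OK)."""
    if not isinstance(entry, dict):
        return ["entry is not an object"]
    errors = []
    for key in schema["required"]:
        if key not in entry:
            errors.append(f"missing required key: {key}")
    for key, value in entry.items():
        rule = schema["properties"].get(key)
        if rule is None:
            if not schema.get("additionalProperties", True):
                errors.append(f"unexpected key: {key}")
        elif rule["type"] == "string":
            if not isinstance(value, str):
                errors.append(f"{key} must be a string")
        elif rule["type"] == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{key} must be a number")
            elif not rule["minimum"] <= value <= rule["maximum"]:
                errors.append(f"{key} out of range")
    return errors


ok = check_cell_entry({"cell_id": "cell_0001", "confidence": 0.955991})
missing = check_cell_entry({"cell_id": "cell_0001"})
```

In practice a real validator library would replace `check_cell_entry`; the point is only that per-cell scores remain expressible in JSON Schema once each cell becomes an object rather than a bare ID.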

evanbiederstedt commented 10 months ago

> Most annotation transfer tools transfer to single cells, giving a confidence score per cell.

I think they all transfer to individual cells, with predictions (i.e. cell type labels) and prediction scores per cell.

Each "prediction" is at the "cell annotation set" (labelset) level though.

> obs['CellTypist']: "gamma delta T-Cell"
> obs['CellTypist--conf-score']: 0.955991
> obs['CellTypist--cell_ontology_term']: "gamma-delta T Cell"
> obs['CellTypist--cell_ontology_id']: "CL:0000798"
>
> uns['cell_annotation_schema']:
> {
>   "labelsets": [
>     {
>       "name": "CellTypist",
>       "type": "automated",
>       "algorithm": "CellTypist",
>       "algorithm_url": "https://celltypist.org",
>       "model": "Immune_All_Low.pkl",
>       "majority_voting": True
>     }
>   ]
> }

I'm fine with this, given the assumption that CellTypist is the name of the cell annotation set (labelset).

dosumis commented 10 months ago

> Each "prediction" is at the "cell annotation set" (labelset) level though.

I'm not sure I follow. The predictions define new cell sets (labelset + value).

The main issue is that individual cells may have multiple, potentially mutually incompatible predictions. It is my understanding that these are used collectively (with scores) to judge the final annotation of cell sets in BICAN, which are built up out of a fixed set of clusters. The final annotation and its associated ontology term may again differ from the predictions. Surfacing all of these as obs is potentially very confusing to downstream users, even though they are valuable as evidence for the final annotation.
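To make the "used collectively (with scores)" idea concrete, here is one hypothetical way per-cell predictions from several tools could be rolled up per cluster. This is a sketch only: summing confidences per label is an assumed aggregation rule, not the BICAN procedure, and `OtherTool`, the cluster name, the cell IDs, and the scores are made up for illustration.

```python
from collections import defaultdict


def rank_labels_per_cluster(predictions, cluster_of):
    """Sum confidence per (cluster, label) and rank labels within each cluster.

    predictions: iterable of dicts with keys cell_id, labelset, label, confidence.
    cluster_of:  dict mapping cell_id to a cluster name.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for pred in predictions:
        cluster = cluster_of[pred["cell_id"]]
        totals[cluster][pred["label"]] += pred["confidence"]
    # Highest summed confidence first; ties broken alphabetically by label.
    return {
        cluster: sorted(label_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        for cluster, label_scores in totals.items()
    }


cluster_of = {"c1": "cluster_1", "c2": "cluster_1"}
predictions = [
    {"cell_id": "c1", "labelset": "CellTypist", "label": "gamma-delta T cell", "confidence": 0.75},
    {"cell_id": "c2", "labelset": "CellTypist", "label": "gamma-delta T cell", "confidence": 0.5},
    {"cell_id": "c1", "labelset": "OtherTool", "label": "NK cell", "confidence": 0.5},
    {"cell_id": "c2", "labelset": "OtherTool", "label": "gamma-delta T cell", "confidence": 0.25},
]
ranked = rank_labels_per_cluster(predictions, cluster_of)
```

Here cell `c1` carries two incompatible predictions, yet only the ranked summary per cluster would need to be surfaced; the per-cell rows stay available as evidence.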

dosumis commented 10 months ago

Test implementations:

- Implement for the test case in this Colab notebook: https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial.ipynb (needs to be shared with Chuan Xu, CellTypist dev)
- @hkir-dev to implement for Basal Ganglion example from @UCDNJJ
- MapMyCells: run https://cellxgene.cziscience.com/e/bf6a5c78-5a2e-4e34-93f3-7be5d127d879.cxg/ (small mouse patch-seq dataset from cortex) against the whole mouse brain. The JSON returned should have sufficient information for conversion to CAS.

dosumis commented 10 months ago

Discussion from sprint call.

Issue - not all tools have sufficiently rich output.

Needs a more complete functional specification for what tools need to provide.

Lydia:

First priority - output identifiers.

Kyle:

It seems silly to add a bunch of additional fields to obs that don't need to be there. The only thing that needs to be there is confidence score.
Lydia: even this could be split out from AnnData obs.

evanbiederstedt commented 10 months ago

> The predictions define new cell sets (labelset + value).

Agreed

> The main issue is that individual cells may have multiple, potentially mutually incompatible predictions. It is my understanding that these are used collectively (with scores) to judge the final annotation of cell sets in BICAN, which are built up out of a fixed set of clusters. The final annotation and its associated ontology term may again differ from the predictions. Surfacing all of these as obs is potentially very confusing to downstream users, even though they are valuable as evidence for the final annotation.

It sounds like it's then a new "cell annotation set/labelset". Probably only the final evidence will be useful for users.

> Surfacing all of these as obs is potentially very confusing to downstream users

If it's not in the AnnData file (or there's no tooling to port it into the AnnData file), I promise you no computational biologist will use it.

It sounds to me like much of this is an intermediate result which doesn't need to be placed into the "final" AnnData.

dosumis commented 10 months ago

More mockups, using the results of an annotation transfer with MapMyCells: storing confidence alongside cell_ids instead of storing a plain list of cell_ids. Bloat is a bit of an issue. On the other hand, JSON compresses very well.
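A rough sketch of the trade-off, building an annotation with per-cell dicts and comparing raw and gzip-compressed sizes. Several details are assumptions: the cell IDs are synthetic, the 6-byte digest (12 hex characters) and the sorted, newline-separated hashing of cell IDs are guesses at how a blake2b accession might be derived, and the function names are hypothetical.

```python
import gzip
import hashlib
import json


def hash_accession(cell_ids, digest_size=6):
    """blake2b over the sorted cell IDs; digest_size=6 yields 12 hex characters."""
    digest = hashlib.blake2b(digest_size=digest_size)
    for cell_id in sorted(cell_ids):
        digest.update(cell_id.encode("utf-8"))
        digest.update(b"\n")  # separator so that IDs cannot run together
    return digest.hexdigest()


def build_annotation(labelset, cell_label, confidence_by_cell):
    """Store one dict per cell (ID plus confidence) instead of a bare ID list."""
    cells = [
        {"cell_id": cell_id, "confidence": confidence}
        for cell_id, confidence in sorted(confidence_by_cell.items())
    ]
    return {
        "labelset": labelset,
        "cell_label": cell_label,
        "hash_accession": hash_accession(confidence_by_cell),
        "cells": cells,
    }


# Synthetic cell IDs, for size comparison only.
confidence_by_cell = {f"synthetic_cell_{i:05d}": 1.0 for i in range(1000)}
annotation = build_annotation(
    "MapMyCells_10x_mouse_subclass", "004 L6 IT CTX Glut", confidence_by_cell
)
raw = json.dumps(annotation).encode("utf-8")
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
```

Because every cell entry repeats the same keys, the structure is highly redundant, which is what makes it compress well; real cell IDs and non-uniform confidences will compress less than this synthetic case.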

{  
  "dataset": "cxg_dataset:bf6a5c78-5a2e-4e34-93f3-7be5d127d879",   
  "labelsets": [
    {
      "name": "MapMyCells_10x_mouse_subclass",  
      "description": "Annotation transfer from 10X whole mouse brain CCN20230722 using Map My Cells.",  
      "annotation_method": "algorithmic",  
      "automated_annotation": {  
        "algorithm_name": "Map My Cells - hierarchical mapping", 
        "source_taxonomy": "CCN20230722"  
      }  
    }  
  ],  
  "annotations": [  
    {  
      "labelset": "map_my_cells_CCN20230722",  
      "cell_label": "004 L6 IT CTX Glut",  
      "hash_accession": "234d8d6eb87f", // using blake_2b on cell IDs
      "annotation_transfer": {  
         "source_accession": "CS20230722_SUBC_004"  
      },  
      "cells": [ // Rather than store cell_ids we could store cell_id dicts for metadata that only makes sense at cell level.  
        {  
          "cell_id": "20171204_sample_4",  
          "confidence": 1.0  
        },  
        {  
          "cell_id": "20171207_sample_7",  
          "confidence": 1.0  
        },  
        {  
          "cell_id": "20180102_sample_2",  
          "confidence": 1.0  
        },
        ...
      ]  
    }