Closed kira-neller closed 6 months ago
@kira-neller Is this still an open question?
Given that CellProcessing precedes sequencing, I'm guessing this is out of scope for that object. Though maybe it's worth copying the annotation to the Cell object and clarifying there.
IMO this should be addressed by #700
The original question concerned cell phenotype annotation based on GEX markers.
@bussec I'm not sure this would be addressed by #700 because the changes proposed there do not include a secondary metadata annotation (i.e. interpretation) of cell phenotype by the researcher, rather a primary quantification of all available markers.
@bcorrie Did you have anything to add? Is this question still relevant?
@bussec #700 is adding something to CellExpression. I think this is referring to adding something to Cell, no?
At risk of exposing my lack of deep knowledge of these technologies ... if one does feature barcoding to identify specific cell phenotypes:
C0063 FB_CD45RA Antibody Capture
one can use the count info to infer a Cell's phenotype. Currently there is no way in the Cell object to capture that experimental outcome. There is no cell_phenotype field in the Cell object. Not sure if there should be, but I think this is a different question than #700.
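To make the feature-barcoding case concrete, here is a minimal sketch of how per-cell antibody-capture counts could be turned into a marker string of the kind cell_phenotype holds. The FB_ prefix handling, the fixed threshold of 10, and the function name are assumptions for illustration only; real analyses would normalize counts and choose thresholds per marker.

```python
from typing import Dict, Optional


def phenotype_from_counts(counts: Dict[str, int], threshold: int = 10) -> Optional[str]:
    """Build a marker string such as 'CD27- CD45RA+' from antibody-capture counts.

    `counts` maps feature names (e.g. 'FB_CD45RA') to raw counts for one cell.
    The threshold is a hypothetical cut-off, not a recommended value.
    """
    if not counts:
        return None  # nothing measured, so leave the field null
    calls = []
    for feature in sorted(counts):
        # Strip the illustrative 'FB_' prefix if present.
        marker = feature[3:] if feature.startswith("FB_") else feature
        sign = "+" if counts[feature] >= threshold else "-"
        calls.append(f"{marker}{sign}")
    return " ".join(calls)


# One cell's counts (made-up numbers).
cell_counts = {"FB_CD45RA": 42, "FB_CD27": 3}
cell_phenotype = phenotype_from_counts(cell_counts)
```

The resulting string would be stored alongside the cell identifier, while the method used to derive it would be described elsewhere in the metadata.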
@bcorrie @kira-neller You are right, these are different questions:
As discussed last week, there is CellProcessing.cell_subset as a property to annotate cell populations. However, there are two problems in using this property:
Repertoire is very hierarchical, as it was mainly designed to represent bulk sequencing workflows. Therefore you cannot create many CellProcessing objects and bring them together again in a single downstream object. And I doubt that a RepertoireGroup would really help here ;-)
Long story short, we should introduce a cell_subset property as part of Cell, which can hold such classifications that were only obtained during the analysis of the data.
@bussec do we try and tackle this for v2.0? Do we just add something like:
cell_subset:
    $ref: '#/Ontology'
    description: Commonly-used designation of inferred cell type
    title: Cell subset
    example:
        id: CL:0000972
        label: class switched memory B cell
    x-airr:
        miairr: important
        nullable: true
        adc-query-support: true
        set: 3
        subset: process (cell)
        name: Cell subset
        format: ontology
        ontology:
            draft: false
            top_node:
                id: CL:0000542
                label: lymphocyte
cell_phenotype:
    type: string
    description: List of genes and expression levels used to classify the cell phenotype.
    title: Cell subset phenotype
    example: CD19+ CD38+ CD27+ IgM- IgD-
    x-airr:
        miairr: important
        nullable: true
        adc-query-support: true
        set: 3
        subset: process (cell)
        name: Cell subset phenotype
I changed the wording of the two fields, but they are essentially from CellProcessing.
@bussec what are your thoughts on this? It would be easy to add the above to the Cell object, which would allow the annotation of cell phenotype at the cell level.
One should be able to capture the fact that this was done (and how) in either the CellProcessing (using cell_isolation or cell_processing_protocols) or DataProcessing (using data_processing_protocols) objects associated with the Repertoire, no?
@bcorrie There are two potential problems that I see with the structure you suggested (but you might want to talk to other 10X data producers about this, too):
cell_phenotype was designed as a flow cytometry based field, where you have several dozen markers and a clear understanding of whether they are relevant for classifying your cell. For single-cell transcriptomes, however, we basically always use dimensional reduction in some shape or form, which then creates some kind of composite dimensions. This will often make it very difficult to come up with a concise list of markers that are necessary and sufficient for the classification. Therefore I would suggest dropping this field for now.
cell_subset might not be the best decision, as I am not sure how good the mapping between the cell type classification of the tools and CL is. It's probably worthwhile to do a quick check whether you could unambiguously assign matching CL concepts to the dataset that sparked this issue... if there is too much loss of information we should think about just having a string here for now.
I suggest we leave this out of v2.0 if the change isn't trivial.
I think there are actually two distinct use cases here. One is CITEseq (feature barcoding) as @bcorrie mentioned above. That behaves (or can be used) much more like flow cytometry data. (In practice, though, I think it is typically combined with transcriptomic data using wnn or similar algorithms.)
The other is a pure transcriptomic identification. Here I would push back on @bussec slightly, as if all you had were unsupervised clusters of dimensionally-reduced data, I don't think you should be using any kind of cell_phenotype in the first place. OTOH, if I identify cluster 8 as class-switched memory B cells based on the upregulation of CD20, CD27, and IgG plus the down-regulation or absence of IgM and CD21, then I think that mitigates some of your objections.
@scharch I think the use of a tool that clusters and classifies data based on cell-specific training (e.g. CellTypist) would fall into this case, no? So presumably I could run CellTypist, classify each Cell, and then store that as a computed cell phenotype.
Hmmmm now that you mention it, that seems to strengthen @bussec's case: an annotation from CellTypist or similar isn't explicitly measuring particular genes.
So then I think @bussec was right: make Cell_expression.cell_subset free text, instead of an ontology, and drop cell_phenotype.
What @bcorrie suggests looks good to me. I'm not convinced by the arguments for why those fields should be relaxed or removed. My understanding is that these are inferred properties of the Cell, so if the inference is ambiguous or imprecise or you just don't think it's informative, then leave them null. The inference may be incorrect, but it may be useful.
The clustering can be supervised or unsupervised. In the supervised case, the list of genes is given, so put them in cell_phenotype. In the unsupervised case, the algorithm will give the combination of genes, so put them in cell_phenotype. If you want to know if one method is used versus the other, then add a third field with some enums that provide that info. If you don't like cell_phenotype because the name implies something different to you, then change the field name to cell_gene_markers or something.
cell_subset should stay cell ontology; making it free text doesn't seem to help anything. If you cannot give a specific cell type then use a more general term. If you have no idea, leave it null.
What if CellTypist or Azimuth assigns cell types that aren't in the ontology?
There are multiple ways to answer. The first thought is to ask, are these tools limited to lymphocytes or to any cell type? If the latter, then yes there is the larger issue that we are still discovering and annotating new cell types (i.e. the various human cell atlas projects). These cell types need to be added to the ontology over time. The cell ontology has a hierarchy of cell types, so one should try to be as specific as possible, but in the extreme worst case it's just a cell. Or in the former case the worst case is just lymphocyte.
The other question is if the tool isn't assigning an ontology cell type, then what is it doing? Is it just a label like cell_type_123 that has no meaning? I would argue that providing that in cell_subset as an open text field doesn't offer much usefulness for annotation. In this case, it is likely the combination of gene markers (cell_phenotype) that really is the definition. I think then for us we'd want to provide guidance to tools: is it better to use a generic term like lymphocyte, or leave it null to say that it doesn't really know?
There is also burden on the researcher. If you discover a new cell type, which would be exciting, you really should go to CL and say hey I think this is new, please add a term for it. If anything, they can give you a CL ID which you can use, put in your paper, etc.
There is also the possibility that the output of these tools is "garbage", not meaningless but not a true unique cell type but instead a mishmash of existing types. Honestly, this might be the most common case. In that case, I would suggest that instead of loosening cell_subset, another field is added like cell_type_description or cell_phenotype_description that allows you to describe your interpretation of the cell type.
The other question is if the tool isn't assigning an ontology cell type, then what is it doing?
I think it's just competing definition systems. Like, it wouldn't be hard to turn CellTypist labels into a hierarchical ontology, but AFAIK they are not talking to CL and so there's no guarantee of correspondence between the available tags.
I think a combination of "most specific possible" and adding a cell_type_description would work if we document that properly.
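A rough sketch of that combination, assuming a tool reports its labels from most specific to most general: pick the first label that maps to a CL term for cell_subset and keep the tool's original label as free text. The mapping table, the example tool labels, and the function name are hypothetical, and the CL IDs should be checked against the ontology before use.

```python
from typing import Dict, List, Optional

# Illustrative CL terms only; verify IDs and labels against the Cell Ontology.
CL_LABELS: Dict[str, str] = {
    "CL:0000972": "class switched memory B cell",
    "CL:0000236": "B cell",
    "CL:0000542": "lymphocyte",
}

# Hypothetical mapping from a tool's labels to CL IDs.
TOOL_TO_CL: Dict[str, str] = {
    "Class-switched memory B cells": "CL:0000972",
    "B cells": "CL:0000236",
}


def annotate(labels_specific_to_general: List[str]) -> Dict[str, Optional[object]]:
    """Return cell_subset (most specific mappable CL term) plus a free-text label."""
    description = labels_specific_to_general[0] if labels_specific_to_general else None
    for label in labels_specific_to_general:
        cl_id = TOOL_TO_CL.get(label)
        if cl_id is not None:
            subset = {"id": cl_id, "label": CL_LABELS[cl_id]}
            return {"cell_subset": subset, "cell_type_description": description}
    # No label maps to CL: leave the ontology field null, keep the description.
    return {"cell_subset": None, "cell_type_description": description}


mapped = annotate(["Age-associated B cells", "B cells"])
unmapped = annotate(["cell_type_123"])
```

Here the unmapped fine-grained label falls back to the broader B cell term, while a label with no known ancestor leaves cell_subset null rather than guessing.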
I see. I searched around to see if there was a mapping between the two and this paper says: "CellTypist uses an expandable cross-tissue cell reference before predicting cell identities with a logistic regression-based label transfer pipeline, with all derived cell types directly interpretable by CL[48]" with reference to this paper. I don't know if that's for real or just pie in the sky.
I could see taking CL as a base and then adding annotations to help the program, as those annotations would not go into the ontology itself, but it would be unfortunate if they threw away the cross references.
Annotation of cell types can occur through many routes - surface proteins, individual gene markers, clustering and differential expression between clusters, predictions from pretrained models, integration/label transfer from reference datasets, etc. I don't think we need to worry about how the cell type annotations were determined to define a field for them (leave that to DataProcessing).
Having Cell Ontology terms in a cell_type field is super valuable for working with multiple datasets and it'd be nice to encourage it. However, CL isn't always going to be granular enough for every study. Lots of single-cell work involves classifying novel (or study-specific) cell types or states, which CL won't have covered. It'd still be nice to have the appropriate parent term though, so you have something you can use to harmonize across data sets, even if it's not as granular as the study authors' original annotations.
A lot of the time, we end up using multiple layers of annotation, like:
cell_type_level_1: "Epithelial"
cell_type_level_2: "AT2"
cell_type_level_3: "Transitional AT2"
Cell Ontology is good for eliminating the need for level 1 and covering level 2, but I think we'll probably still need a free text field in addition to the CL field if we want to include fields for cell type annotation in Cell (which seems like something we should include). cell_type_description doesn't seem like the right semantics to me though, because the free text field is still an annotation.
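As a small illustration of why carrying both a CL term and a free-text annotation helps, the sketch below groups cells from two made-up studies by their CL ID while preserving each study's own label. The class, field names, and cell identifiers are hypothetical and are not part of the schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CellAnnotation:
    cell_id: str
    cl_id: Optional[str]  # ontology term, or None if no suitable term exists
    author_label: str     # study-specific free-text annotation


def group_by_cl(cells: List[CellAnnotation]) -> Dict[Optional[str], List[str]]:
    """Group cell IDs by CL term so datasets can be compared at that level."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for cell in cells:
        groups[cell.cl_id].append(cell.cell_id)
    return dict(groups)


cells = [
    CellAnnotation("study1_cell1", "CL:0000972", "cluster 8, IgG+ CD27+"),
    CellAnnotation("study2_cell7", "CL:0000972", "switched memory, activated"),
    CellAnnotation("study2_cell9", None, "cell_type_123"),
]
harmonized = group_by_cl(cells)
```

The two studies use different free-text labels, yet the shared CL term lets their cells be pooled; the cell without a CL term stays in its own null group.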
I searched around to see if there was a mapping between the two
100 brownie points for doing actual work instead of making lazy assumptions like me :)
Nice! A brownie and coffee sounds good right about now... Total digression but I was at NIH last week for a program meeting, and we met up with Richard Scheuermann (who's now at NLM). He's doing his own cell type discovery with single-cell and was comparing his results against the human cell atlas results and was finding some "intriguing" (that's likely not the word he used ;-) differences. Anyways, I think cell types are going to be a hot topic over the next 5 years if they aren't already.
Hello, I am curating 10X single-cell data for repertoire loading into iReceptor and I have a question about metadata annotation.
Regarding the cell_phenotype field, I am wondering how to represent studies that identify different cell types using RNA-seq based methods, i.e. through expression profile clustering followed by assessment of marker gene expression in clusters. This paired expression data can be obtained with the 10X platform in tandem with V(D)J sequencing.
@bussec my understanding is that this is a “secondary” metadata annotation, and currently cell_phenotype specifies markers used in flow cytometric-based isolation methods, but this is an ongoing discussion related to Work Package 7. Are there any additional comments/updates on this?