airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Representing cell_phenotype using RNA-seq marker gene expression #477

Closed kira-neller closed 6 months ago

kira-neller commented 4 years ago

Hello, I am curating 10X single-cell data for repertoire loading into iReceptor and I have a question about metadata annotation.

Regarding the cell_phenotype field, I am wondering how to represent studies that identify different cell types using RNA-seq based methods, i.e. through expression profile clustering followed by assessment of marker gene expression in clusters. This paired expression data can be obtained with the 10X platform in tandem with V(D)J sequencing.

@bussec my understanding is that this is a “secondary” metadata annotation, and currently cell_phenotype specifies markers used in flow cytometric-based isolation methods, but this is an ongoing discussion related to Work Package 7. Are there any additional comments/updates on this?

scharch commented 1 year ago

@kira-neller Is this still an open question?

javh commented 1 year ago

Given that CellProcessing precedes sequencing, I'm guessing this is out of scope for that object. Though, maybe it's worth copying the annotation to the Cell object and clarifying there.

bussec commented 1 year ago

IMO this should be addressed by #700

kira-neller commented 1 year ago

The original question concerned cell phenotype annotation based on GEX markers.

@bussec I'm not sure this would be addressed by #700 because the changes proposed there do not include a secondary metadata annotation (i.e. interpretation) of cell phenotype by the researcher, rather a primary quantification of all available markers.

@bcorrie Did you have anything to add? Is this question still relevant?

bcorrie commented 1 year ago

@bussec #700 is adding something to CellExpression. I think this is referring to adding something to Cell, no?

At risk of exposing my lack of deep knowledge of these technologies ... if one does feature barcoding to identify specific cell phenotypes:

C0063   FB_CD45RA       Antibody Capture

one can use the count info to infer a Cell's phenotype. Currently there is no way in the Cell object to capture that experimental outcome. There is no cell_phenotype field in the Cell object. Not sure if there should be, but I think this is a different question than #700.
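
To make the idea concrete, here is a minimal sketch of turning per-cell feature barcode counts into a marker string. The marker names, the counts, and the fixed cutoff of 10 are all hypothetical; real analyses usually normalise the counts and choose cutoffs per marker, so treat this as an illustration only.

```python
def phenotype_from_counts(counts, thresholds):
    """Build a marker string such as 'CD19+ CD27-' from raw per-cell counts.

    counts:     feature name -> antibody capture count for one cell
    thresholds: feature name -> hypothetical cutoff for calling it positive
    """
    calls = []
    for marker, cutoff in thresholds.items():
        # A marker missing from the count table is treated as zero counts.
        sign = "+" if counts.get(marker, 0) >= cutoff else "-"
        calls.append(f"{marker}{sign}")
    return " ".join(calls)


# Hypothetical cutoffs and counts for a single cell.
thresholds = {"CD19": 10, "CD27": 10, "CD45RA": 10}
cell_counts = {"CD19": 120, "CD27": 45, "CD45RA": 2}

print(phenotype_from_counts(cell_counts, thresholds))  # CD19+ CD27+ CD45RA-
```

The resulting string is the kind of per-cell outcome that currently has no home in the Cell object.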

bussec commented 1 year ago

@bcorrie @kira-neller You are right, these are different questions:

As discussed last week, there is the CellProcessing.cell_subset as a property to annotate cell populations. However, there are two problems in using this property:

  1. From a semantic point of view, this property annotates which cell population a researcher attempted to isolate, but this is of course within the limits of the chosen technology (i.e., there can always be contaminations). Also, this property might also only describe a less specific purification step (e.g., all CD19+ CD27+ cells) while a later analysis of single cell data (whether it is from transcriptome, index sorting or 10X Hashtags) could be substantially more detailed. Nevertheless it is important to annotate both types of information - what you wanted and what you got.
  2. Even if you ignored the semantics, the way objects are nested within a Repertoire is very hierarchical, as it was mainly designed to represent bulk sequencing workflows. Therefore you cannot create many CellProcessing objects and bring them together again in a single downstream object. And I doubt that a RepertoireGroup would really help here ;-)

Long story short, we should introduce a cell_subset property as part of Cell, which can hold such classifications that were only obtained during the analysis of the data.

bcorrie commented 8 months ago

@bussec do we try and tackle this for v2.0? Do we just add something like:

        cell_subset:
            $ref: '#/Ontology'
            description: Commonly-used designation of inferred cell type
            title: Cell subset
            example:
                id: CL:0000972
                label: class switched memory B cell
            x-airr:
                miairr: important
                nullable: true
                adc-query-support: true
                set: 3
                subset: process (cell)
                name: Cell subset
                format: ontology
                ontology:
                    draft: false
                    top_node:
                        id: CL:0000542
                        label: lymphocyte
        cell_phenotype:
            type: string
            description: List of genes and expression levels used to classify the cell phenotype.
            title: Cell subset phenotype
            example: CD19+ CD38+ CD27+ IgM- IgD-
            x-airr:
                miairr: important
                nullable: true
                adc-query-support: true
                set: 3
                subset: process (cell)
                name: Cell subset phenotype

I changed the wording of the two fields, but they are essentially from CellProcessing
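
As an illustration only, a Cell record carrying these two fields might look like the sketch below. Only `cell_subset` and `cell_phenotype` come from the fragment above; the `cell_id` value and the `check_cell_annotation` helper are hypothetical, and the check covers just the nullable and ontology-shaped aspects, not the full schema.

```python
# Hypothetical Cell record: the cell_id value is made up for illustration.
cell = {
    "cell_id": "AAACCTGAGACTGTAA-1",
    "cell_subset": {"id": "CL:0000972", "label": "class switched memory B cell"},
    "cell_phenotype": "CD19+ CD38+ CD27+ IgM- IgD-",
}


def check_cell_annotation(record):
    """Return a list of problems with the two proposed fields (empty if none).

    Both fields are nullable, so a missing or None value is accepted.
    """
    errors = []

    subset = record.get("cell_subset")
    if subset is not None:
        if not isinstance(subset, dict) or set(subset) != {"id", "label"}:
            errors.append("cell_subset must be null or an object with id and label")
        elif not str(subset["id"]).startswith("CL:"):
            errors.append("cell_subset.id should be a Cell Ontology ID")

    phenotype = record.get("cell_phenotype")
    if phenotype is not None and not isinstance(phenotype, str):
        errors.append("cell_phenotype must be null or a string")

    return errors


print(check_cell_annotation(cell))
```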

bcorrie commented 8 months ago

@bussec what are your thoughts on this? It would be easy to add the above to the Cell object, which would allow the annotation of cell phenotype at the cell level.

One should be able to capture the fact that this was done (and how) in either the CellProcessing (using cell_isolation or cell_processing_protocols) or DataProcessing (using data_processing_protocols) objects associated with the Repertoire, no?

bussec commented 8 months ago

@bcorrie There are two potential problems that I see with the structure you suggested (but you might want to talk to other 10X data producers about this, too):

  1. cell_phenotype was designed as a flow cytometry based field, where you have several dozen markers and a clear understanding of whether they are relevant for classifying your cell. For single-cell transcriptomes, however, we basically always use dimensional reduction in some shape or form, which then creates some kind of composite dimensions. This will often make it very difficult to come up with a concise list of markers that are necessary and sufficient for the classification. Therefore I would suggest dropping this field for now.
  2. Enforcing CL as an ontology for cell_subset might not be the best decision, as I am not sure how good the mapping between the cell type classification of the tools and CL is. It's probably worthwhile to do a quick check whether you could unambiguously assign matching CL concepts to the dataset that sparked this issue... if there is too much loss of information we should think about just having a string here for now.
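
The quick check suggested in point 2 could be sketched roughly as follows: count how many cells carry a tool label that has a CL counterpart and list the labels that do not. The tool labels and the hand-written `label_to_cl` mapping are hypothetical; the two CL IDs are the ones already quoted in the schema fragment above.

```python
from collections import Counter


def cl_mapping_coverage(cell_labels, label_to_cl):
    """Return (fraction of cells whose label maps to CL, Counter of unmapped labels)."""
    unmapped = Counter(label for label in cell_labels if label not in label_to_cl)
    total = len(cell_labels)
    mapped = total - sum(unmapped.values())
    fraction = mapped / total if total else 0.0
    return fraction, unmapped


# Hypothetical mapping from a classifier's labels to CL IDs.
label_to_cl = {
    "Class-switched memory B cells": "CL:0000972",
    "Lymphocytes": "CL:0000542",
}

# Hypothetical per-cell labels from a classifier run.
cell_labels = [
    "Class-switched memory B cells",
    "Class-switched memory B cells",
    "cluster_8",
    "Plasma cells",
]

fraction, unmapped = cl_mapping_coverage(cell_labels, label_to_cl)
print(fraction)        # 0.5
print(dict(unmapped))  # {'cluster_8': 1, 'Plasma cells': 1}
```

A low fraction, or a long list of unmapped labels, would be the "too much loss of information" case.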

bcorrie commented 8 months ago

I suggest we leave this out of v2.0 if the change isn't trivial.

scharch commented 8 months ago

I think there are actually two distinct use cases here. One is CITEseq (feature barcoding) as @bcorrie mentioned above. That behaves (or can be used) much more like flow cytometry data. (In practice, though, I think it is typically combined with transcriptomic data using wnn or similar algorithms.)

The other is a pure transcriptomic identification. Here I would push back on @bussec slightly, as if all you had were unsupervised clusters of dimensionally-reduced data, I don't think you should be using any kind of cell_phenotype in the first place. OTOH, if I identify cluster 8 as class-switched memory B cells based on the upregulation of CD20, CD27, and IgG plus the down-regulation or absence of IgM and CD21, then I think that mitigates some of your objections.

bcorrie commented 8 months ago

> OTOH, if I identify cluster 8 as class-switched memory B cells based on the upregulation of CD20, CD27, and IgG plus the down-regulation or absence of IgM and CD21, then I think that mitigates some of your objections.

@scharch I think the use of a tool that clusters and classifies data based on cell-specific training (e.g. CellTypist) would fall into this case, no? So presumably I could run CellTypist, classify each Cell, and then store that as the computed cell phenotype.

scharch commented 8 months ago

Hmmmm now that you mention it, that seems to strengthen @bussec's case: an annotation from CellTypist or similar isn't explicitly measuring particular genes.

So then I think @bussec was right: make Cell_expression.cell_subset free text, instead of an ontology, and drop cell_phenotype.

schristley commented 8 months ago

What @bcorrie suggests looks good to me. I'm not convinced by the arguments for why those fields should be relaxed or removed. My understanding is that these are inferred properties of the Cell, so if the inference is ambiguous or imprecise or you just don't think it's informative, then leave them null. The inference may be incorrect, but it may be useful.

The clustering can be supervised or unsupervised. In the supervised case, the list of genes is given, so put them in cell_phenotype. In the unsupervised case, the algorithm will give the combination of genes, so put them in cell_phenotype. If you want to know if one method is used versus the other, then add a third field with some enums that provide that info. If you don't like cell_phenotype because the name implies something different to you, then change the field name to cell_gene_markers or something.

cell_subset should stay cell ontology, making it free text doesn't seem to help anything. If you cannot give a specific cell type then use a more general term. If you have no idea, leave it null.

scharch commented 8 months ago

> cell_subset should stay cell ontology, making it free text doesn't seem to help anything. If you cannot give a specific cell type then use a more general term. If you have no idea, leave it null.

What if CellTypist or Azimuth assigns cell types that aren't in the ontology?

schristley commented 8 months ago

> > cell_subset should stay cell ontology, making it free text doesn't seem to help anything. If you cannot give a specific cell type then use a more general term. If you have no idea, leave it null.
>
> What if CellTypist or Azimuth assigns cell types that aren't in the ontology?

There are multiple ways to answer. The first thought is to ask, are these tools limited to lymphocytes or do they cover any cell type? If the latter, then yes, there is the larger issue that we are still discovering and annotating new cell types (e.g. the various human cell atlas projects). These cell types need to be added to the ontology over time. The cell ontology has a hierarchy of cell types, so the annotation should be as specific as possible, but in the extreme worst case it's just a cell. Or in the former case, the worst case is just lymphocyte.

The other question is, if the tool isn't assigning an ontology cell type, then what is it doing? Is it just a label like cell_type_123 that has no meaning? I would argue that providing that in cell_subset as an open text field doesn't offer much usefulness for annotation. In this case, it is likely the combination of gene markers (cell_phenotype) that really is the definition. I think then for us we'd want to provide guidance to tools: is it better to use a generic term like lymphocyte, or leave it null to say that it doesn't really know?

There is also burden on the researcher. If you discover a new cell type, which would be exciting, you really should go to CL and say hey I think this is new, please add a term for it. If anything, they can give you a CL ID which you can use, put in your paper, etc.

There is also the possibility that the output of these tools is "garbage": not meaningless, but not a true unique cell type either, instead a mishmash of existing types. Honestly, this might be the most common case. In that case, I would suggest that instead of loosening cell_subset, another field is added like cell_type_description or cell_phenotype_description that allows you to describe your interpretation of the cell type.

scharch commented 8 months ago

> The other question is if the tool isn't assigning an ontology cell type, then what is it doing?

I think it's just competing definition systems. Like, it wouldn't be hard to turn CellTypist labels into a hierarchical ontology, but AFAIK they are not talking to CL and so there's no guarantee of correspondence between the available tags.

I think a combination of "most specific possible" and adding a cell_type_description would work if we document that properly.
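
A rough sketch of that combination: resolve a tool label to the most specific CL term available, walk up to a broader label when there is no direct match, fall back to lymphocyte in the worst case, and keep the tool's original label as free text. The tool labels, the `BROADER_LABEL` hierarchy, and the function names are hypothetical; `cell_type_description` is the field name floated earlier in this thread, not an existing schema field. The B cell ID is quoted from memory and should be verified against CL.

```python
# Top node from the proposed schema fragment.
LYMPHOCYTE = {"id": "CL:0000542", "label": "lymphocyte"}

# Hypothetical hand-written mapping from tool labels to CL terms.
LABEL_TO_CL = {
    "Class-switched memory B cells": {
        "id": "CL:0000972",
        "label": "class switched memory B cell",
    },
    # ID quoted from memory; verify against CL before use.
    "B cells": {"id": "CL:0000236", "label": "B cell"},
}

# Hypothetical "broader label" links for tool labels without a CL match.
BROADER_LABEL = {"Age-associated B cells": "B cells"}


def resolve_cl_term(tool_label):
    """Return the most specific CL term reachable from tool_label."""
    label = tool_label
    while label is not None:
        if label in LABEL_TO_CL:
            return LABEL_TO_CL[label]
        label = BROADER_LABEL.get(label)  # None once the chain runs out
    return LYMPHOCYTE  # worst case: the ontology top node


def annotate(tool_label):
    """Pair the resolved CL term with the tool's original label as free text."""
    return {
        "cell_subset": resolve_cl_term(tool_label),
        "cell_type_description": tool_label,
    }


print(annotate("Age-associated B cells"))
print(annotate("cluster_8"))
```

The first call resolves to the broader B cell term while preserving the original label; the second falls back to lymphocyte.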

schristley commented 8 months ago

> > The other question is if the tool isn't assigning an ontology cell type, then what is it doing?
>
> I think it's just competing definition systems. Like, it wouldn't be hard to turn CellTypist labels into a hierarchical ontology, but AFAIK they are not talking to CL and so there's no guarantee of correspondence between the available tags.

I see. I searched around to see if there was a mapping between the two and this paper says: "CellTypist uses an expandable cross-tissue cell reference before predicting cell identities with a logistic regression-based label transfer pipeline, with all derived cell types directly interpretable by CL[48]" with reference to this paper. I don't know if that's for real or just pie in the sky.

I could see taking CL as a base and then adding annotations to help the program, as those annotations would not go into the ontology itself, but it would be unfortunate if they threw away the cross references.

javh commented 8 months ago

Annotation of cell types can occur through many routes - surface proteins, individual gene markers, clustering and differential expression between clusters, predictions from pretrained models, integration/label transfer from reference datasets, etc. I don't think we need to worry about how the cell type annotations were determined to define a field for them (leave that to DataProcessing).

Having Cell Ontology terms in a cell_type field is super valuable for working with multiple datasets and it'd be nice to encourage it. However, CL isn't always going to be granular enough for every study. Lots of single-cell work involves classifying novel (or study-specific) cell types or states, which CL won't have covered. It'd still be nice to have the appropriate parent term though, so you have something you can use to harmonize across data sets, even if it's not as granular as the study authors' original annotations.

A lot of the time, we end up using multiple layers of annotation, like:

Cell Ontology is good for eliminating the need for level 1 and covering level 2, but I think we'll probably still need a free text field in addition to the CL field if we want to include fields for cell type annotation in Cell (which seems like something we should include). cell_type_description doesn't seem like the right semantics to me though, because the free text field is still an annotation.

scharch commented 8 months ago

> I searched around to see if there was a mapping between the two

100 brownie points for doing actual work instead of making lazy assumptions like me :)

schristley commented 8 months ago

> > I searched around to see if there was a mapping between the two
>
> 100 brownie points for doing actual work instead of making lazy assumptions like me :)

Nice! A brownie and coffee sounds good right about now... Total digression but I was at NIH last week for a program meeting, and we met up with Richard Scheuermann (who's now at NLM). He's doing his own cell type discovery with single-cell and was comparing his results against the human cell atlas results and was finding some "intriguing" (that's likely not the word he used ;-) differences. Anyways, I think cell types are going to be a hot topic over the next 5 years if they aren't already.