airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Add property_type to CellExpression #700

Closed bcorrie closed 11 months ago

bcorrie commented 1 year ago

Add a property_type so we can differentiate between the types of properties that exist for a specific Cell

Closes #699

bcorrie commented 1 year ago

We have a 10X study we are working on, and it has the following types of counts in the features.tsv file.

C0258            FB_hash8     Antibody Capture
C0259            FB_hash9     Antibody Capture
C0260            FB_hash10    Antibody Capture
C0531            FB_dex31     Antibody Capture
C0532            FB_dex32     Antibody Capture
C0533            FB_dex33     Antibody Capture
C0063            FB_CD45RA    Antibody Capture
C0148            FB_CCR7      Antibody Capture
C0034            FB_CD3       Antibody Capture
ENSG00000243485  MIR1302-2HG  Gene Expression
ENSG00000237613  FAM138A      Gene Expression
ENSG00000186092  OR4F5        Gene Expression

This is the first study we have processed with some sort of feature barcoding. It includes what we consider "normal" in the studies we have loaded so far, which is 10X "Gene Expression". It also uses feature barcodes for cell phenotype, for sample identification (hashtag feature barcodes), and for epitope specificity (dextramers).
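As a rough illustration of why the feature type column is not enough on its own, here is a minimal Python sketch that counts rows per feature type. It uses a handful of rows copied from the excerpt above; the function name `count_feature_types` is made up for this example and is not part of any AIRR or 10X tooling.

```python
import csv
import io
from collections import Counter

# Rows copied from the features.tsv excerpt above; real files are tab-separated.
FEATURES_TSV = "\n".join([
    "C0258\tFB_hash8\tAntibody Capture",
    "C0531\tFB_dex31\tAntibody Capture",
    "C0063\tFB_CD45RA\tAntibody Capture",
    "ENSG00000243485\tMIR1302-2HG\tGene Expression",
    "ENSG00000186092\tOR4F5\tGene Expression",
])


def count_feature_types(tsv_text):
    """Count rows per feature type (third column of features.tsv)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[2] for row in reader if len(row) >= 3)


counts = count_feature_types(FEATURES_TSV)
print(dict(counts))
```

The hashtag, dextramer, and phenotype barcodes all land in the same "Antibody Capture" bucket, so the third column by itself does not separate the three uses.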

The Gene Expression features are processed correctly, resulting in:

$ curl -d '{"size":1}' https://repository-staging.ireceptor.org/airr/v1/expression
{"Info":{
[Stuff Deleted]
}, "CellExpression":[
{
    "expression_id": "6494c7c178fea0c15161aacb",
    "cell_id": "648ced310556ffe55e55beef",
    "repertoire_id": "PRJNA744851-B3_VAX2_INF_CELL",
    "data_processing_id": "PRJNA744851-B3_VAX2_INF",
    "property": {
        "label": "PRIM1",
        "id": "ENSG:ENSG00000198056"
    },
    "value": 1,
    "adc_annotation_cell_id": "AAAGTAGGTCTGCAAT-5",
    "ir_annotation_set_metadata_id_expression": "648cd4655f86d976c84729bd",
    "sample_processing_id": "PRJNA744851-B3_VAX2_INF_CELL",
    "ir_created_at_expression": "2023-06-22T22:14:24.640405+00:00",
    "ir_updated_at_expression": "2023-06-22T22:14:24.640405+00:00"
}]}

Whereas a feature barcode looks like this currently:

$ curl -d '{"filters":{"op":"=","content":{"field":"property.id","value":"C0063"}}}' https://repository-staging.ireceptor.org/airr/v1/expression
[Stuff Deleted]
{
    "cell_id": "648cefb5af95bc2a945a4792",
    "property": {
        "label": "FB_CD45RA",
        "id": "C0063"
    },
    "value": 32,
    "ir_annotation_set_metadata_id_expression": "648cd4715f86d976c84729e8",
    "adc_annotation_cell_id": "AGTTGGTGTCCTCCAT-5",
    "repertoire_id": "PRJNA744851-R12_INF_VAX2_CELL",
    "data_processing_id": "PRJNA744851-R12_INF_VAX2",
    "sample_processing_id": "PRJNA744851-R12_INF_VAX2_CELL",
    "ir_created_at_expression": "2023-06-23T00:02:47.998733+00:00",
    "ir_updated_at_expression": "2023-06-23T00:02:47.998733+00:00",
    "expression_id": "6494e129a9a6417ffa684350"
}

The problem is that there is no way to tell whether a given CellExpression property is a "Gene Expression" property or an "Antibody Capture" property that is being used to determine Cell Phenotype, Cell Specificity, or some other feature...

We also do not yet have a mapping to the ABREG registry for these antibodies (so there are no CURIEs in property.id yet), but we think that is relatively easy to add.
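For anyone who wants to reproduce the lookup by property.id, the request body from the curl command above can be built programmatically. This is only a sketch of that one filter shape; the helper name `equals_filter` is hypothetical, and nothing here contacts the repository.

```python
import json


def equals_filter(field, value):
    """Build a request body with a single '=' filter, as in the curl command above."""
    return {"filters": {"op": "=", "content": {"field": field, "value": value}}}


# Same field and value as the feature barcode query shown earlier.
body = json.dumps(equals_filter("property.id", "C0063"))
print(body)
```

The resulting string can be passed to `curl -d` in place of the hand-written JSON.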

bcorrie commented 1 year ago

We are of course not sure what exactly makes sense in the enum for property_type, so we are open to suggestions. The four cases currently listed reflect the three uses of feature barcoding that we have in this study, plus gene expression. I am sure there are others.

bcorrie commented 1 year ago

We would suggest having something like:

    "property_type": "gene_expression",
    "property": {
        "label": "PRIM1",
        "id": "ENSG:ENSG00000198056"
    },

and

    "property_type": "surface_protein_expression",
    "property": {
        "label": "FB_CD45RA",
        "id": "C0063"
    },
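
To make the idea concrete, here is a hedged Python sketch of how a loader might assign property_type from a features.tsv row. The four values are the ones discussed in this thread. The `FB_hash` and `FB_dex` label prefixes are an assumption taken from this study's naming only, and both function names are hypothetical.

```python
# Hypothetical mapping from a features.tsv row to a property_type value.
# The FB_hash / FB_dex label prefixes are an assumption based on this
# study's naming; other studies may label feature barcodes differently.

def guess_property_type(label, feature_type):
    if feature_type == "Gene Expression":
        return "gene_expression"
    if feature_type == "Antibody Capture":
        if label.startswith("FB_hash"):
            return "hashtag_expression"
        if label.startswith("FB_dex"):
            return "dextramer_expression"
        return "surface_protein_expression"
    return None  # unknown feature type: leave property_type unset


def tag_record(record, feature_type):
    """Return a copy of a CellExpression-like dict with property_type added."""
    tagged = dict(record)
    label = record["property"]["label"]
    tagged["property_type"] = guess_property_type(label, feature_type)
    return tagged


example = tag_record(
    {"property": {"label": "FB_CD45RA", "id": "C0063"}, "value": 32},
    "Antibody Capture",
)
print(example["property_type"])
```

A prefix heuristic like this is fragile, which is part of why an explicit field in the record is preferable to inferring the type downstream.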

bcorrie commented 1 year ago

@bussec you are the obvious one to ping on this, but other input is of course welcome.

bcorrie commented 1 year ago

Suggested change is here: https://github.com/airr-community/airr-standards/blob/643321a72cf0af0f4d198795f80ebd5688e11234/specs/airr-schema.yaml#L4366

bcorrie commented 1 year ago

Not sure if we need to update expression_study_method as well:

https://github.com/airr-community/airr-standards/blob/a3ce1ca2e8b38301514707ee48619a59d4b741aa/specs/airr-schema.yaml#L4273C28-L4273C28

bussec commented 12 months ago

@bcorrie I agree with having the field in general, but I have some issues with some of the currently proposed values:

gene_expression and surface_protein_expression are fine, but hashtag_expression is a subtype of surface_protein_expression (using another detection technology) and dextramer_expression would be even more specific than this (in addition, "Dextramer" is a trademark, so we should avoid using it).

So we either introduce a property_detection_method field or change the values to something like fluorescense_based_protein_expression, dna_tag_based_protein_expression, etc.

scharch commented 12 months ago

I think a property_detection_method field makes sense. In theory ICS or FISH are gene_expressions that are measured by fluorescence... I am ok with hashtag_expression being a separate entry in the enum despite it technically being a subset of surface_protein_expression. Ditto for dextramer_expression, but to avoid copyright issues, maybe we could combine dextramer barcoding and variants of LIBRASeq as something like antigen_specific_receptor_expression? (last edited per call)

javh commented 12 months ago

I think it will be nigh-impossible to enumerate single-cell modalities. The field is evolving pretty rapidly. I think we'd need an other if we want to go the enum route.

javh commented 12 months ago

From call:

bcorrie commented 11 months ago

I now have:

        property_detection_method:
            type: string
            description: >
                Keyword describing the detection method used to measure the property value. The following keywords
                are recommended if considered appropriate but custom methods can be specified: "gene_expression",
                "surface_protein_expression", "antigen_specific_receptor_expression", "hashtag_expression"
            x-airr:
                miairr: defined
                nullable: true
                adc-api-optional: true
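
A small Python sketch of how a consumer might treat this field: any string is accepted, null is allowed, and the four keywords are only flagged as recommended. The function name is hypothetical, rejecting empty strings is my assumption rather than something the draft states, and the hashtag keyword is spelled `hashtag_expression` here.

```python
# Recommended keywords from the draft description; custom values are allowed.
RECOMMENDED_METHODS = {
    "gene_expression",
    "surface_protein_expression",
    "antigen_specific_receptor_expression",
    "hashtag_expression",
}


def check_detection_method(value):
    """Return (is_valid, is_recommended) for a property_detection_method value."""
    if value is None:  # the draft marks the field as nullable
        return True, False
    if not isinstance(value, str) or not value.strip():
        # Assumption: an empty or non-string value is treated as invalid.
        return False, False
    return True, value in RECOMMENDED_METHODS


print(check_detection_method("gene_expression"))
print(check_detection_method("my_custom_method"))
```

This keeps the field open-ended, as suggested earlier in the thread, while still letting tools warn about values outside the recommended set.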

bcorrie commented 11 months ago

@javh @scharch @bussec @kira-neller does this cover it?

bussec commented 11 months ago

@bcorrie I think we decided to have property_type and property_detection_method, but currently we only have the latter (although its description still sounds more like property_type). Am I wrong about this?

bcorrie commented 11 months ago

@bussec I could not remember, and reading the notes above from our meeting did not make it clear to me. My interpretation was that we wanted the field but liked neither the field name nor the values. So I changed the field name to property_detection_method and added the "keywords" from the issue above to the string field's description.

What do the different fields (property_type and property_detection_method) represent? I think I need someone to provide some clarity. I can change the field name back to property_type, but then I don't know what property_detection_method is. I need some guidance 8-) If someone can give some specifics, I will add them to the spec.

bussec commented 11 months ago

@bcorrie

Happy to help with the terms, I just wasn't sure anymore whether we agreed on one or two keys.

bcorrie commented 11 months ago

@javh @scharch can you add your thoughts/recollections?

scharch commented 11 months ago

what @bussec said

javh commented 11 months ago

I'm not sure we need two fields. The method seems like something that would be captured earlier in cell/sample processing.

javh commented 11 months ago

From the call:

bcorrie commented 11 months ago

Closing this pull request without merge - new branch with pull request: #719