airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Annotation of absence of specific MHC alleles #678

Closed ndalchau closed 5 months ago

ndalchau commented 1 year ago

How should one represent information about a subject not expressing a particular MHC allele? In the study published here, there is information about which subjects express HLA-A*02:01 (A2) and which do not.

For the A2-positive subjects, you can specify a single entry in mhc_alleles, like this:

subject:
  genotype:
    mhc_genotype_set:
      mhc_genotype_list:
        # one MHCGenotype entry per MHC class
        - mhc_class: "MHC-I"
          mhc_alleles:
            - { allele_designation: "02:01", gene: { id: "MRO:0000046", label: "HLA-A" } }

But what is recommended for A2-negative subjects? Leaving mhc_alleles or mhc_genotype_list empty is not satisfying, because that doesn't communicate the fact that they do not express A2.
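To make the ambiguity concrete, here is a minimal Python sketch. The helper name `carries_allele` is hypothetical and not part of the AIRR reference library; the dictionary layout simply mirrors the snippet above. With the fields as they stand, a lookup can answer "yes" or "unknown", but never "no":

```python
from typing import Optional


def carries_allele(subject: dict, gene_label: str, designation: str) -> Optional[bool]:
    """Return True if the allele is annotated, None if nothing can be concluded.

    Hypothetical helper for illustration only. It never returns False,
    because an empty or missing list does not record a negative result.
    """
    genotype = subject.get("genotype") or {}
    genotype_set = genotype.get("mhc_genotype_set") or {}
    for mhc_genotype in genotype_set.get("mhc_genotype_list") or []:
        for allele in mhc_genotype.get("mhc_alleles") or []:
            gene = allele.get("gene") or {}
            if (gene.get("label") == gene_label
                    and allele.get("allele_designation") == designation):
                return True
    # No matching annotation: "not tested" and "tested negative" look identical.
    return None


a2_positive = {
    "genotype": {
        "mhc_genotype_set": {
            "mhc_genotype_list": [
                {
                    "mhc_class": "MHC-I",
                    "mhc_alleles": [
                        {
                            "allele_designation": "02:01",
                            "gene": {"id": "MRO:0000046", "label": "HLA-A"},
                        }
                    ],
                }
            ]
        }
    }
}

# An A2-negative subject and a subject that was never typed end up looking alike.
a2_negative = {
    "genotype": {
        "mhc_genotype_set": {
            "mhc_genotype_list": [{"mhc_class": "MHC-I", "mhc_alleles": []}]
        }
    }
}
untyped = {"genotype": {}}
```

Both `a2_negative` and `untyped` yield `None` here, which is exactly the information loss described above.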

Maybe an extended schema could incorporate an additional genotype, such as non_expressed_mhc_genotypes, but I can imagine there would be reluctance to extend the schema for this purpose.

bussec commented 1 year ago

I agree that it would be an advantage if we could annotate this kind of information. Could you clarify whether "non-expressed" describes a situation in which

A. An allele (e.g., HLA-A*02:01) is present in the subject's genome but not transcribed,
B. An allele is absent from the subject's genome, as determined by a sufficiently sensitive and controlled assay,
C. The genomic presence or absence of an allele is unknown, but no transcripts are detected by a sufficiently sensitive and controlled assay, or
D. A combination of the above.

For IG/TR genes we have a mechanism to annotate deleted genes (Genotype.deleted_genes). We can discuss whether we could use a similar mechanism for MHC, but we need to understand the experimental support first.

ndalchau commented 1 year ago

Hi @bussec, thanks for your fast response to my question. By non-expressed, I believe I'm referring to the allele being absent from the subject's genome. This is my interpretation of the paper (https://jitc.bmj.com/content/8/2/e001631.long#DC1). Unfortunately, they don't provide any detail on the assay used to determine the presence/absence of HLA-A2.

I hadn't come across Genotype.deleted_genes before, but now I can see it in the schema docs. While the allele is not "deleted" in this case, perhaps deleted_genes is the most appropriate location to annotate the information within the current schema.

bcorrie commented 1 year ago

@ndalchau unfortunately the deleted_genes field belongs to the Genotype object (IG/TR Genotype), not to the MHCGenotype object. So although we have the use case for IG/TR Genotype, we don't have it for MHC, and therefore you can't use that mechanism for MHCGenotype as the AIRR Spec currently stands. We could add this capability to the spec once we understand the use case better, but there isn't a solution currently, I am afraid.

bcorrie commented 5 months ago

@bussec @scharch can/should we address this for the v2.0 release?

scharch commented 5 months ago

I don't see an easy way to do this in the current schema and I'm not clear on the use case or how often it would arise. Lean to close...

bussec commented 5 months ago

I just changed the subject of this issue to something that I consider to be more appropriate for the proposed use case.

Thinking about the use case: In a world in which HLA typing is done by sequencing, this use case does not exist, as you need to sequence both copies of HLA-A to determine their alleles, so you will always have a positive answer. However, HLA typing can also be performed by staining cells with allele-specific antibodies, and you could think of an experimental setup in which the researcher was only interested in learning whether a given individual expresses (and therefore harbors) at least one HLA-A*02 allele. The question is how to annotate a situation in which this test comes back negative. Given the high frequency (~50%) of this allele among people of European descent, its absence might also be something that is useful to know explicitly (instead of just not annotating anything at all). However, I expect to see more rather than less sequencing-based allele identification in the future, so the amount of data with such narrow HLA typing should decrease.
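As a rough illustration of the difference between the two typing approaches, the following Python sketch derives a three-state answer from a list of HLA-A allele designations. The function name and the convention that both copies are listed explicitly (so a homozygous subject appears as two identical entries) are assumptions made for this example, not something the AIRR schema defines:

```python
from typing import List, Optional


def a2_status(hla_a_designations: List[str]) -> Optional[bool]:
    """Answer 'does the subject carry at least one HLA-A*02 allele?'.

    Hypothetical helper. Assumes each listed designation (e.g. "02:01")
    corresponds to one typed copy of HLA-A.
    """
    if any(d.startswith("02:") for d in hla_a_designations):
        return True   # at least one *02 allele was identified
    if len(hla_a_designations) >= 2:
        return False  # both copies typed, neither is *02: absence is implied
    return None       # incomplete typing: *02 cannot be excluded


sequenced_negative = ["01:01", "03:01"]  # full sequencing-based typing
sequenced_positive = ["02:01", "24:02"]
antibody_negative: List[str] = []        # narrow A2 stain came back negative
```

With full typing the negative answer falls out of the positive annotations, whereas the antibody-only case produces an empty list and the negative result is lost.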

Also, using deleted_genes (if we were to copy it from the Genotype object) would be incorrect, as there is no indication that HLA-A was deleted on either of the two chromosomes; the only information is that neither of the two alleles is *02. Therefore we would need a new mechanism to annotate the information, and I am not a big fan of "negating keywords" (e.g., non_detect_mhc_alleles). That leaves us with the option of a more expressive language and support for predicates other than "is a", which sounds a lot like AIRR Schema 3, so let's close this ;-)