Integrating Germline and MHC into AIRR data model

bcorrie commented 4 years ago

Trying to capture an email discussion that grew out of the Common Repository Working Group and has been on the back burner with all of us for some time... How do we annotate AIRR-seq data sets with Germline and MHC information?

This is somewhat related to #157 and #258 but I am starting a new thread/conversation specifically for this...

bcorrie commented 4 years ago

My initial thoughts on this, from the email thread...

It makes sense to me to treat these things similarly to how we treat ontologies... That is, there is an ID for an entity (say a specific germline gene) and a provider for the definition of that entity (say OGRDB and/or IMGT). We would have:

    curie_prefix: GERMLINE
    iri_prefix:
      - "http://ogrdb.org/germline/"
      - "http://imgt.org/germline/"

This assumes that something like http://ogrdb.org/germline/ and http://imgt.org/germline/ are ontology providers with lookup URLs for the definition of a germline gene ID.
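
As a rough illustration of the lookup this implies, the sketch below expands a CURIE into one candidate IRI per provider. The function name `expand_curie` and the `PROVIDERS` table are hypothetical, and the two URLs are the placeholder prefixes from above, not live endpoints.

```python
# Hypothetical prefix table mirroring the curie_prefix/iri_prefix example above.
# The URLs are placeholders from the discussion, not known-working services.
PROVIDERS = {
    "GERMLINE": [
        "http://ogrdb.org/germline/",
        "http://imgt.org/germline/",
    ],
}


def expand_curie(curie):
    """Return one candidate IRI per provider for a CURIE such as 'GERMLINE:X07448'."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in PROVIDERS:
        raise KeyError(f"unknown CURIE prefix: {prefix!r}")
    # Same local ID, different provider: this is what lets providers be swapped.
    return [iri_prefix + local_id for iri_prefix in PROVIDERS[prefix]]


for iri in expand_curie("GERMLINE:X07448"):
    print(iri)
```

Because the local ID is appended unchanged to each prefix, this only works if every provider agrees on what the ID means.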

Then if we are associating a set of germline genes with a subject we would have something like the following in the Subject object of our schema:

        subject_germline_list:
            type: array
            description: List of germline genes for this subject
            items:
                $ref: '#/Ontology'
                description: Germline gene
                title: Germline gene
                example:
                    id: GERMLINE:XXXX
                    label: GENELABELFORXXXX

This would result in each subject having an array of fields as follows:

{subject_germline_list:
[
    {id:"GERMLINE:XXXX",label:"GENELABELFORXXXX"}, 
    {id:"GERMLINE:YYYY",label:"GENELABELFORYYYY"}, 
    {id:"GERMLINE:ZZZZ",label:"GENELABELFORZZZZ"},
    ...]
}

Using the AIRR CURIE, one could look up the definition of any of these objects by resolving it to http://ogrdb.org/germline/XXXX, where XXXX is the ontology ID for a specific germline gene. This would return the relevant definition for XXXX, including the human-readable label for that gene (GENELABELFORXXXX). Presumably this would return data in a standard format (as per the germline set spec), perhaps:

  - description_id: airr1-1
    organism: Homo sapiens
    sequence_name: X07448
    alt_names:
      - IGHV1-2*01
    locus: IGH
    domain: V
    functional: true
    gene_subgroup: "1"
    subgroup_designation: "2"
    allele_designation: "01"

Technically, this is not hard to add to the spec... As per usual with Germline discussions, the problem is agreeing what XXXX and GENELABELFORXXXX are.

Are we using GERMLINE:airr1-1, GERMLINE:X07448, GERMLINE:IGHV1-2*01?

As a logical first guess, assuming sequence_name above is unique, we might have something like the following in an AIRR Repertoire:

{subject:{subject_id:"Subject 01", subject_germline_list:[{id:"GERMLINE:X07448", label:"IGHV1-2*01"},{id:"GERMLINE:HM855674", label:"IGHV1-2*05"}, ...]}}

Of course we would need an ontology provider that would be able to resolve a CURIE like "GERMLINE:X07448" to a URL like http://ogrdb.org/germline/X07448 and respond with the definition above or something similar....
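
To make the "first guess" concrete, here is a minimal sketch that derives the `subject_germline_list` entries from germline records of the shape shown earlier. It assumes `sequence_name` is used as the ID and the first entry of `alt_names` as the label; the helper names `to_ontology_entry` and `build_subject` are hypothetical, and the records carry only the two fields the mapping needs.

```python
import json

# Records shaped like the germline description above, trimmed to the fields used here.
# The IDs and labels are the ones from the Repertoire example.
records = [
    {"sequence_name": "X07448", "alt_names": ["IGHV1-2*01"]},
    {"sequence_name": "HM855674", "alt_names": ["IGHV1-2*05"]},
]


def to_ontology_entry(record, prefix="GERMLINE"):
    """Map one germline record to an {id, label} pair (assumes sequence_name is the ID)."""
    return {
        "id": f"{prefix}:{record['sequence_name']}",
        "label": record["alt_names"][0],
    }


def build_subject(subject_id, germline_records):
    """Wrap the mapped entries in a Subject-like structure."""
    return {
        "subject": {
            "subject_id": subject_id,
            "subject_germline_list": [to_ontology_entry(r) for r in germline_records],
        }
    }


repertoire = build_subject("Subject 01", records)
print(json.dumps(repertoire, indent=2))
```

Choosing a different ID (for example the description ID) would only change `to_ontology_entry`; the rest of the structure stays the same.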

williamdlees commented 4 years ago

Excuse me if I miss out some of the earlier discussion.

I do not think this will be useful unless people run a genotyping tool on their sequence annotations, e.g. Tigger, IgDiscover, or partis (which has it built in). But doing so will greatly improve the annotation quality anyway.

For OGRDB, this is how we document gene usage today:

We defined a standardised format for a genotype and provided a tool that will create that format from the commonly used inference tools. That’s detailed here: https://github.com/airr-community/ogrdbstats

Here’s a sample file in that format: https://github.com/airr-community/ogrdbstats/blob/master/example_ogrdbstats_genotype.csv

Here’s what a genotype looks like in OGRDB (the genotype is the final table on the page) https://ogrdb.airr-community.org/genotype/14

OGRDB will return information on genes, given the IMGT name. You can find the API definition and a test harness here: https://ogrdb.airr-community.org/api/

It could take years for an ontology to be defined, but you could implement the above today, and move to ontology when it's available.

Best wishes

William

bcorrie commented 4 years ago

It could take years for an ontology to be defined, but you could implement the above today, and move to ontology when it's available.

Sorry, my language was much too strong - I agree with you, we don't want to wait for an ontology, but we should apply some rigor to how we describe the terms in the standard so that we have a mechanism to look things up (e.g. in OGRDB or similar).

The AIRR Ontology approach (from the Ontovoc subgroup) provides the mechanism - there is an information provider (e.g. OGRDB) and an ID (e.g. a gene name) - and that is what I meant to suggest we try to do... What we want (IMHO) is to be able to swap out information providers and still use the same ID and get a definition for that ID.

I agree we don't need a formal ontology to get started...

bussec commented 4 years ago

My thoughts on this:

I agree that we should not wait for a proper ontology, but we can borrow a lot from what we have learned with ontologies, and I am quite convinced that we need to do this. In the end, the biology of Ig/TCR loci has a certain complexity, and we need a representation that is able to capture it.
Do we think that this is only useful when it is used with a subject-specific germline reference set? IMO the answer to this is "no", i.e., it is also useful for any other annotation but in this case the reference needs to move from the subject level to the rearrangement level.
Regarding @bcorrie's question "[...] the problem is agreeing what XXXX and GENELABELFORXXXX are. Are we using GERMLINE:airr1-1, GERMLINE:X07448, GERMLINE:IGHV1-2*01?", we need to distinguish between: a. the abstract concept of a certain IGHV gene, which b. has the local ID airr1-1 and c. has the label (=gene symbol) IGHV1-2*01 and d. could have a PID (PURL, IRI, DOI, HANDLE, whatever flavor you like). We need to distinguish this from e. observations (=instances) of this concept, e.g., the sequence with the GenBank ID X07448. To simplify things a bit, we can assume for now that the local ID is actually a global ID (but not a full-fledged PID), so that it always refers to the same concept in different repositories and the CURIEs work even without proper PIDs. Long story short, IMO we should use GERMLINE:airr1-1.
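
The distinction between points a. to e. can be sketched as a small data structure. This is only an illustration under the assumption stated above (the local ID acts as a global ID); the class name `GermlineConcept` and its field names are hypothetical and not part of any AIRR schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GermlineConcept:
    """(a) The abstract concept of one germline gene/allele."""

    local_id: str                  # (b) local ID, assumed to be globally unique for now
    label: str                     # (c) gene symbol; symbols have changed in the past
    pid: Optional[str] = None      # (d) persistent identifier, if one is ever assigned
    observations: List[str] = field(default_factory=list)  # (e) e.g. GenBank accessions

    def curie(self, prefix: str = "GERMLINE") -> str:
        """Build the CURIE from the local ID, not from the label or an observation."""
        return f"{prefix}:{self.local_id}"


ighv1_2_01 = GermlineConcept(
    local_id="airr1-1",
    label="IGHV1-2*01",
    observations=["X07448"],
)
print(ighv1_2_01.curie())
```

Keeping the label and the observations as attributes of the concept means either can change or grow without invalidating references that use the CURIE.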

williamdlees commented 4 years ago

Sorry Christian, would you mind expanding on '...the reference needs to move from the subject level to the rearrangement level'? Thanks.

bussec commented 4 years ago

Sorry Christian, would you mind expanding on '...the reference needs to move from the subject level to the rearrangement level'?

Sure, sorry for being fuzzy: If we want to annotate the comprehensive germline gene set of an individual, then this should be done in the Subject object, as it should not change between observations.

However, if we do not have a subject-specific germline gene set (i.e. we are annotating against the default database), then the linkage to the records of the database has to happen on the level of the Rearrangement. There are some other fringe cases (chimeric/transgenic animals) where this approach would be required.

If having the references on both levels was the idea the whole time and I just didn't realize it, then you can ignore this comment.

williamdlees commented 4 years ago

OK, thanks, I understand. I'd say that, at the subject level, the information provided is an inferred genotype (or subject-specific germline set, if you prefer). At the rearrangement level, it's expressing ambiguity in the annotation (there will be fewer cases of ambiguity if the subject-specific germline set is available, but there will still be some).

A further point to consider, in the case of subject-specific germline sets, is how to handle novel inferences, i.e. those sequences assigned to an inferred gene that is not part of the reference set at the time the inference is made. This covers both how these should be labelled at the time of annotation, and whether any update or review is required should the reference set be extended subsequently. I'd be happy to demonstrate how this is addressed in OGRDB today, if that's of interest.

bcorrie commented 4 years ago

OK, thanks, I understand. I'd say that, at the subject level, the information provided is an inferred genotype (or subject-specific germline set, if you prefer). At the rearrangement level, it's expressing ambiguity in the annotation (there will be fewer cases of ambiguity if the subject-specific germline set is available, but there will still be some).

This doesn't need to be annotated at the rearrangement level does it? That is, there isn't a different germline set that is used for each rearrangement is there? Isn't this a feature of the DataProcessing object? That is, a specific annotation tool uses a specific germline set to produce a set of rearrangement annotations.

I can see three levels where one would have a germline set:

At the subject level, either inferred genotype or sequenced
At the sample level, inferred genotype from the annotations for that sample using an inference tool
At the data processing level, denoting the germline set used for that annotation, assuming that if one used either the subject or sample level genotype the annotation would be "higher quality" than if one used a general germline database.

Note we have a field for this, called the germline_database but it is very loosely defined and is a simple text field. I think one of the key things we are exploring is how to specify that more clearly, in particular if one has a more specific germline than "IMGT" or "MiXCR v2.0.1 default"... 8-)

bussec commented 4 years ago

@bcorrie

This doesn't need to be annotated at the rearrangement level does it? That is, there isn't a different germline set that is used for each rearrangement is there?

No, there should not be different germline sets for a Subject as... well... it's in the genes :wink: Therefore I don't think that we need an additional reference on the Sample level, as this should always be the same.

However, if the Subject-specific genotype record is not available, users will search against a reference database (annotated in DataProcessing) that likely will contain more than two alleles for some of the genes and might change over time (in contrast to the Subject record, which I assume is stable). In this situation you do not only want to know what DB you blasted against, but also the result for each Rearrangement, which requires stable identifiers and unfortunately gene symbols haven't been stable in the past. Therefore we need a link to a stable record of the reference gene.

bcorrie commented 4 years ago

No, there should not be different germline sets for a Subject as... well... it's in the genes 😉 Therefore I don't think that we need an additional reference on the Sample level, as this should always be the same.

The reason I suggested that this might be useful is that it is my understanding that tools that infer germlines are often applied at the repertoire level, with a genotype and haplotype often attached to a repertoire/sample. I believe that is what VDJbase stores (https://www.vdjbase.org/data/Samples). So is it wrong to store this at the sample/repertoire level if that is an analysis that has been performed for that repertoire? Does the AIRR Community take the "high road" and not allow genotype to be stored for a repertoire (as an analysis output), but instead ensure that an inferred genotype is linked with the subject only?

williamdlees commented 4 years ago

As it’s an inference, my vote would be to store it with the repertoire that it was inferred from, and this is indeed what we do in VDJbase. It wouldn’t, perhaps, be too surprising to find some variation in the inferences for some of the less frequently expressed genes.

For reproducibility, it would be good to also store the reference set that was used to do the initial annotation before the genotype was inferred.

bcorrie commented 4 years ago

If I was to envision such a process, I would likely store the original germline used for annotation (e.g. IMGT) in the DataProcessing object (e.g. data_processing.germline_database = "IMGT 2020-06-23") for a given SampleProcessing. In this object I would likely also record that I used Tigger to infer a germline for the repertoire (e.g. data_processing.software_versions = "..., Tigger/x.y"). It probably makes sense that the inferred germline gets stored in the DataProcessing object from which it was inferred (e.g. data_processing.inferred_germline = "XXX").

If I was then to rerun an annotation tool for that SampleProcessing using the inferred germline, I would create a second DataProcessing object for the repertoire and record the germline for annotation as the inferred germline (e.g. data_processing.germline_database = "XXX"). One might want to store the inferred germline for this data processing as well (e.g. data_processing.inferred_germline = "XXX")? In this way you could tell that the germline used for annotation was the inferred germline.

If one was to then infer the germline for a subject, based on one or more inferred germlines from multiple repertoires from the subject, it would make sense to have an inferred germline at the subject level (e.g. subject.germline = "XXX + YYY", where YYY might be added information for the subject germline that was found in another repertoire from that subject???). In this case, since it is possible for a subject's germline to be inferred or not inferred, we might need to track whether the subject's germline was inferred or not (e.g. subject.germline_process = ["inferred"|"sequenced"|???]).

In this model, we really only need three new fields:

data_processing.inferred_germline
sample.germline
sample.germline_process

We of course also need a mechanism to store a germline XXX 8-)

Does that make sense as a starting point? Can you poke holes in my thinking??? 8-)
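
A minimal sketch of the two-pass idea, using plain dictionaries in place of DataProcessing objects. The field names `germline_database`, `software_versions` and `inferred_germline` are the ones used in the text above; the helper `annotated_with_inferred_germline` is hypothetical, and "XXX" remains a placeholder for however a germline set ends up being stored.

```python
# First pass: annotate against a general reference, then infer a germline from the result.
first_pass = {
    "germline_database": "IMGT 2020-06-23",
    "software_versions": "..., Tigger/x.y",
    "inferred_germline": "XXX",
}

# Second pass: re-annotate the same repertoire using the inferred germline.
second_pass = {
    "germline_database": "XXX",
    "inferred_germline": "XXX",
}


def annotated_with_inferred_germline(data_processing):
    """True when the germline used for annotation is the germline that was inferred."""
    inferred = data_processing.get("inferred_germline")
    return inferred is not None and data_processing.get("germline_database") == inferred


print(annotated_with_inferred_germline(first_pass))
print(annotated_with_inferred_germline(second_pass))
```

The comparison only works if both fields hold the same kind of reference to the germline set, which is one more reason the free-text `germline_database` field would need a tighter definition.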

williamdlees commented 4 years ago

Hi Brian,

I think it's a good approach, but here are a couple of things to consider:

It's certainly useful to store the date on which the reference set was downloaded from IMGT, and many people do, but I am not sure that there is a way of retrieving the set as it was on that day. For full information, it would be better to store the set itself (which is not that large, for a particular species and locus). This may get more important in the future, if people start to use sets published from various sources - until we get a good, referenceable solution.

For the source of the genotype or haplotype - given the different methods available, - it might be better to record the sequencing method rather than 'inferred' or 'genomic' as there can be inference involved in the 'genomic' methods. Examples could be:

Repertoire sequencing
Long read next-generation genomic sequencing
Short read next-generation genomic sequencing
Sanger sequencing of genomic material

bussec commented 4 years ago

For full information, it would be better to store the set itself

Fully agree, just that this potentially infringes IMGT IP (as per Art. 7 (2) a 96/9/EC) under their current license. So the three possibilities would be:

IMGT starts providing referenceable versions of their database
IMGT licenses under CC-BY (i.e., without NC-ND restrictions)
Use a third-party not-for-profit service to perform regular snapshots of IMGT and create references to these snapshots (something like perma.cc)

For the source of the genotype or haplotype - given the different methods available, - it might be better to record the sequencing method rather than 'inferred' or 'genomic' as there can be inference involved in the 'genomic' methods.

Also agreed. I am wondering whether we a) could reuse some of the existing vocabulary for sequencing methods and b) should actually annotate both, sequencing and data processing.

williamdlees commented 4 years ago

It would not take very long to set up a github repo that snapshotted the IMGT reference set regularly, and baselined copies on each change. I think this would be allowed under their current CC-BY-NC-ND licence, provided similar terms were applied to the result.
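
The "baseline on each change" step could look roughly like the sketch below. It does not download anything and says nothing about the licensing question; it only shows how a content hash can decide whether a newly fetched copy needs a new baseline. The function name `snapshot_if_changed` and the record layout are hypothetical.

```python
import hashlib


def snapshot_if_changed(baselines, content, date):
    """Append a baseline entry only when the content differs from the latest baseline.

    baselines: list of {"date": ..., "sha256": ...} dicts, oldest first (modified in place).
    content:   the reference set as text, e.g. the FASTA file contents.
    date:      the date of this check, as a string.
    Returns True if a new baseline was recorded, False if nothing changed.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if baselines and baselines[-1]["sha256"] == digest:
        return False
    baselines.append({"date": date, "sha256": digest})
    return True


baselines = []
print(snapshot_if_changed(baselines, ">seq1\nACGT\n", "2020-06-23"))
print(snapshot_if_changed(baselines, ">seq1\nACGT\n", "2020-06-30"))
print(snapshot_if_changed(baselines, ">seq1\nACGA\n", "2020-07-07"))
```

A date plus digest pair would also give the DataProcessing record something more verifiable to point at than a download date alone.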

bcorrie commented 4 years ago

For the source of the genotype or haplotype - given the different methods available, - it might be better to record the sequencing method rather than 'inferred' or 'genomic' as there can be inference involved in the 'genomic' methods. Examples could be:

Repertoire sequencing

Long read next-generation genomic sequencing

Short read next-generation genomic sequencing

Sanger sequencing of genomic material

@williamdlees do you have any suggestions for the controlled vocabulary for this? We currently have the following:

            enum:
                - sequenced
                - inferred

williamdlees commented 4 years ago

If we really want to boil it down we could use

derived from:

genomic sequencing
repertoire sequencing

bcorrie commented 4 years ago

Changed this here: https://github.com/airr-community/airr-standards/pull/438/commits/61a232a3d0746b1eef77486cd235997da76279e9

bcorrie commented 4 years ago

All, there is some active discussion around this in the pull request here: #438

If you are interested you might want to review/comment.

airr-community / airr-standards

Integrating Germline and MHC into AIRR data model #416