airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

what is the expected workflow/use case for subject-specific germline sets? #595

Closed schristley closed 2 years ago

schristley commented 2 years ago

If we get to the point of using subject-specific germline sets, this has the potential to create many germline sets, potentially one for each subject repertoire in the ADC. Is this the desired design?

By subject-specific germline set I mean a workflow like this:

When that repertoire is loaded into the ADC, so too is the custom germline set.

I think this is the most obvious design but there are alternatives. This might also inform the decision about normal/denormal for AlleleDescription in GermlineSet. Presumably in the above, there would be many germline sets but the allele descriptions would be the same record across many of them.

williamdlees commented 2 years ago

Thanks Scott. This is an important question and I'll do my best to address it succinctly.

Leaving tooling workflow aside for the time being, and referring to the germline documentation, a germline set is a curated set. Normally it would exist in a repository, and normally it would represent all the genes and alleles found in a species or a sub-population. A genotype is the subset of genes and alleles found or inferred in an individual - listed with reference to a germline set, but allowing for additions and exclusions. For an analogy, think of a questionnaire in which the answers to each question are multiple choice, but with the option for the responder to provide their own free text answer, or to decline to answer a particular question altogether. If each question is a gene and the multiple choice answers are the alleles, the set of all questions and multiple choice answers is like a germline set. A specific response to the questionnaire is like a genotype. You could envisage storing the complete questionnaire alongside each response, as both are needed to understand the response in full, but it would probably be better to store a single copy of the questionnaire somewhere, and refer to it when storing a response, in case there were multiple versions of the questionnaire.

This is what we do in VDJbase. If you go to a samples page, and click on one of the icons in the Genotype column, you'll see a bunch of reports which detail the genotype and some statistics around V-gene usage. They don't include a germline set. A reference to the germline set with which the samples were annotated is available in the sample metadata (click on the name of a sample in the first column to see this). This is what I would recommend for an ADC - store a reference to the germline set, which can be stored centrally in a location managed by the ADC, or in a trusted third party repository such as OGRDB. Store a genotype with the repertoire, and base statistics on gene usage etc on a repertoire which reflects the genotype.

Turning to tools and how they work - I don't think this is so relevant to the question but as you brought it up it may be worth covering. Tigger is a function library, which can be used in various ways, but the core use case doesn't involve creating a modified germline set. It has some functions to infer a genotype, in its own format (you can see the Tigger format in VDJbase, under that genotype icon). It has another function which will add the genotyped V-calls to a changeo file. This doesn't involve a re-run of the annotation tool. IgDiscover has a rather different use case and typically runs multiple passes of inference and annotation, for which it does use modified germline sets. Modified germline sets may therefore be used by the tooling workflow, but they are not mandated. In VDJbase and OGRDB we use a tool called ogrdbstats which, among other things, derives a genotype in a single format from the output of the four inference tools we know about: Tigger, IgDiscover, partis and Impre. It doesn't derive a tailored germline set.

Why might it be a bad idea to store tailored germline sets with each repertoire in the ADC? Firstly there are space and complexity issues, the latter of which you alluded to in the issue. Beyond that, as I see it, one function of the ADCs and their gateways is to provide usage statistics for particular alleles, clonal information and so on. These things depend for accuracy and consistency on the underlying germline set used for annotation. Since the IARC was established (roughly the same lifetime as the ADCs) we've affirmed about 30 novel human alleles, and I would say that IMGT has approved at least as many again from genomic sources. Today we know of maybe 200 more from sources available now but not yet confirmed and published, across the human IG/TR loci. So even in the human, which is well studied compared to any other organism, the germline set is not stable now, and will not be stable for some time. If I search for a recently affirmed allele in the ADC, how many repertoires will have been sequenced with a germline set that defines that allele? Likewise if I look for clonal families that are based on that allele? I think it's important to know the base germline set on which annotation was conducted - much more important than having access to a personalised set, which, if needed, can always be reconstructed from the base set and the genotype. And important also, I would say, to re-annotate reasonably frequently with an up-to-date set, in order to maintain consistency of analysis across the repertoire collection. OK, this will be computationally expensive - but maybe dedicate say 5% of available capacity to re-annotation, or whatever can be afforded, and see how far it takes things.
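To illustrate the point that a personalised set "can always be reconstructed from the base set and the genotype", here is a minimal sketch. It is not the AIRR schema: the `Genotype` class, its field names, the reference string and the sequences are all placeholders made up for the example, and a real genotype would carry more information per locus.

```python
from dataclasses import dataclass, field


@dataclass
class Genotype:
    """Hypothetical genotype record (illustrative only, not the AIRR schema)."""
    germline_set_ref: str                       # which base set the genotype refers to
    present: set[str]                           # base-set alleles found in the subject
    novel: dict[str, str] = field(default_factory=dict)  # additions: name -> sequence


def personalised_set(base_set: dict[str, str], genotype: Genotype) -> dict[str, str]:
    """Rebuild a subject-specific set from the base set plus the genotype."""
    missing = genotype.present - set(base_set)
    if missing:
        # The genotype names alleles the base set does not define: wrong base set/version.
        raise KeyError(f"alleles not in base set: {sorted(missing)}")
    # Keep only the alleles the subject carries (exclusions are simply not listed) ...
    result = {name: base_set[name] for name in sorted(genotype.present)}
    # ... and append any alleles that exist only in this subject (additions).
    result.update(genotype.novel)
    return result


# Placeholder data: names are illustrative, sequences are not real.
base = {"IGHV1-2*01": "ACGT", "IGHV1-2*02": "ACGA", "IGHV3-7*01": "GGCC"}
subject = Genotype(
    germline_set_ref="example-repo:human-IGH:v1",   # hypothetical reference string
    present={"IGHV1-2*02"},
    novel={"IGHV_NOVEL*01": "GGCA"},
)
tailored = personalised_set(base, subject)
```

Because the tailored set is a pure function of the base set and the genotype, only those two things need to be stored, and the base set is stored once.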

schristley commented 2 years ago

> This is what I would recommend for an ADC - store a reference to the germline set, which can be stored centrally in a location managed by the ADC, or in a trusted third party repository such as OGRDB. Store a genotype with the repertoire, and base statistics on gene usage etc on a repertoire which reflects the genotype.

Thank you for the detailed description William. This makes sense, and corresponds to an alternative I was thinking about. It eliminates issues with creating many custom germline sets, though it introduces different issues. In particular, we need a way to store a genotype with the repertoire and to specify that the germline set was tailored when processing that repertoire. Right now the schema lacks that.

It also might be worthwhile to have a section in the docs, called something like "data processing with germline sets", which would describe some of these best practices, especially as they cannot be directly enforced in the schema (FYI, these are also great things to discuss in the paper; I can imagine a section that describes an "ideal workflow").

Regarding the normal/denormal question, this makes denormal (i.e., AlleleDescription embedded within GermlineSet) more reasonable under the assumption that germline sets infrequently change. Though we still might want normal form (AlleleDescription and GermlineSet as independent top-level objects) for other reasons.
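For concreteness, a rough sketch of the two forms. The field names `germline_set_id`, `allele_descriptions` and `allele_description_id` are assumed here for illustration, and `allele_description_ids` is a made-up name for the normalised reference list; none of this is meant as the schema's definitive layout.

```python
def normalise(germline_sets):
    """Convert denormal germline sets (descriptions embedded) to normal form.

    Returns (alleles, sets): a shared table of allele descriptions keyed by id,
    and germline sets that refer to those descriptions by id only.
    """
    alleles = {}
    sets = []
    for gs in germline_sets:
        ids = []
        for desc in gs["allele_descriptions"]:
            # Store each description once, even if many sets embed a copy of it.
            alleles.setdefault(desc["allele_description_id"], desc)
            ids.append(desc["allele_description_id"])
        sets.append({
            "germline_set_id": gs["germline_set_id"],
            "allele_description_ids": ids,   # hypothetical field name
        })
    return alleles, sets


# Denormal form: two sets that both embed the same description (placeholder data).
a1 = {"allele_description_id": "A1", "label": "allele one"}
a2 = {"allele_description_id": "A2", "label": "allele two"}
denormal = [
    {"germline_set_id": "GS1", "allele_descriptions": [a1, a2]},
    {"germline_set_id": "GS2", "allele_descriptions": [a1]},
]
alleles, normal = normalise(denormal)
```

With few, infrequently changing germline sets the duplication in the denormal form is small; with one set per repertoire it would be repeated for every repertoire.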

schristley commented 2 years ago

> And important also, I would say, to re-annotate reasonably frequently with an up-to-date set, in order to maintain consistency of analysis across the repertoire collection. OK, this will be computationally expensive - but maybe dedicate say 5% of available capacity to re-annotation, or whatever can be afforded, and see how far it takes things.

I agree this would be nice, particularly for the ADC as study data can become "stale" with old germline sets. Actually, the computational expense isn't that bad. I estimate re-annotating all of the studies in VDJServer is roughly 10,000 SUs which is an insignificant blip for a supercomputer center, and it can be almost completely automated. The real challenge is the data size. Each re-annotation creates about 1TB of data (as of today for VDJServer), not bad when talking about files on disk, but significant when loading into a database for query.

williamdlees commented 2 years ago

It's been mentioned before in IARC that we're overdue for a genotyping best practices paper. I think that's probably the best thing to do with best practices, rather than put them in the AIRR docs. We could perhaps have a simple reference model in the docs, if it would make the use of the different objects clearer.

I can see there might be some reasons for storing >1 genotyped repertoire, but it feels marginal to me. An alternative, if one really wanted to store historical analyses, would be to store a history of the germline sets and genotypes derived at different timepoints, without storing the annotated sequences. That would preserve reproducibility. It wouldn't necessarily have to be done as part of the ADC query; it could just be a change log showing when each repertoire was updated, accessible through some means.
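A change log of that kind could be very small. The sketch below is only an illustration under assumed names: `ChangeLogEntry`, its fields and the identifier values are hypothetical and not part of the schema or of any ADC API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChangeLogEntry:
    """One re-annotation event for a repertoire (hypothetical record)."""
    repertoire_id: str
    updated: date
    germline_set_ref: str    # base set used for this annotation
    genotype_set_id: str     # genotype derived at this timepoint


def annotation_as_of(log, repertoire_id, when):
    """Return the entry in force for a repertoire on a given date, or None."""
    candidates = [
        e for e in log
        if e.repertoire_id == repertoire_id and e.updated <= when
    ]
    # The most recent update on or before `when` is the one that applied.
    return max(candidates, key=lambda e: e.updated, default=None)


# Placeholder history for one repertoire.
log = [
    ChangeLogEntry("rep-1", date(2021, 3, 1), "set:v1", "gt-1a"),
    ChangeLogEntry("rep-1", date(2022, 6, 1), "set:v2", "gt-1b"),
]
entry = annotation_as_of(log, "rep-1", date(2022, 1, 15))
```

Only references and genotypes are kept per timepoint, so an older analysis can be regenerated on demand without keeping its annotated sequences.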

schristley commented 2 years ago

> It's been mentioned before in IARC that we're overdue for a genotyping best practices paper. I think that's probably the best thing to do with best practices, rather than put them in the AIRR docs. We could perhaps have a simple reference model in the docs, if it would make the use of the different objects clearer.

Sounds good. Maybe something as simple as updating the Data Model page with a few words about the relationship would be sufficient.

> I can see there might be some reasons for storing >1 genotyped repertoire, but it feels marginal to me. An alternative, if one really wanted to store historical analyses, would be to store a history of the germline sets and genotypes derived at different timepoints, without storing the annotated sequences. That would preserve reproducibility. Wouldn't necessarily have to be done as part of the ADC query, it could just be a change log showing when each repertoire was updated, accessible through some means.

I was thinking it might be as simple as adding a genotype_set_id to DataProcessing. What's the case for storing >1 genotype, is that if you run a different tool?

williamdlees commented 2 years ago

A Genotype only covers a single locus, so the primary reason for the set is to support multiple loci (paired chain, single cell)

schristley commented 2 years ago

> A Genotype only covers a single locus, so the primary reason for the set is to support multiple loci (paired chain, single cell)

Oh, my mistake: I misread that as meaning >1 genotype set. If it's multiple loci, the GenotypeSet should handle that, hence genotype_set_id in DataProcessing rather than just a genotype_id.

bcorrie commented 2 years ago

@schristley should this be a v1.4 issue? It seems like a general discussion, with presumably most issues for v1.4 addressed in other issues and pull requests? Can we remove this from the 1.4 milestone?

schristley commented 2 years ago

> @schristley should this be a v1.4 issue? It seems like a general discussion, with presumably most issues for v1.4 addressed in other issues and pull requests? Can we remove this from the 1.4 milestone?

Yes, v1.4; it's resolved by adding germline_set_ref to DataProcessing with #611.
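As a rough illustration of what a `germline_set_ref` on DataProcessing makes possible, the sketch below answers the earlier question of how many repertoires were annotated with a germline set that defines a given allele. Apart from `germline_set_ref` itself, the record layout, the reference strings and the lookup table of germline sets are assumptions made for the example.

```python
def repertoires_defining(allele, repertoires, germline_sets):
    """Return ids of repertoires annotated with a set that defines `allele`.

    `germline_sets` maps a germline_set_ref to the allele names in that set.
    """
    hits = []
    for rep in repertoires:
        # Collect the germline set references across the processing steps.
        refs = {dp.get("germline_set_ref") for dp in rep.get("data_processing", [])}
        if any(allele in germline_sets.get(ref, set()) for ref in refs):
            hits.append(rep["repertoire_id"])
    return hits


# Placeholder data: two versions of a base set, the newer adding one allele.
germline_sets = {
    "set:v1": {"IGHV1-2*02"},
    "set:v2": {"IGHV1-2*02", "IGHV_NOVEL*01"},
}
repertoires = [
    {"repertoire_id": "rep-1", "data_processing": [{"germline_set_ref": "set:v1"}]},
    {"repertoire_id": "rep-2", "data_processing": [{"germline_set_ref": "set:v2"}]},
    {"repertoire_id": "rep-3", "data_processing": [{}]},  # no reference recorded
]
current = repertoires_defining("IGHV_NOVEL*01", repertoires, germline_sets)
```

Repertoires that are not returned are candidates for re-annotation with the newer set.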