Decomposing of variants - requirements for database matching

HillJamie commented 2 months ago

I have been trying to understand the consequences of the last two slides here https://github.com/GTB-tbsequencing/mutation-catalogue-2023/blob/main/Final%20Result%20Files/MCNV.pdf and what they mean for how a variant caller should/shouldn't decompose variants in order to find the correct matches in the database. I have three questions:

Is it true that, with the exception of deletions and insertions, graded-variants cover at most one codon, and that larger genomic variants are decomposed into per-codon graded-variants? This seems to be the case: for example, this length-7 MCNV is decomposed into variants that each affect at most one codon:

| fgd1_c.510G>C | NC_000962.3 | 491290 | GTGCCCG | ATCCCGA |
| -- | -- | -- | -- | -- |
| fgd1_p.Val170Ile | NC_000962.3 | 491290 | GTGCCCG | ATCCCGA |
| fgd1_p.Val172Ile | NC_000962.3 | 491290 | GTGCCCG | ATCCCGA |
| fgd1_c.513C>G | NC_000962.3 | 491290 | GTGCCCG | ATCCCGA |

If so, then does this mean that a variant caller need not call the same multiple consecutive nucleotide variants as FreeBayes might, but only codon-aligned variants of at most length 3 (plus insertions, deletions, and the occasional combined insertion + deletion) in order to recover all relevant variants in the catalogue?

Similarly, if a variant caller tends to call larger variants than FreeBayes, then should these be decomposed into codon-aligned variants of length 3?

On a more conceptual level... If MCNVs are decomposed into graded-variants that cover at most one codon, what happens if two adjacent amino acid changes are required to confer resistance? As a concrete example, these changes have been seen together, but only dnaA_p.Ile193Val has been graded:

| dnaA_p.Ala194Ser | NC_000962.3 | 577 | ATCGCA | GTCTCG |
| -- | -- | -- | -- | -- |
| dnaA_p.Ile193Val | NC_000962.3 | 577 | ATCGCA | GTCTCG |

Hypothetically, what would happen if IleAla -> ValSer confers resistance, while either substitution in isolation does not?

Finally, I am hoping you could help me understand the phrase "we have included all theoretical single, multiple, constant-length genomic-variants in our coordinate data" from this file: https://github.com/GTB-tbsequencing/mutation-catalogue-2023/blob/main/Final%20Result%20Files/Instruction%20of%20use%20for%20incorporation%20of%20the%20mutations%20catalogue%20version%202%20results%20into%20bioinformatic%20pipeline.pdf. How would a Serine -> Serine synonymous mutation be represented, for example? It is the "single, multiple" part that confuses me.

Thank you for your help, Jamie

sachalau commented 2 months ago

Hi again,

Yes, that is true. The features in our association grading model (i.e. what I usually call graded-variants) can be nucleotide or amino acid change features, but for the amino acid features, the input for a missense change was always a single-codon change.

> If so, then does this mean that a variant caller need not call the same multiple consecutive nucleotide variants as FreeBayes might, but only codon-aligned variants of at most length 3 (plus insertions, deletions, and the occasional combined insertion + deletion) in order to recover all relevant variants in the catalogue?

If you intend solely to recover graded-variants (and I don't see why you should aim for more in the context of the mutation catalogue v2), then yes: as long as your variants are also normalized, this is enough. As mentioned in the documentation, all our freebayes variants have been normalized with bcftools norm.

> Similarly, if a variant caller tends to call larger variants than FreeBayes, then should these be decomposed into codon-aligned variants of length 3?

Yes, if your haplotype length is larger than the one we used for the catalogue, then you will need to decompose the variants into codon-aligned normalized variants. Then the corresponding genomic coordinate will be found in our data.
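The decomposition described above can be sketched roughly as follows. This is a minimal illustration, not the catalogue's own code: the function name `decompose_by_codon` is hypothetical, it assumes a forward-strand gene and a constant-length variant, and the gene start coordinate (490783 for fgd1) is inferred from the table in the question (c.510 falling on position 491292) rather than taken from the catalogue. The trimming of shared flanking bases is a simplified stand-in for normalization of substitutions and has not been checked against `bcftools norm` output.

```python
def decompose_by_codon(pos, ref, alt, gene_start):
    """Split a constant-length variant into per-codon variants, trimmed of shared bases.

    Assumes a forward-strand gene whose first coding base sits at `gene_start`
    (1-based genomic coordinate). Both are assumptions of this sketch.
    """
    if len(ref) != len(alt):
        raise ValueError("only constant-length (substitution) variants are handled")
    pieces = []
    i = 0
    while i < len(ref):
        frame = (pos + i - gene_start) % 3   # 0, 1 or 2: offset inside the codon
        j = min(len(ref), i + 3 - frame)     # end of the current codon (exclusive)
        r, a, start = ref[i:j], alt[i:j], pos + i
        if r != a:
            # Trim identical leading and trailing bases (simplified normalization).
            while r[0] == a[0]:
                r, a, start = r[1:], a[1:], start + 1
            while r[-1] == a[-1]:
                r, a = r[:-1], a[:-1]
            pieces.append((start, r, a))
        i = j
    return pieces


# fgd1 example from the question; gene_start is inferred, not read from the catalogue.
for piece in decompose_by_codon(491290, "GTGCCCG", "ATCCCGA", gene_start=490783):
    print(piece)
```

Each resulting `(position, reference, alternative)` tuple could then be looked up by exact match in the coordinate data. Unchanged codons produce no output, so only codons that actually differ are queried.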

We haven't looked at multiple-variant associations in our algorithm. Our current algorithm is built so that it's actually impossible to do so (because co-occurring variants are masked).

Sorry, it's a bit unclear, but that sentence does not apply to synonymous variants (it's clearer with the full quote):

> For every gene of our candidate gene list, we have included all theoretical single, multiple, constant-length genomic-variants in our coordinate data. Following that, every possible genomic-variant leading to a missense or non-sense graded-variant will appear in the coordinate data, even if we did not observe that genomic-variant in our database of samples

Imagine we have found a missense variant to be associated with resistance to a particular drug. However, in all the samples that were included in the association algorithm, we only found this missense change as the consequence of one particular nucleotide variant (for instance, a single mutation in the codon). This sentence means that in our genomic coordinate file you will find this single nucleotide variant and, in addition, all other theoretical variants (MCNV or not) that lead to the same consequence, even though we have never seen them in any sample in our database. The same applies to stop-gained mutations (anywhere in the gene).
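To make the idea of "all theoretical variants leading to the same consequence" concrete, here is a small sketch that enumerates every single-codon change producing a given amino acid. The function name `theoretical_variants` and the trimmed `(position, reference, alternative)` representation are assumptions made for illustration; the actual layout of the coordinate file is not reproduced here. The codon table is the standard genetic code, whose amino acid assignments are shared by the bacterial code.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def theoretical_variants(codon_start, ref_codon, target_aa):
    """List every within-codon change that turns `ref_codon` into `target_aa`.

    `codon_start` is the genomic position of the first base of the codon
    (forward strand assumed). Use "*" as `target_aa` for a stop codon.
    """
    variants = []
    for alt_codon, aa in CODON_TABLE.items():
        if aa != target_aa or alt_codon == ref_codon:
            continue
        r, a, pos = ref_codon, alt_codon, codon_start
        # Trim shared flanking bases so SNVs are reported as single bases.
        while r[0] == a[0]:
            r, a, pos = r[1:], a[1:], pos + 1
        while r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        variants.append((pos, r, a))
    return sorted(variants)


# dnaA Ile193 (codon ATC starting at position 577) changed to Val.
for variant in theoretical_variants(577, "ATC", "V"):
    print(variant)
```

For dnaA_p.Ile193Val this yields one SNV (A>G at 577) and three length-3 MCNVs, none of them longer than one codon.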

HillJamie commented 2 months ago

Thank you. I think I understand this now.

> in addition all other theoretical variants (MCNV or not) that theoretically lead to the same consequence

Just to check, are these MCNVs only up to 1 codon long? So, for example, there may be some SNVs describing changes in e.g. the 1st base of the codon, and there may also be MCNVs of length 3 describing changes to the 1st and 3rd bases only, but there will not be any MCNV longer than 3 nt.

On a slightly tangential note to 2), but maybe of benefit to anyone else who reads this issue (and I’m hoping I’ll be corrected if wrong 😉 ):

As noted in the final bullet point of https://github.com/GTB-tbsequencing/mutation-catalogue-2023/blob/main/Final%20Result%20Files/Instruction%20of%20use%20for%20incorporation%20of%20the%20mutations%20catalogue%20version%202%20results%20into%20bioinformatic%20pipeline.pdf

> we cannot ensure that all varying-length graded variants are associated with all genomic-variant… Flagging of these variants will require to be implemented in an additional step.

In the case of amino acid insertions, it is clear that the nucleotide sequence in the “alternative_nucleotide” column is one of many that give rise to the same amino acids.

For amino acid MCNVs, it is not so obvious how to flag these variants. But the reply on this issue states that one possible approach is to decompose longer variants into a sequence of variants for each codon before performing exact matching. With respect to the example in question 2, imagine that instead of the sequence in the catalogue, we observe ATCGCA -> GTATCG, which is still IleAla -> ValSer. Then we expect resistance, and can find it by decomposing the variant into ATC -> GTA and GCA -> TCG.
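The claim that the observed haplotype and the catalogue haplotype encode the same amino acid changes can be checked with a short sketch. The `translate` helper is hypothetical and assumes the sequence is in frame, on the forward strand, and a multiple of three bases long, which holds for the dnaA example because position 577 is the first base of codon 193.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate an in-frame nucleotide string into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


reference = "ATCGCA"      # Ile Ala
catalogue_alt = "GTCTCG"  # haplotype listed in the catalogue
observed_alt = "GTATCG"   # hypothetical haplotype from the example above

print(translate(reference), translate(catalogue_alt), translate(observed_alt))
```

Both alternative haplotypes translate to Val-Ser, so matching per codon recovers the same amino acid changes even though the 6-nt strings differ.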

sachalau commented 2 months ago

> Just to check, are these MCNVs only up to 1 codon long?

The MCNVs that we created systematically/programmatically (as opposed to those created through genotyping by freebayes) are indeed at most one codon long. Those created by freebayes can be longer than one codon.

> With respect to the example in question 2, imagine that instead of the sequence in the catalogue, we observe ATCGCA -> GTATCG, which is still IleAla -> ValSer. Then we expect resistance, and can find it by decomposing the variant into ATC -> GTA and GCA -> TCG.

That's correct. Here the decomposition is quite easy, as the two resulting variants do not need normalization. In general, though, normalization should be applied after the decomposition.

HillJamie commented 2 months ago

Thank you again!

GTB-tbsequencing / mutation-catalogue-2023

Decomposing of variants - requirements for database matching #10