GTB-tbsequencing / mutation-catalogue-2023

MIT License

Synonyms in genomic coordinates sheet #9

Closed HillJamie closed 2 months ago

HillJamie commented 2 months ago

In the Genomic_coordinates sheet of this file https://github.com/GTB-tbsequencing/mutation-catalogue-2023/blob/main/Final%20Result%20Files/WHO-UCN-TB-2023.6-eng.xlsx there are some variant names in the first column that are missing from the Catalogue_master_file sheet.

In all cases when this occurs, there is another variant with the same position, reference_nucleotide, and alternative_nucleotide that is present in the Catalogue_master_file sheet.

An example is found on rows 29 and 30:

| dnaA_p.Thr10Ala | NC_000962.3 | 28 | ACCACA | GCGACG |
| -- | -- | -- | -- | -- |
| dnaA_c.33A>G | NC_000962.3 | 28 | ACCACA | GCGACG |

In this case, the variant dnaA_p.Thr10Ala is missing from the Catalogue_master_file sheet. How should these extra variants be treated? Are they synonyms or is there a deeper meaning?
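The observation above can be checked mechanically. Below is a minimal sketch, assuming the two sheets have already been loaded into plain Python structures. The tuple layout, the variable names, and the function `find_ungraded_with_graded_sibling` are hypothetical; only the two example rows come from the spreadsheet.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the Genomic_coordinates sheet.
# Tuple layout (assumed): variant, reference sequence, position,
# reference_nucleotide, alternative_nucleotide.
genomic_coordinates = [
    ("dnaA_p.Thr10Ala", "NC_000962.3", 28, "ACCACA", "GCGACG"),
    ("dnaA_c.33A>G", "NC_000962.3", 28, "ACCACA", "GCGACG"),
]

# Hypothetical stand-in for the variant names found in Catalogue_master_file.
catalogue_master_variants = {"dnaA_c.33A>G"}


def find_ungraded_with_graded_sibling(rows, graded):
    """Map each variant missing from the master sheet to the graded
    variants that share its position, reference and alternative."""
    by_coordinate = defaultdict(list)
    for variant, _seq, pos, ref, alt in rows:
        by_coordinate[(pos, ref, alt)].append(variant)

    result = {}
    for variant, _seq, pos, ref, alt in rows:
        if variant in graded:
            continue
        result[variant] = [v for v in by_coordinate[(pos, ref, alt)] if v in graded]
    return result


missing = find_ungraded_with_graded_sibling(genomic_coordinates, catalogue_master_variants)
print(missing)
```

An empty list for any key would mean a missing variant has no graded counterpart at the same coordinate, which would contradict the pattern described above.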

Best wishes, Jamie

sachalau commented 2 months ago

Hi Jamie,

Thank you for taking the time to raise this issue.

Indeed, there are variants in the Genomic_coordinates sheet that do not appear in the Catalogue_master_file sheet, meaning we haven't been able to grade their association with any drug.

To address your question first: no, these variants are not necessarily synonymous (the one you highlight is a missense), and there is no deeper meaning than "we have no information to provide regarding this variant". Should you encounter it, you should report it as-is. There is no grading information to report alongside these variants.

To give a bit more detail: this is intended, and a bit complex to explain in full. It's very briefly described in this presentation (last two slides), but I'll try again.

The reason these variants exist in our database is that we have seen each of them at least once in our isolates, but the particular isolates carrying them were never included in the input for the drug association algorithm. This can happen because the variant call itself was deemed very low quality, or because the whole isolate was filtered out by our QC.

This mainly happens for what I call multiple consecutive nucleotide variants. The one you give as an example is a complex variant with 3 changes over a span of 6 nucleotides. Our variant algorithm has determined that the change at position 28, ACCACA => GCGACG, affects two adjacent codons: codon 10 changes from Thr to Ala, and codon 11 carries a synonymous change at position 33 of the transcript. The first variant, as we said, is ungraded; the second one, however, is graded. It's a synonymous variant and it's class 4 for INH.
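To make the codon-level split concrete, here is a small sketch of how one multi-nucleotide change can yield two separate variant names. It is not the catalogue's actual annotation algorithm. It assumes the change starts on a codon boundary, that genomic position 28 coincides with position 28 of the dnaA coding sequence (consistent with the example), and it uses a hypothetical function `split_mnv_by_codon` with a codon table trimmed to the four codons involved.

```python
# Only the codons needed for this example (standard genetic code).
CODON_TO_AA = {"ACC": "Thr", "GCG": "Ala", "ACA": "Thr", "ACG": "Thr"}


def split_mnv_by_codon(cds_pos, ref, alt):
    """Split a codon-aligned substitution into per-codon effects.

    cds_pos is the 1-based position of the first base in the coding sequence.
    Hypothetical helper: handles only equal-length, codon-aligned changes.
    """
    if len(ref) != len(alt) or (cds_pos - 1) % 3 or len(ref) % 3:
        raise ValueError("sketch only handles codon-aligned, equal-length changes")

    effects = []
    for i in range(0, len(ref), 3):
        ref_codon, alt_codon = ref[i:i + 3], alt[i:i + 3]
        if ref_codon == alt_codon:
            continue
        codon_number = (cds_pos - 1 + i) // 3 + 1
        ref_aa, alt_aa = CODON_TO_AA[ref_codon], CODON_TO_AA[alt_codon]
        if ref_aa != alt_aa:
            # Amino acid changes: name the variant at the protein level.
            effects.append(f"p.{ref_aa}{codon_number}{alt_aa}")
        else:
            # Synonymous codon: name each changed base at the nucleotide level.
            for j in range(3):
                if ref_codon[j] != alt_codon[j]:
                    effects.append(f"c.{cds_pos + i + j}{ref_codon[j]}>{alt_codon[j]}")
    return effects


effects = split_mnv_by_codon(28, "ACCACA", "GCGACG")
print(effects)  # ['p.Thr10Ala', 'c.33A>G']
```

The first codon (ACC to GCG) gives the missense, and the second (ACA to ACG) gives the synonymous change, matching the two names listed for this coordinate.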

The reason we have included both variants for this genomic coordinate is that, in the event that you identify this exact genomic change in a sample, you should not simply report the synonymous variant that is class 4 for INH, but also report the missense p.Thr10Ala, whose association with INH is unknown.
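The reporting rule described above could be sketched as follows. The dictionaries `COORDINATE_TO_VARIANTS` and `GRADINGS` and the function `report_variants` are hypothetical stand-ins for the two sheets; the only real data are the example coordinate and the class 4 INH grading mentioned in this thread.

```python
# Hypothetical stand-in for Genomic_coordinates: coordinate -> variant names.
COORDINATE_TO_VARIANTS = {
    ("NC_000962.3", 28, "ACCACA", "GCGACG"): ["dnaA_p.Thr10Ala", "dnaA_c.33A>G"],
}

# Hypothetical stand-in for Catalogue_master_file: variant -> {drug: class}.
GRADINGS = {
    "dnaA_c.33A>G": {"INH": 4},
}


def report_variants(seq, pos, ref, alt):
    """Return (variant, drug, grade) tuples for every variant at a coordinate.

    Ungraded variants are still reported, with drug and grade set to None.
    """
    report = []
    for variant in COORDINATE_TO_VARIANTS.get((seq, pos, ref, alt), []):
        grades = GRADINGS.get(variant)
        if not grades:
            report.append((variant, None, None))
            continue
        for drug, grade in grades.items():
            report.append((variant, drug, grade))
    return report


rows = report_variants("NC_000962.3", 28, "ACCACA", "GCGACG")
for row in rows:
    print(row)
```

Keeping the ungraded entry in the output, rather than dropping it, is the point: the missense is reported as-is, with no grading attached.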

Best,

HillJamie commented 2 months ago

Thank you for the explanation, I think I understand what is happening now.