GTB-tbsequencing / mutation-catalogue-2023

MIT License

Not all deletions are left-aligned #8

Open HillJamie opened 2 months ago

HillJamie commented 2 months ago

Thank you for an excellent and well-documented resource.

According to the catalogue

"Before insertion into GenPhenSQL, all variant coordinates were normalized with bcftools norm; variants that spanned several nucleotides were not decomposed into single variants in order to preserve the correctness of the annotation. Large-scale deletions were inserted next to the other genotype calls according to the coordinates of the deletion, as determined by delly."

I interpret this as meaning that large-scale deletions are not normalized, but are reported exactly where Delly reported them. I cannot find details on which version of Delly was used to construct the catalogue, but the Delly-users group states that alignment changed from right-aligned to left-aligned about 7 years ago: https://groups.google.com/g/delly-users/c/FHp4BY73A8Y/m/lK7-y7qJBwAJ

However, the following variant appears to be right-aligned (the deletion can be moved 3nt to the left)

NC_000962.3 4327350 . CCATGGATTCCCGCTTTTCCAGGATGGCGTAGCTCTTGGTCGGGCAACGGTCCTGCAGGTGCCAGGCCGCGCTGACACCGGAGATTCCAGCGCCCACGATGACAACGTCGAGGTGCTCGGTCATGGATCCACGCTATCAACGTAATGTCGAGGCCGTCAACGAGATGTCGACACTATCGACACGTAGTAAGCTGCCAGGGTGACCACCTCCGCGGCCAGTCAGGCTTCGCTGCCTAGGGGCCGGCGCACCGCGCGGCCGTCCGGCGACGATCGTGAACTGGCGATCCTCGCCACCGCCGAGAACCTTCTCGAGGACCGTCCGCTGGCCGATATCTCGGTCGACGATCTGGCCAAGGGCGCCGGTATCTCGAGGCCGACGTTCTACTTCTATTTCCCATCCAAGGAAGCGGTGCTGCTGACCCTGCTGGACCGGGTGGTCAATCAAGCCGACATGGCCCTACAGACCCTTGCCGAGAATCCCGCCGACACCGACCGCGAGAACATGTGGCGCACCGGGATCAACGTGTTCTTCGAGACATTCGGGTCGCACAAGGCGGTAACCCGAGCCGGTCAGGCCGCCAGGGCAACCAGTGTCGAAGTCGCCGAACTGTGGTCGACGTTTATGCAGAAGTGGATCGCCTACACGGCCGCCGTGATCGACGCCGAACGCGACCGAGGCGCGGCGCCGCGCACCCTGCCGGCCCATGAACTGGCCACAGCGCTCAACCTGATGAACGAGCGGACGCTGTTCGCGTCATTCGCCGGCGAACAGCCCTCGGTGCCGGAAGCCCGCGTGCTGGATACGCTGGTGCACATCTGGGTGACCAGCATTTACGGCGAGAACCGCTAAGCCGCACTCGGTCGGGGGTGCTCGGTCGATGCTCAGTGCCAAAGCGGCATGCAGATCTCACGGAGGTCCGGTGGACGATCTGGCAGCC C 100 PASS graded_variant=ethA_p.Met1?

Would it be possible to normalize the deletions in the catalogue, preferably by left-alignment?

Best wishes, Jamie

sachalau commented 2 months ago

Dear Jamie,

Until your report, I had assumed that no variants generated from delly had been included in the genomic_coordinates file and clearly that was wrong, so thank you for it.

You are correct that we did not in fact normalize delly variants. However, we used delly version 0.8.3, which is more recent than the version discussed in your link.

The issue is rather that the exact genomic coordinates for that variant were not directly reported by delly. Here is how delly initially reported that variant:

NC_000962.3     4327350 DEL00001557     C       <DEL>   180     PASS    IMPRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.8.3;END=4328288;PE=3;MAPQ=60;CT=3to5;CIPOS=-351,351;CIEND=-351,351                                                                                                                                       GT:GL:GQ:FT:RCL:RC:RCR:CN:DR:DV:RR:RV    0/1:-2.88675,0,-590.887:29:PASS:289:572:207:2:104:6:0:0

As you can see, it reports the alternative allele as <DEL>. We have internal logic that takes this entry as input and, using the value provided in the INFO/END tag, recomputes the exact genomic coordinates.
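
The conversion described above might look roughly like the following minimal sketch. It is not the catalogue's actual code: it assumes the usual VCF convention for symbolic deletions, where POS is the base immediately before the deleted segment and INFO/END is the last deleted base. The function name and the toy reference string are hypothetical.

```python
from typing import Tuple


def expand_symbolic_deletion(reference: str, pos: int, end: int) -> Tuple[int, str, str]:
    """Turn a symbolic <DEL> (1-based POS and INFO/END) into explicit REF/ALT alleles."""
    if not 1 <= pos < end <= len(reference):
        raise ValueError("expected 1 <= POS < END <= reference length")
    ref_allele = reference[pos - 1:end]  # anchor base plus the deleted bases
    alt_allele = reference[pos - 1]      # anchor base only
    return pos, ref_allele, alt_allele


# Toy reference (not real M. tuberculosis sequence):
# delete bases 8-9, anchored at base 7.
toy_reference = "TTGCACACAGG"
print(expand_symbolic_deletion(toy_reference, 7, 9))  # (7, 'ACA', 'A')
```

Under this convention, a record with POS 4327350 and END 4328288 would yield a REF allele of 939 bases (END - POS + 1). Note that nothing in this step moves the deletion, so the alignment of the result is whatever the caller reported.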

The underlying issue, then, is that our logic does not left-align the variant after creating it. This is indeed something we should fix.
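
For illustration, left-aligning an explicit deletion could be sketched as below. This is a simplified stand-in for what a normalizer such as bcftools norm does for this one case, not the project's implementation. It assumes a linear reference held in memory as a string; the function name and the toy sequence are hypothetical.

```python
from typing import Tuple


def left_align_deletion(reference: str, pos: int, end: int) -> Tuple[int, str, str]:
    """Shift a deletion as far left as possible.

    pos: 1-based anchor base immediately before the deleted segment.
    end: 1-based position of the last deleted base.
    Returns (new_pos, ref_allele, alt_allele).
    """
    # The deletion can move one base to the left whenever the anchor base
    # equals the last deleted base: removing either window leaves the
    # same resulting sequence.
    while pos > 1 and reference[pos - 1] == reference[end - 1]:
        pos -= 1
        end -= 1
    return pos, reference[pos - 1:end], reference[pos - 1]


# Toy reference with a CA repeat: the deletion of "CA" anchored at base 7
# slides left until the repeat ends.
toy_reference = "TTGCACACAGG"
print(left_align_deletion(toy_reference, 7, 9))  # (3, 'GCA', 'G')
```

The sketch ignores the circular chromosome and multi-allelic records, which a real normalizer would need to handle.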

However, for the current version of the output, I'm not sure I'll take any action. In essence, I would prefer that no variants from delly were included in genomic_coordinates, so my preferred fix would be to remove this one (and all others).

In case you already have that information at hand, could you please let me know how many variants in the genomic_coordinates are not correctly left-aligned, so that I can assess the extent of the issue?
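
One way to estimate that number without the reference genome is sketched below, under stated assumptions: a simple deletion whose ALT is the single anchor base can still be shifted left exactly when the first and last bases of REF are equal. The function names and the toy records are hypothetical, and the check ignores a deletion anchored at position 1.

```python
def is_left_shiftable_deletion(ref_allele: str, alt_allele: str) -> bool:
    """True for a simple deletion (ALT is the anchor base) that can still move left."""
    is_simple_deletion = (
        len(alt_allele) == 1
        and len(ref_allele) > 1
        and ref_allele[0] == alt_allele
    )
    # Anchor base == last deleted base means the deletion is not left-aligned.
    return is_simple_deletion and ref_allele[0] == ref_allele[-1]


def count_shiftable_deletions(vcf_text: str) -> int:
    """Count records in VCF-formatted text whose deletion is not left-aligned."""
    count = 0
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        ref_allele, alt_allele = fields[3], fields[4]  # REF and ALT columns
        if is_left_shiftable_deletion(ref_allele, alt_allele):
            count += 1
    return count


toy_records = """#CHROM POS ID REF ALT
toy 7 . ACA A
toy 3 . GCA G
toy 5 . A T
"""
print(count_shiftable_deletions(toy_records))  # 1
```

The variant reported in this issue has a REF allele that starts and ends with C and an ALT of C, so it would be counted by this check.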

Thank you for your cooperation

HillJamie commented 2 months ago

Thank you for a fast and thorough response.

> In case you already have that information at hand, could you please let me know how many variants in the genomic_coordinates are not correctly left-aligned, so that I can assess the extent of the issue?

I'm afraid I don't have that information. I simulated reads from 4 deletions with lengths >700 nt and 2 of them were not left-aligned. I additionally simulated 5 deletions with lengths >15nt and <30nt but all of these were left-aligned, possibly because it was not possible to shift their coordinates.

> I would prefer that no variants from delly were included in the genomic_coordinates

Is this because of concerns about Delly / internal logic to handle Delly? I'm curious, because it would seem that these variants can reliably be called.

sachalau commented 2 months ago

Smaller deletions (typically those shorter than 30nt) have most likely been generated by our other genotypers, so these underwent correct normalization, which includes left-alignment. Thank you for the estimates for the longer deletions; indeed, on second thought, most of them will be incorrect, as none underwent left-alignment.

> Is this because of concerns about Delly / internal logic to handle Delly? I'm curious, because it would seem that these variants can reliably be called.

The reason is not that I have concerns about this variant or delly itself; it is mostly that I think the added value of having these variants in the genomic_coordinates file is very close to zero for end users. The intent was that each genomic_coordinates entry should be matched against a variant found in a particular sample only when all 4 coordinates are identical (reference sequence id, position, reference allele, alternative allele). Thus, for deletions whose length will always be above a couple of hundred nucleotides, the likelihood of a user finding the exact same deletion (i.e. same boundaries) is, in my opinion, vanishingly small. My advice for people wanting to correctly detect the effect of large-scale deletions on drug resistance would be not to match them directly via genome coordinates but rather through a different logic (i.e. using a third-party tool to determine which genes are included in the deletion and then building the logic on this gene list).
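
The gene-based logic suggested above could be sketched as follows. This is only an illustration of the idea: the gene names and coordinates are made up, and in practice the gene intervals would come from an annotation file or a third-party annotation tool.

```python
from typing import Dict, List, Tuple


def genes_overlapping_deletion(
    deleted_start: int,
    deleted_end: int,
    genes: Dict[str, Tuple[int, int]],
) -> List[str]:
    """Return names of genes whose 1-based closed interval overlaps the deleted bases."""
    return sorted(
        name
        for name, (gene_start, gene_end) in genes.items()
        if gene_start <= deleted_end and deleted_start <= gene_end
    )


# Hypothetical gene coordinates, for illustration only.
toy_genes = {
    "geneA": (100, 500),
    "geneB": (600, 900),
    "geneC": (1000, 1200),
}

# Two deletions with different boundaries hit the same genes, even though
# they would never match each other on (id, position, REF, ALT).
print(genes_overlapping_deletion(450, 650, toy_genes))  # ['geneA', 'geneB']
print(genes_overlapping_deletion(480, 610, toy_genes))  # ['geneA', 'geneB']
```

The design point is that interval overlap is insensitive to the exact breakpoints, which is what makes it more robust than exact coordinate matching for large deletions.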

Of course, if the genomic coordinates were actually left-aligned, these chances would be slightly higher. This is something we will look at correcting for the next version, but we won't be updating the current files to correct it.