Closed pauline-ng closed 2 weeks ago
Hello! Though luck! Can you send a few examples of such call? I wonder if they are due to differences in coverage normalization. Older tiddit excluded the reference regions consisting of NNNNNNNNNN from the coverage calculation. Newer tiddit do not, hence there may be deletion calls representing misalignments across centromeres for instance. In older tiddit, those would probably have been reported as BND instead. These calls should be removed if you add a frequency database filter, but I may need to look in to this and do changes in TIDDIT. If you send a few calls I can see what can be done.
Best regards Jesper
Hi Jesper,
Thank you for looking into this. Here are the CNVs
I'm using the structural variant HG002 GiaB benchmark (build 37)
Benchmark Coords | Benchmark Length | TIDDITT 2.12.2 Coords | TIDDIT CNV Length |
---|---|---|---|
chr1:112,691,790 | 12,193 bp DEL | 1:112691794 | 12190 bp |
chr1:152555543 | 32,196 bp DEL | 152555543 | 32,199 bp |
In contrast: | Benchmark Coords | Benchmark Length | TIDDITT 3.7.0 Coords | TIDDIT CNV Length |
---|---|---|---|---|
chr1:112,691,790 | 12,193 bp DEL | Not detected | ||
chr1:152555543 | 32,196 bp DEL | chr1:152555543 | 32,198 bp | |
chr1:85980241 | 58,869,168 bp (false positive DEL) | |||
chr1:109598059 | 71,449,362 (false positive DEL) | |||
chr1:117571897 | 74,953,897 (false positive DEL) |
and the larger DELs (> 50 Mb) have QUAL PASS.
I can't share my bam file because it's generated by the company, but just running the 2 different TIDDIT versions on the same bam file, this command shows that there are a lot more DELs in the recent version compared to 2.12.2
bcftools view -i 'INFO/SVTYPE="DEL" && INFO/SVLEN > 1000000 && (%FILTER="PASS" | %FILTER=".")' tiddit.cnv.vcf.gz | grep -v ^# | wc -l
I appreciate your help.
Hello! And thanks for the details, this is very useful! I have started to download HG002, then I will have a look at these variants in IGV, and run TIDDIT on it.
My guess is that the additional deletion calls are due to the classification of variants crossing the centromeres and other regions lacking coverage; but I will check these calls manually to make sure.
I will also check why the deletion on chr1:112,691,790 is not detected, it could be missed due to many reasons.
I will keep you updated! Hopefully I will be back with updates to you and to tiddit by next week :P. Best regards Jesper
Hello! I have now made a new release of TIDDIT 3.8.0; it produces fewer large false positive deletions by excluding reference regions consisting of NNNNNN.... from the coverage calculations:
version 3.7.0:
[jesperei@sens2023005-bianca jesper]$ bcftools view -i "SVTYPE=='DEL' && SVLEN > 1000000" tiddit_3_7_0_HG002.vcf | grep -c PASS 97
version 3.8.0:
[jesperei@sens2023005-bianca jesper]$ bcftools view -i "SVTYPE=='DEL' && SVLEN > 1000000" tiddit_3_8_0_HG002.vcf | grep -c PASS 5
Some of those 97 variants will be filtered, but some may be classified as BND or INV. I inspected a few of these "deletions": a few of them looks nice and probably represent some sort of event (like duplications or transpositions), others appear to be reference errors or issues due to homology.
Good luck with your analyses!
By the way! I checked for the deletion at
chr1:112,691,790 12,193 bp DEL 1:112691794 12190 bp
It seems to be found in 3.7.0 and 3.8.0, do you see it if you run similar grep commands on your vcf?
[jesperei@sens2023005-bianca jesper]$ grep DEL tiddit_3_7_0_HG002.vcf | grep -P "1\t1126" | grep PASS
1 112691793 SV_168_1 N 50 PASS SVTYPE=DEL;SVLEN=12911;END=112704704;REGIONA=112691373,112691793;REGIONB=112704704,112705027;LFA=7,14;LFB=11,14;LTE=4,11;CTG=. GT:CN:COV:DV:RV:LQ:RR:DR 0/1:1:42.342042755344416,18.609153851365242,36.96296296296296:4:11:0.0,0.0:21,17:16,13
[jesperei@sens2023005-bianca jesper]$ grep DEL tiddit_3_8_0_HG002.vcf | grep -P "1\t1126" | grep PASS
1 112691793 SV_168_1 N 50 PASS SVTYPE=DEL;SVLEN=12911;END=112704704;REGIONA=112691373,112691793;REGIONB=112704704,112705027;LFA=7,14;LFB=11,14;LTE=4,11;CTG=. GT:CN:COV:DV:RV:LQ:RR:DR 0/1:1:42,18.609153851365242,36:4:11:0.0,0.0:21,17:16,13
Hello,
I have been using TIDDIT 2.12.2 and decided to update to the latest version. I noticed that much larger CNVs are being called. Using Genome in a Bottle HG002 as an example, I see 2.5x more deletions that are > 1 Mb, and these aren't believable. The older version 2.12.2 has higher recall on the HG002 CNVs using reciprocal overlap.
Any recommendations on how I can run the the latest version so its results more closely reflect the older version? I want to move off the 2.12.2 version because it fails on abnormal GRCh38 contig names.