SciLifeLab / TIDDIT

TIDDIT - structural variant calling
Other
62 stars 13 forks source link

Larger CNVs being called #117

Closed pauline-ng closed 2 weeks ago

pauline-ng commented 3 weeks ago

Hello,

I have been using TIDDIT 2.12.2 and decided to update to the latest version. I noticed that much larger CNVs are being called. Using Genome in a Bottle HG002 as an example, I see 2.5x more deletions that are > 1 Mb, and these aren't believable. The older version 2.12.2 has higher recall on the HG002 CNVs using reciprocal overlap.

Any recommendations on how I can run the the latest version so its results more closely reflect the older version? I want to move off the 2.12.2 version because it fails on abnormal GRCh38 contig names.

J35P312 commented 3 weeks ago

Hello! Though luck! Can you send a few examples of such call? I wonder if they are due to differences in coverage normalization. Older tiddit excluded the reference regions consisting of NNNNNNNNNN from the coverage calculation. Newer tiddit do not, hence there may be deletion calls representing misalignments across centromeres for instance. In older tiddit, those would probably have been reported as BND instead. These calls should be removed if you add a frequency database filter, but I may need to look in to this and do changes in TIDDIT. If you send a few calls I can see what can be done.

Best regards Jesper

pauline-ng commented 3 weeks ago

Hi Jesper,

Thank you for looking into this. Here are the CNVs

I'm using the structural variant HG002 GiaB benchmark (build 37)

Benchmark Coords Benchmark Length TIDDITT 2.12.2 Coords TIDDIT CNV Length
chr1:112,691,790 12,193 bp DEL 1:112691794 12190 bp
chr1:152555543 32,196 bp DEL 152555543 32,199 bp
In contrast: Benchmark Coords Benchmark Length TIDDITT 3.7.0 Coords TIDDIT CNV Length
chr1:112,691,790 12,193 bp DEL Not detected
chr1:152555543 32,196 bp DEL chr1:152555543 32,198 bp
chr1:85980241 58,869,168 bp (false positive DEL)
chr1:109598059 71,449,362 (false positive DEL)
chr1:117571897 74,953,897 (false positive DEL)

and the larger DELs (> 50 Mb) have QUAL PASS.

I can't share my bam file because it's generated by the company, but just running the 2 different TIDDIT versions on the same bam file, this command shows that there are a lot more DELs in the recent version compared to 2.12.2 bcftools view -i 'INFO/SVTYPE="DEL" && INFO/SVLEN > 1000000 && (%FILTER="PASS" | %FILTER=".")' tiddit.cnv.vcf.gz | grep -v ^# | wc -l I appreciate your help.

J35P312 commented 3 weeks ago

Hello! And thanks for the details, this is very useful! I have started to download HG002, then I will have a look at these variants in IGV, and run TIDDIT on it.

My guess is that the additional deletion calls are due to the classification of variants crossing the centromeres and other regions lacking coverage; but I will check these calls manually to make sure.

I will also check why the deletion on chr1:112,691,790 is not detected, it could be missed due to many reasons.

I will keep you updated! Hopefully I will be back with updates to you and to tiddit by next week :P. Best regards Jesper

J35P312 commented 2 weeks ago

Hello! I have now made a new release of TIDDIT 3.8.0; it produces fewer large false positive deletions by excluding reference regions consisting of NNNNNN.... from the coverage calculations:

version 3.7.0:

[jesperei@sens2023005-bianca jesper]$ bcftools view -i "SVTYPE=='DEL' && SVLEN > 1000000" tiddit_3_7_0_HG002.vcf | grep -c PASS 97

version 3.8.0:

[jesperei@sens2023005-bianca jesper]$ bcftools view -i "SVTYPE=='DEL' && SVLEN > 1000000" tiddit_3_8_0_HG002.vcf | grep -c PASS 5

Some of those 97 variants will be filtered, but some may be classified as BND or INV. I inspected a few of these "deletions": a few of them looks nice and probably represent some sort of event (like duplications or transpositions), others appear to be reference errors or issues due to homology.

Screenshot 2024-07-18 at 13 26 20

Good luck with your analyses!

J35P312 commented 2 weeks ago

By the way! I checked for the deletion at

chr1:112,691,790 12,193 bp DEL 1:112691794 12190 bp

It seems to be found in 3.7.0 and 3.8.0, do you see it if you run similar grep commands on your vcf?

[jesperei@sens2023005-bianca jesper]$ grep DEL tiddit_3_7_0_HG002.vcf | grep -P "1\t1126" | grep PASS 1 112691793 SV_168_1 N 50 PASS SVTYPE=DEL;SVLEN=12911;END=112704704;REGIONA=112691373,112691793;REGIONB=112704704,112705027;LFA=7,14;LFB=11,14;LTE=4,11;CTG=. GT:CN:COV:DV:RV:LQ:RR:DR 0/1:1:42.342042755344416,18.609153851365242,36.96296296296296:4:11:0.0,0.0:21,17:16,13

[jesperei@sens2023005-bianca jesper]$ grep DEL tiddit_3_8_0_HG002.vcf | grep -P "1\t1126" | grep PASS 1 112691793 SV_168_1 N 50 PASS SVTYPE=DEL;SVLEN=12911;END=112704704;REGIONA=112691373,112691793;REGIONB=112704704,112705027;LFA=7,14;LFB=11,14;LTE=4,11;CTG=. GT:CN:COV:DV:RV:LQ:RR:DR 0/1:1:42,18.609153851365242,36:4:11:0.0,0.0:21,17:16,13