fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing

Known SV reported in single-sample SV calling, but not in multisample SV calling (de_novo SV) #383

Open prasundutta87 opened 1 year ago

prasundutta87 commented 1 year ago

Hi,

My primary aim is to check for de novo SVs in a trio. A de novo SV in a known gene was detected by microarray. Multisample SV calling did not report it. However, when the proband was run in single-sample mode with --long-del-length 1000000, the SV was found. It was not found in the default single-sample mode (because the default --long-del-length value is capped at 50 kb and this deletion is longer than 50 kb). A visualization with samplot also shows a heterozygous deletion, although Sniffles genotyped the SV as homozygous.

The main issue is that in multisample SV calling mode, that SV was not detected at all. I even tried disabling the default QC with --qc-output-all, but this option does not seem to work there: the same number of SVs is detected with or without it in multisample mode, whereas it does work in single-sample mode. So I had no way to check why the deletion was being filtered out or not detected in multisample mode. These are the tests I ran in single-sample and multisample modes:

Single sample tests:

sniffles --input <BAM> --vcf <OUTPUT.vcf.gz> --long-del-length 1000000 --qc-output-all --symbolic --mapq 5 --minsupport 1
Result: DEL Found

sniffles --input <BAM> --vcf <OUTPUT.vcf.gz> --symbolic -t 16
Result: DEL Not Found

sniffles <BAM> --vcf <OUTPUT.vcf.gz> --symbolic --long-del-length 1000000 -t 16
Result: DEL Found

sniffles --input <BAM> --tandem-repeats human_GRCh38_no_alt_analysis_set.trf.bed --vcf <OUTPUT.vcf.gz> --symbolic --long-del-length 1000000 -t 16
Result: DEL Found

Multisample tests:

sniffles --input <Trio SNF files> --tandem-repeats human_GRCh38_no_alt_analysis_set.trf.bed --reference GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna --vcf <OUTPUT.vcf.gz> --symbolic --long-del-length 1000000 -t 16
Result: DEL Not Found

sniffles --input <Trio SNF files> --tandem-repeats human_GRCh38_no_alt_analysis_set.trf.bed --reference GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna --vcf <OUTPUT.vcf.gz> --long-del-length 1000000 --qc-output-all --symbolic --mapq 5 --minsupport 1 -t 16
Result: DEL Not Found

sniffles --input <Trio SNF files> --tandem-repeats human_GRCh38_no_alt_analysis_set.trf.bed --reference GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna --vcf <OUTPUT.vcf.gz> --symbolic --long-del-length 1000000 --qc-output-all -t 16
Result: DEL Not Found

So, there are three issues:

1) The known deletion is only found in single-sample mode.
2) The wrong genotype is assigned, although I can understand why (GT:GQ:DR:DV 1/1:12:1:8). cuteSV genotyped it correctly as heterozygous, with 4 reference-supporting reads.
3) --qc-output-all does not work in multisample mode.

Any inputs/suggestions on this will be greatly helpful. Some good SVs could be missed because of this.

Regards, Prasun

PS: Sniffles version 2.0.6
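
The genotype in issue 2 can be sanity-checked from the DR (reference-supporting reads) and DV (variant-supporting reads) fields quoted above. The following is a minimal sketch, not the genotyping model used by Sniffles or cuteSV: the function names and the 0.2/0.8 cutoffs are hypothetical and chosen only for illustration. The second call assumes cuteSV saw the same 8 variant reads alongside its 4 reference reads, which the report does not state.

```python
def allele_fraction(dr: int, dv: int) -> float:
    """Fraction of reads supporting the variant: DV / (DR + DV)."""
    total = dr + dv
    if total == 0:
        raise ValueError("no reads cover the site")
    return dv / total


def naive_genotype(dr: int, dv: int,
                   het_min: float = 0.2, hom_min: float = 0.8) -> str:
    """Assign a genotype from the allele fraction.

    The cutoffs are illustrative assumptions only; they are NOT the
    thresholds or likelihood model used by Sniffles or cuteSV.
    """
    af = allele_fraction(dr, dv)
    if af >= hom_min:
        return "1/1"
    if af >= het_min:
        return "0/1"
    return "0/0"


# Sniffles record from the report: DR=1, DV=8
print(round(allele_fraction(1, 8), 3), naive_genotype(1, 8))

# Assumed counts: 4 reference reads (as reported for cuteSV),
# paired with the same 8 variant reads (an assumption)
print(round(allele_fraction(4, 8), 3), naive_genotype(4, 8))
```

With only one reference-supporting read, the allele fraction is 8/9, so a homozygous call is unsurprising under almost any cutoff. A few more reference reads at the breakpoints would be enough to move the call to heterozygous.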

prasundutta87 commented 1 year ago

Just wondering if --long-del-length 1000000 needs to be set in the SNF generation step. Will try this out.

prasundutta87 commented 1 year ago

Hi, I tried it out and the deletion is now reported in multisample mode as well. It makes sense, since the SV signatures are already QC'ed when they are written to the SNF file.
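
For anyone hitting the same problem, the two-step workflow described in this thread could be sketched roughly as follows. This is a dry run that only prints the command lines; it does not call Sniffles. The sample names, file names, and helper functions are hypothetical, and only the options relevant here are included (the --tandem-repeats, --reference, and --symbolic options from the tests above are omitted for brevity).

```python
import shlex

LONG_DEL_LENGTH = 1_000_000  # value used in the report above


def snf_command(bam: str, snf: str, threads: int = 16) -> list[str]:
    """Per-sample step: write an SNF file with the raised deletion cap."""
    return ["sniffles", "--input", bam, "--snf", snf,
            "--long-del-length", str(LONG_DEL_LENGTH), "-t", str(threads)]


def merge_command(snfs: list[str], vcf: str, threads: int = 16) -> list[str]:
    """Multisample step: combine the per-sample SNF files into one VCF."""
    return ["sniffles", "--input", *snfs, "--vcf", vcf,
            "--long-del-length", str(LONG_DEL_LENGTH), "-t", str(threads)]


# Hypothetical trio; replace with real BAM paths
samples = {"proband": "proband.bam",
           "mother": "mother.bam",
           "father": "father.bam"}

snf_files = [f"{name}.snf" for name in samples]
commands = [snf_command(bam, f"{name}.snf") for name, bam in samples.items()]
commands.append(merge_command(snf_files, "trio.vcf.gz"))

# Print the commands instead of executing them
for cmd in commands:
    print(shlex.join(cmd))
```

The point of the sketch is only that --long-del-length appears in the per-sample SNF step, not just in the final multisample call, since that is what made the deletion show up here.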

Just a small additional query: is it still advisable to provide the tandem repeat annotations during SNF generation, or is it okay to add them only when the actual multisample VCF is generated? The supplementary figure of the SNF generation workflow in the paper adds the tandem repeat annotations in the SNF generation step. Also, in order to capture other putative insertions and duplications, is it advisable to change --long-ins-length and --long-dup-length to 1000000 too? I am concerned that changing the defaults too much may give some spurious results as well.