Open AnneBoshove opened 2 years ago
Hi, yes, duphold should work very well on long read data. a lot of those do seem extreme. How deep is the coverage for this sample? I would look at a few of them in IGV. If this is the worst of thousands of variants, it doesn't seem too surprising and is probably because of weird regions (I see MUC13 in there, for example). If this is a random subset of variants or representative of the results, then something else must be going on.
Thanks for your reply @brentp
The coverage should be around 120X as 3 flowcells were used in the sequencing. Unfortunately this is just a random sample so not the worst, i've seen several values around 300/400, the highest i've seen was around 900. Values around 100 are quite common in my output. Looking at the results in a genome browser is a bit difficult for me as my device isn't the greatest and struggles with loading/displaying such large files, but I will try.
Could it be that something is off about the format of my VCF that I used as input for duphold? They were ran through sniffles and then VEP, i'll give a few sample lines:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LWlongreadsMD.sorted.bam
Chr1 267634 0 N AGTGTCTCATGTCCGGGTCCGTATCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCCACGTTGTCTCCATGTCCCGTGTCCGTGTCTCATGTCGGCGTCCGTGTCTCATGTCCTGCTCGTGTCCGTGTCTCAGGTCTGGGTCCTGTGTCCGGGTCCGTGTCCCGTGTCCCGTGTCCGTGTCCTGTCCACAGTGTCCGTGTCTCATGTCCGGGTCCCGTGTCCGTGTCCCATGTCCATGTCCAG . PASS IMPRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=267781;STD_quant_start=12.555591;STD_quant_stop=74.402381;Kurtosis_quant_start=-1.235287;Kurtosis_quant_stop=-1.844707;SVTYPE=INS;SUPTYPE=AL;SVLEN=246;STRANDS=+-;STRANDS2=19,10,19,10;RE=29;REF_strand=36,21;Strandbias_pval=1;AF=0.337209;CSQ=insertion|downstream_gene_variant|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000070783|lncRNA|||||||||||1725|1||||LW.gtf.gz|,insertion|intron_variant&non_coding_transcript_variant&feature_elongation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000072648|lncRNA||1/1||||||||||1||||LW.gtf.gz|,insertion|intron_variant&non_coding_transcript_variant&feature_elongation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000091536|lncRNA||1/1||||||||||1||||LW.gtf.gz| GT:DR:DV 0/1:57:29
Chr1 267767 1 TGGGTCCTGTGTCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCTCAGGT N . PASS PRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=267818;STD_quant_start=3.683942;STD_quant_stop=3.693624;Kurtosis_quant_start=0.301027;Kurtosis_quant_stop=0.380196;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-51;STRANDS=+-;STRANDS2=17,11,17,11;RE=28;REF_strand=36,21;Strandbias_pval=1;AF=0.329412;CSQ=deletion|downstream_gene_variant|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000070783|lncRNA|||||||||||1858|1||||LW.gtf.gz|,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000072648|lncRNA||1/1||||||||||1||||LW.gtf.gz|,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000091536|lncRNA||1/1||||||||||1||||LW.gtf.gz| GT:DR:DV 0/1:57:28
Chr1 294446 2 GTGTGCGTGTGTTGTGTGTGGGGTGTGTGTGTGTGTGGGGTGTGTGTGTGGTGTGTGTGTTGTGTGTTTGTGGTGTGTGTGGTGTGTATGTG N . PASS IMPRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=294538;STD_quant_start=17.472056;STD_quant_stop=21.153121;Kurtosis_quant_start=-0.231708;Kurtosis_quant_stop=0.206432;SVTYPE=DEL;SUPTYPE=AL,NR;SVLEN=-92;STRANDS=+-;STRANDS2=17,5,17,5;RE=22;REF_strand=33,19;Strandbias_pval=0.288655;AF=0.297297;CSQ=deletion|intergenic_variant|MODIFIER|||||||||||||||||||||| GT:DR:DV 0/0:52:22
Hi, perhaps you can plot a few of these with samplot? If you show the result here, I might be able to tell more what's going on.
I managed to look at the data in JBrowse and looked at a few of the SVs with high numbers;
This is the following DUP:
DUP | Chr6 | 165030128 | 165048325 | 18197 | IMPRECISE | 149 | 0/1 | transcript_amplification | ENSSSCG00000003889
DHFC: 962.963 | DHFFC: 597.701 | DHBFC: 981.132
And here's some deletions
DEL Chr13 73763246 73770714 -7468 PRECISE 34 0/1 transcript_ablation ENSSSCG00000045067 DHFC: 310.714 DHFFC: 263.636 DHBFC: 316.364
DEL Chr2 122165466 122173123 -7657 PRECISE 31 0/1 coding_sequence_variant&intron_variant&feature_truncation COMMD10 ENSSSCG00000014223 161.818 143.548 161.818
Only thing that seems a bit strange to me it that those deletions have a higher coverage, I don't see what it could be about the duplication that gives it such high values though.
Yes, something is wrong. Would be nice to see these in IGV or samplot. If you can share one chromosome (bam,vcf) then I can have a look.
I was having trouble attaching the files here, so i've send them to you in the mail. Maybe good to know that this alignment is done by mapping long reads against an assembly that was made with the same long reads, to find heterozygous SVs. Thanks for all the help!
thanks, can you also send the reference fasta via that service?
Yes, you should have received it
got it. will go through this in next few days.
Hi,
I used duphold on some VCFs containing SVs called with Sniffles where the alignment was done using minimap2. The data used is all nanopore longread data. I was wondering if duphold is supposed to work properly on long read data? As quite a bit of the results seemed off/incorrect with very high numbers. Here's a sample of some of the results (parsed to obtain the info relevant for me):
SVtype | #CHROM | POS | End | SVLength | Precision | RE | Genotype | Consequence | GeneSymbol | EnsemblID | DHFC | DHFFC | DHBFC
DEL | Chr3 | 21648500 | 21946887 | -298387 | IMPRECISE | 10 | 01/Jan | coding_sequence_variant&5_prime_UTR_variant&intron_variant&feature_truncation | ARHGAP17 | ENSSSCG00000031488 | 181.481 | 101.031 | 178.182 DEL | Chr5 | 90902052 | 90907562 | -5510 | PRECISE | 39 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | CDK17 | ENSSSCG00000028182 | 0.963636 | 0.386861 | 0.946429 DEL | Chr2 | 122165466 | 122173123 | -7657 | PRECISE | 31 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | COMMD10 | ENSSSCG00000014223 | 161.818 | 143.548 | 161.818 INS | Chr6 | 13229037 | 13229037 | 46 | PRECISE | 21 | 0/1 | coding_sequence_variant&feature_elongation | GLG1 | ENSSSCG00000030420 | 0.814815 | 1 | 0.814815 DEL | Chr10 | 66054310 | 66056761 | -2451 | PRECISE | 62 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | ITIH5 | ENSSSCG00000011129 | 0.909091 | 0.438596 | 0.925926 DEL | Chr2 | 28721000 | 28772243 | -51243 | PRECISE | 69 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | KIAA1549L | ENSSSCG00000013311 | 143.636 | 0.153696 | 143.636 INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 130 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815 INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 143 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815 DEL | ChrX | 90375498 | 90376075 | -577 | PRECISE | 232 | 01/Jan | coding_sequence_variant&intron_variant&feature_truncation | MID2 | ENSSSCG00000012564 | 117.857 | 0.103774 | 117.857 INS | Chr13 | 135062890 | 135063265 | 1039 | IMPRECISE | 20 | 0/1 | coding_sequence_variant&intron_variant&feature_elongation | MUC13 | ENSSSCG00000011862 | 0.75 | 0.976744 | 0.792453
Thanks in advance,
Kind regards, Anne