brentp / duphold

don't get DUP'ed or DEL'ed by your putative SVs.
MIT License
101 stars 9 forks source link

Duphold with Nanopore data? #45

Open AnneBoshove opened 2 years ago

AnneBoshove commented 2 years ago

Hi,

I used duphold on some VCFs containing SVs called with Sniffles where the alignment was done using minimap2. The data used is all nanopore longread data. I was wondering if duphold is supposed to work properly on long read data? As quite a bit of the results seemed off/incorrect with very high numbers. Here's a sample of some of the results (parsed to obtain the info relevant for me):

SVtype | #CHROM | POS | End | SVLength | Precision | RE | Genotype | Consequence | GeneSymbol | EnsemblID | DHFC | DHFFC | DHBFC

DEL | Chr3 | 21648500 | 21946887 | -298387 | IMPRECISE | 10 | 01/Jan | coding_sequence_variant&5_prime_UTR_variant&intron_variant&feature_truncation | ARHGAP17 | ENSSSCG00000031488 | 181.481 | 101.031 | 178.182 DEL | Chr5 | 90902052 | 90907562 | -5510 | PRECISE | 39 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | CDK17 | ENSSSCG00000028182 | 0.963636 | 0.386861 | 0.946429 DEL | Chr2 | 122165466 | 122173123 | -7657 | PRECISE | 31 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | COMMD10 | ENSSSCG00000014223 | 161.818 | 143.548 | 161.818 INS | Chr6 | 13229037 | 13229037 | 46 | PRECISE | 21 | 0/1 | coding_sequence_variant&feature_elongation | GLG1 | ENSSSCG00000030420 | 0.814815 | 1 | 0.814815 DEL | Chr10 | 66054310 | 66056761 | -2451 | PRECISE | 62 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | ITIH5 | ENSSSCG00000011129 | 0.909091 | 0.438596 | 0.925926 DEL | Chr2 | 28721000 | 28772243 | -51243 | PRECISE | 69 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | KIAA1549L | ENSSSCG00000013311 | 143.636 | 0.153696 | 143.636 INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 130 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815 INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 143 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815 DEL | ChrX | 90375498 | 90376075 | -577 | PRECISE | 232 | 01/Jan | coding_sequence_variant&intron_variant&feature_truncation | MID2 | ENSSSCG00000012564 | 117.857 | 0.103774 | 117.857 INS | Chr13 | 135062890 | 135063265 | 1039 | IMPRECISE | 20 | 0/1 | coding_sequence_variant&intron_variant&feature_elongation | MUC13 | ENSSSCG00000011862 | 0.75 | 0.976744 | 0.792453

Thanks in advance,

Kind regards, Anne

brentp commented 2 years ago

Hi, yes, duphold should work very well on long read data. a lot of those do seem extreme. How deep is the coverage for this sample? I would look at a few of them in IGV. If this is the worst of thousands of variants, it doesn't seem too surprising and is probably because of weird regions (I see MUC13 in there, for example). If this is a random subset of variants or representative of the results, then something else must be going on.

AnneBoshove commented 2 years ago

Thanks for your reply @brentp

The coverage should be around 120X as 3 flowcells were used in the sequencing. Unfortunately this is just a random sample so not the worst, i've seen several values around 300/400, the highest i've seen was around 900. Values around 100 are quite common in my output. Looking at the results in a genome browser is a bit difficult for me as my device isn't the greatest and struggles with loading/displaying such large files, but I will try.

Could it be that something is off about the format of my VCF that I used as input for duphold? They were ran through sniffles and then VEP, i'll give a few sample lines:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LWlongreadsMD.sorted.bam
Chr1    267634  0       N       AGTGTCTCATGTCCGGGTCCGTATCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCCACGTTGTCTCCATGTCCCGTGTCCGTGTCTCATGTCGGCGTCCGTGTCTCATGTCCTGCTCGTGTCCGTGTCTCAGGTCTGGGTCCTGTGTCCGGGTCCGTGTCCCGTGTCCCGTGTCCGTGTCCTGTCCACAGTGTCCGTGTCTCATGTCCGGGTCCCGTGTCCGTGTCCCATGTCCATGTCCAG    .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=267781;STD_quant_start=12.555591;STD_quant_stop=74.402381;Kurtosis_quant_start=-1.235287;Kurtosis_quant_stop=-1.844707;SVTYPE=INS;SUPTYPE=AL;SVLEN=246;STRANDS=+-;STRANDS2=19,10,19,10;RE=29;REF_strand=36,21;Strandbias_pval=1;AF=0.337209;CSQ=insertion|downstream_gene_variant|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000070783|lncRNA|||||||||||1725|1||||LW.gtf.gz|,insertion|intron_variant&non_coding_transcript_variant&feature_elongation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000072648|lncRNA||1/1||||||||||1||||LW.gtf.gz|,insertion|intron_variant&non_coding_transcript_variant&feature_elongation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000091536|lncRNA||1/1||||||||||1||||LW.gtf.gz| GT:DR:DV        0/1:57:29
Chr1    267767  1       TGGGTCCTGTGTCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCTCAGGT     N       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=267818;STD_quant_start=3.683942;STD_quant_stop=3.693624;Kurtosis_quant_start=0.301027;Kurtosis_quant_stop=0.380196;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-51;STRANDS=+-;STRANDS2=17,11,17,11;RE=28;REF_strand=36,21;Strandbias_pval=1;AF=0.329412;CSQ=deletion|downstream_gene_variant|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000070783|lncRNA|||||||||||1858|1||||LW.gtf.gz|,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000072648|lncRNA||1/1||||||||||1||||LW.gtf.gz|,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000091536|lncRNA||1/1||||||||||1||||LW.gtf.gz|  GT:DR:DV        0/1:57:28
Chr1    294446  2       GTGTGCGTGTGTTGTGTGTGGGGTGTGTGTGTGTGTGGGGTGTGTGTGTGGTGTGTGTGTTGTGTGTTTGTGGTGTGTGTGGTGTGTATGTG    N       .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=294538;STD_quant_start=17.472056;STD_quant_stop=21.153121;Kurtosis_quant_start=-0.231708;Kurtosis_quant_stop=0.206432;SVTYPE=DEL;SUPTYPE=AL,NR;SVLEN=-92;STRANDS=+-;STRANDS2=17,5,17,5;RE=22;REF_strand=33,19;Strandbias_pval=0.288655;AF=0.297297;CSQ=deletion|intergenic_variant|MODIFIER||||||||||||||||||||||      GT:DR:DV        0/0:52:22
brentp commented 2 years ago

Hi, perhaps you can plot a few of these with samplot? If you show the result here, I might be able to tell more what's going on.

AnneBoshove commented 2 years ago

I managed to look at the data in JBrowse and looked at a few of the SVs with high numbers;

afbeelding

afbeelding

This is the following DUP:

DUP | Chr6 | 165030128 | 165048325 | 18197 | IMPRECISE | 149 | 0/1 | transcript_amplification | ENSSSCG00000003889
DHFC: 962.963 | DHFFC: 597.701 | DHBFC: 981.132

And here's some deletions

DEL Chr13 73763246 73770714 -7468 PRECISE 34 0/1 transcript_ablation ENSSSCG00000045067 DHFC: 310.714 DHFFC: 263.636 DHBFC: 316.364

afbeelding

DEL Chr2 122165466 122173123 -7657 PRECISE 31 0/1 coding_sequence_variant&intron_variant&feature_truncation COMMD10 ENSSSCG00000014223 161.818 143.548 161.818

afbeelding

Only thing that seems a bit strange to me it that those deletions have a higher coverage, I don't see what it could be about the duplication that gives it such high values though.

brentp commented 2 years ago

Yes, something is wrong. Would be nice to see these in IGV or samplot. If you can share one chromosome (bam,vcf) then I can have a look.

AnneBoshove commented 2 years ago

I was having trouble attaching the files here, so i've send them to you in the mail. Maybe good to know that this alignment is done by mapping long reads against an assembly that was made with the same long reads, to find heterozygous SVs. Thanks for all the help!

brentp commented 2 years ago

thanks, can you also send the reference fasta via that service?

AnneBoshove commented 2 years ago

Yes, you should have received it

brentp commented 2 years ago

got it. will go through this in next few days.