brentp / duphold

don't get DUP'ed or DEL'ed by your putative SVs.
MIT License
100 stars 9 forks source link

Logging information ended up in output file #11

Closed wdecoster closed 5 years ago

wdecoster commented 5 years ago

Hi Brent,

I ran the following command on a bunch of (single sample) vcf files and their bam file: ls *.vcf | parallel 'duphold --bam {.}.bam --fasta genome.fna.gz --vcf {} > {.}_dh.vcf'

I'm now post-processing these duphold-annotated vcfs for filtering using bcftools and tabix, and get a bunch of warnings and error which I'm looking into.

The first one is generated by bcftools sort:

[E::vcf_parse_format] Incorrect number of FORMAT fields at chr15:101764387

grepping for that line with -C 5 gives me:

chr15   101759833   23348   N   <DEL>   .   PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101760124;STD_quant_start=45;STD_quant_stop=78;Kurtosis_quant_start=2;Kurtosis_quant_stop=1;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-291;STRANDS=+-;RE=14;REF_strand=0,0;AF=1;GCF=0.592466    GT:DR:DV:DHFC:DHFFC:DHBFC   1/1:0:14:1.37143:0.979592:1.2973
chr15   101760689   23349   N   <DEL>   .   PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101760777;STD_quant_start=39;STD_quant_stop=36;Kurtosis_quant_start=-1;Kurtosis_quant_stop=-1;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-88;STRANDS=+-;RE=19;REF_strand=1,0;AF=0.95;GCF=0.550562    GT:DR:DV:DHFC:DHFFC:DHBFC   1/1:1:19:1.17143:0.836735:1.10811
chr15   101762149   23350   N   <DEL>   .   PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101762529;STD_quant_start=62;STD_quant_stop=173;Kurtosis_quant_start=0;Kurtosis_quant_stop=0;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-380;STRANDS=+-;RE=16;REF_strand=0,3;AF=0.842105;GCF=0.585302    GT:DR:DV:DHFC:DHFFC:DHBFC   1/1:3:16:1.14286:0.851064:1.08108
chr15   101763139   23351   N   <DEL>   .   PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101763229;STD_quant_start=19;STD_quant_stop=19;Kurtosis_quant_start=6;Kurtosis_quant_stop=5;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-90;STRANDS=+-;RE=12;REF_strand=1,1;AF=0.857143;GCF=0.571429  GT:DR:DV:DHFC:DHFFC:DHBFC   1/1:2:12:0.942857:0.717391:0.891892
chr15   101764281   23352   N   <DEL>   .   PASS    PRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101764346;STD_quant_start=7;STD_quant_stop=6;Kurtosis_quant_start=0;Kurtosis_quant_stop=0;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-65;STRANDS=+-;RE=26;REF_strand=2,0;AF=0.928571;GCF=0.606061  GT:DR:DV:DHFC:DHFFC:DHBFC   1/1:2:26:0.857143:0.769231:0.810811
chr15   101764387   23353   N   <DEL>   .   PASS    PRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101764484;STD_quant_start=3;STD_quant_stop=2;Kurtosis_quant_start=-1;Kurtosis_quant_stop=-1;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-97;STRANDS=+-;RE=22;REF_strand=0,0;AF=1;GCF=0.622449   GT:DR:DV:DHFC:DHFFC:DHBFC   1/1:0:22:0.828571:0.763158:0.783784I, [2018-11-25T16:58:37] -- duphold: no mates found. assuming single end reads
I, [2018-11-25T16:58:38] -- duphold: starting read of bam values for chrom: chr1
I, [2018-11-25T17:02:22] -- duphold: finished reading:chr1. starting gc-content, discordant sorting
I, [2018-11-25T17:02:29] -- duphold: finished gc-content, sorting
I, [2018-11-25T17:02:30] -- duphold: starting read of bam values for chrom: chr2
I, [2018-11-25T17:07:40] -- duphold: finished reading:chr2. starting gc-content, discordant sorting
<etc>

I suck at reading instructions, and it seems be that my input SV vcf file is not sorted. I'll repeat things there and see if the error reproduces, but I'm quite surprised to see these logging messages in the output file.

Cheers, Wouter

wdecoster commented 5 years ago

Quick update: sorting does not seem to prevent this printing of logs. I'll try again with explicitly specifying output with -o.

brentp commented 5 years ago

duphold always outputs those messages to stderr. your parallel command must somehow be redirecting stderr to stdout.

wdecoster commented 5 years ago

Hrm, possible, but that's something I haven't encountered before... Strange. In any case, when using -o everything is normal. Nice increase of precision (simulated pacbio data, called with sniffles) but a substantial reduction in recall (75% prior to filtering, 28% afterwards). I'll investigate further.