brentp / duphold

don't get DUP'ed or DEL'ed by your putative SVs.
MIT License
101 stars 9 forks source link

No error generated but duphold output missing last few lines. #25

Open jjfarrell opened 5 years ago

jjfarrell commented 5 years ago

I am running duphold on about 5000 crams. About 120 or so are finishing but the output vcf is not complete. It ends prematurely near the end of chromosome 22. There are only a few lines left to process of the imput vcf.

The last full vcf line output varies but all near the end. 1 50796757 24 50797069 24 50797489 7 50797585 7 50797605 4 50797606 11 50798225 14 50798764 25 50798783 20 50799200

here is an example of the end of one file...

chr22   50797069        rs201907533     GA      G       .       PASS    AVGCOV=7;MINCOV=7;ALTCOV=0;ZYG=hom;COVRATIO=1;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1579delA||||||INFO_REALIGN_3_PRIME;CAF=[0.9876,0.0124];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201907533;RSPOS=51235498;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137;GCF=0.254902   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.868421:1.17857:0.825:0
chr22   5079748

Any suggestions?

brentp commented 5 years ago

did those exit with 0? do you see: "[duphold] finished" in the stderr of those jobs? did you allocate enough time, memory for the job to finish on the nodes?

make sure you're using the latest duphold. I don't recall any changes relevant here, but it's always good to have the most recent.

jjfarrell commented 5 years ago

I am getting a [duphold] finished at the end of the run. I gave it plenty of time. The most recent version (version: 0.1.4) is being used.

Below is the job output. The tabix command runs after the duphold program finishes which gets the error seen.

This are the two command lines I am using:

duphold -f /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -v adsp5k.scalpel.genes.indel.norm.ann.vcf.gz -d -b $CRAM|bgzip -c > output/$SAMPLE.vcf.gz
tabix -p vcf output/$SAMPLE.vcf.gz

The output is:

more duphold.sh.o8967855
[duphold] finished
[E::hts_idx_push] Unsorted positions on sequence #22: 50797069 followed by 5079748
tbx_index_build failed: output/A-WCAP-WC000752-BL-COL-51868BL1.vcf.gz

duphold works fine for the 4600 other crams.

brentp commented 5 years ago

is the error intermittent? or does it occur every time for these same cram files?

jjfarrell commented 5 years ago

I had rerun them and had the same issue. I will try increasing the memory available on the nodes and see it that helps.

brentp commented 5 years ago

I doubt it's the memory, but I can't think of anything else. can you share the vcf+bam for one of the failing samples?

brentp commented 5 years ago

and you are running all samples with -v adsp5k.scalpel.genes.indel.norm.ann.vcf.gz?

jjfarrell commented 5 years ago

Yes, the same script is being used with the same vcf containing all the variants in the 5k samples.

I can not share these files but I will see if I can create a smaller test cram and vcf that recreates the issue.

I reran one on a node with plenty of memory writing it directly to a vcf. Still an error but not exactly at the same spot but close. So as you suggested not a memory issue.

tail output/test,vcf|awk '{print NF "\t" $0}'

10      chr22   50794791        .       CT      C       .       LowAltCnt       AVGCOV=3;MINCOV=3;ALTCOV=21;ZYG=het;COVRATIO=0.12;CHI2=13.5;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3863delT||||||;GCF=0.431373   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.03922:1.2619:1.01923:3
10      chr22   50794884        .       AC      A       .       PASS    AVGCOV=10;MINCOV=10;ALTCOV=22;ZYG=het;COVRATIO=0.31;CHI2=4.5;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3770delC||||||;GCF=0.558824  GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.09804:1.30233:1.16667:3
10      chr22   50795110        .       AT      A       .       LowAltCnt;HighChi2score AVGCOV=3;MINCOV=3;ALTCOV=32;ZYG=het;COVRATIO=0.09;CHI2=24.03;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3544delT||||||;GCF=0.45098   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.745098:0.863636:0.730769:3
10      chr22   50795909        .       AT      A       .       PASS    AVGCOV=17;MINCOV=17;ALTCOV=21;ZYG=het;COVRATIO=0.45;CHI2=0.42;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2741delT||||||INFO_REALIGN_3_PRIME;GCF=0.254902     GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.01961:1.33333:0.83871:0
10      chr22   50795925        .       AAAGT   A       .       PASS    AVGCOV=19;MINCOV=19;ALTCOV=27;ZYG=het;COVRATIO=0.41;CHI2=1.39;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2726_333-2723delTAAG||||||INFO_REALIGN_3_PRIME;GCF=0.247619 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.01961:1.33333:0.83871:0
10      chr22   50796118        .       A       ACATACTAT       .       PASS    AVGCOV=20;MINCOV=20;ALTCOV=22;ZYG=het;COVRATIO=0.48;CHI2=0.1;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=ACATACTAT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2534_333-2527dupTACTATCA||||||INFO_REALIGN_3_PRIME;GCF=0.336634      GT:AD:DP:DHFC:DHFFC:DHBFC       0/0:.:.:0.843137:1.16216:0.741379
10      chr22   50796247        .       GA      G       .       PASS    AVGCOV=10;MINCOV=10;ALTCOV=13;ZYG=het;COVRATIO=0.43;CHI2=0.39;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2406delA||||||INFO_REALIGN_3_PRIME;GCF=0.470588     GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.01961:1.44444:1:0
10      chr22   50796312        .       GT      G       .       PASS    AVGCOV=3;MINCOV=3;ALTCOV=11;ZYG=het;COVRATIO=0.21;CHI2=4.57;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2342delT||||||;GCF=0.460784   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.607843:0.837838:0.596154:0
10      chr22   50796312        .       GTC     G       .       PASS    AVGCOV=6;MINCOV=6;ALTCOV=16;ZYG=het;COVRATIO=0.27;CHI2=4.55;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2342_333-2341delTC||||||;GCF=0.456311 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.607843:0.837838:0.596154:0
8       chr22   50796312        .       GTCCC   G       .       LowAltCnt       AVGCOV=4;MINCOV=4;ALTCOV=11;ZYG=het;COVRATIO=0.27;CHI2=3.27;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00
brentp commented 5 years ago

instead of piping to bgzip, can you instead just use -o $sample.vcf.gz? that will be faster anyway and rule out piping issues.

jjfarrell commented 5 years ago

That fixed it on the test sample! Must be some issue with closing the pipe. Thanks for pin pointing the issue so quickly.

I check a few of the samples that were not getting an error with tabix. Not all were getting the last few variants. They probably just happened to end with an EOL. So I will re-run it on all the samples.

tabix output/ADNI_002_S_4229.vcf.gz chr22:50799200|tail|awk '{print NF "\t" $0}'

10      chr22   50799200        .       CAACTT  C       .       PASS    AVGCOV=9;MINCOV=9;ALTCOV=9;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.881_885delCTTAA||||||INFO_REALIGN_3_PRIME;GCF=0.216981     GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.196078:0.277778:0.16129:1
10      chr22   50799573        .       CAT     C       .       LowAltCnt       AVGCOV=3;MINCOV=3;ALTCOV=15;ZYG=het;COVRATIO=0.17;CHI2=8;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.1253_1254delTA||||||INFO_REALIGN_3_PRIME;GCF=0.291262    GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.627451:0.842105:0.52459:0
[farrell@scc-hadoop duphold]$  zcat  output/ADNI_002_S_4229.vcf.gz|tail|awk '{print NF "\t" $0}'
10      chr22   50797069        rs201907533     GA      G       .       PASS    AVGCOV=7;MINCOV=7;ALTCOV=0;ZYG=hom;COVRATIO=1;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1579delA||||||INFO_REALIGN_3_PRIME;CAF=[0.9876,0.0124];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201907533;RSPOS=51235498;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137;GCF=0.254902      GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP   0/0:.:.:0.431373:0.814815:0.354839:0
10      chr22   50797489        .       CAG     C       .       PASS    AVGCOV=8;MINCOV=8;ALTCOV=16;ZYG=het;COVRATIO=0.33;CHI2=2.67;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1165_333-1164delAG||||||;GCF=0.378641 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.529412:1.17391:0.490909:2
10      chr22   50797585        rs200507571     A       AT      .       PASS    AVGCOV=12;MINCOV=12;ALTCOV=15;ZYG=het;COVRATIO=0.44;CHI2=0.33;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=AT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1064dupT||||||INFO_REALIGN_3_PRIME;CAF=[0.8962,0.1038];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;NOC;RS=200507571;RSPOS=51236013;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000210;WGT=1;dbSNPBuildID=137;GCF=0.306931        GT:AD:DP:DHFC:DHFFC:DHBFC        0/0:.:.:0.627451:1.3913:0.52459
10      chr22   50797605        .       C       CATTT   .       LowAltCnt;HighChi2score AVGCOV=3;MINCOV=3;ALTCOV=83;ZYG=het;COVRATIO=0.03;CHI2=74.42;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=CATTT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1032_333-1029dupTTTA||||||INFO_REALIGN_3_PRIME;GCF=0.366337      GT:AD:DP:DHFC:DHFFC:DHBFC       0/0:.:.:0.627451:1.3913:0.551724
10      chr22   50797606        .       ATTTAT  A       .       PASS    AVGCOV=16;MINCOV=16;ALTCOV=20;ZYG=het;COVRATIO=0.44;CHI2=0.44;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1046_333-1042delTATTT||||||INFO_REALIGN_3_PRIME;GCF=0.377358        GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.627451:1.3913:0.581818:2
10      chr22   50798225        .       ATACT   A       .       PASS    AVGCOV=8;MINCOV=8;ALTCOV=8;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-427_333-424delCTTA||||||INFO_REALIGN_3_PRIME;GCF=0.2       GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.176471:0.36:0.147541:0
10      chr22   50798764        .       AAAG    A       .       PASS    AVGCOV=5;MINCOV=5;ALTCOV=22;ZYG=het;COVRATIO=0.19;CHI2=10.7;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.447_449delAGA||||||INFO_REALIGN_3_PRIME;GCF=0.538462  GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.607843:1.34783:0.645833:0
10      chr22   50798783        .       CT      C       .       PASS    AVGCOV=17;MINCOV=17;ALTCOV=22;ZYG=het;COVRATIO=0.44;CHI2=0.64;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.462delT||||||;GCF=0.568627  GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.607843:1.34783:0.645833:1
10      chr22   50799200        .       CAACTT  C       .       PASS    AVGCOV=9;MINCOV=9;ALTCOV=9;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.881_885delCTTAA||||||INFO_REALIGN_3_PRIME;GCF=0.216981     GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.196078:0.277778:0.16129:1
10      chr22   50799573        .       CAT     C       .       LowAltCnt       AVGCOV=3;MINCOV=3;ALTCOV=15;ZYG=het;COVRATIO=0.17;CHI2=8;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;A
brentp commented 5 years ago

I still don't know why/how this happened, but, if you weren't already, always run any script with pipes with the set -o pipefail option in bash to make sure you catch the actual error.

jjfarrell commented 5 years ago

There is no error triggered with that option.

I took a look at some of the code. I think it can be traced to this code.

https://github.com/brentp/hts-nim/blob/master/src/hts/vcf.nim

proc close*(v:VCF) =
  if v.fname != "-" and v.fname != "/dev/stdin" and v.hts != nil:
    if hts_close(v.hts) != 0:
        when defined(debug):
            stderr.write_line "[hts-nim] error closing vcf"
    v.hts = nil

Probably just needs a flushFile(stdout) if v.fname='-'. .

brentp commented 5 years ago

yes. that's it. nicely spotted! I think it should just hts_close("/dev/stdin"). I'll get this in the next release but you have a work-around for now. thanks for finding the solution.

brentp commented 5 years ago

here's a binary with that fixed. duphold.gz

brentp commented 5 years ago

this is fixed in latest release. thanks for reporting and testing.

cbrueffer commented 4 years ago

I see a similar/same issue with release 0.2.1; changing from piping to bgzip to using -o sample.vcf.gz fixed it for me as well. Happy to help debug this.

brentp commented 4 years ago

I don't see a way to fix this as it is explicitly closing the file and flushing in case of stdout. I did try adding an additional flush if you want to try the attached binary.

duphold_dev.gz

if this doesn't work, i'll probably add a check to ensure that stdout is not used.

cbrueffer commented 4 years ago

Thanks for the reply; I'm seeing the same problem with the binary you posted. What happens is that in the pipe case, the last VCF record is cut off. Original and dev binary produce the same VCF in the pipe case and the -o case respectively.

Last record with -o:

chrUn_GL000218v1    54466   gridss314fb_2462h   G   ]chrUn_GL000218v1:54465]GTGCACACGTGTGTG 120.52  LOW_QUAL;NO_ASSEMBLY    AS=0;ASC=1X;ASQ=0;ASRP=0;ASSR=0;BA=0;BANRP=0;BANRPQ=0;BANSR=0;BANSRQ=0;BAQ=0;BASRP=0;BASSR=0;BENAMES=NDX550213:6:HHHCCBGXF:1:21305:11427:10651;BMQ=47;BMQN=47;BMQX=47;BPNAMES=NDX550213:6:HHHCCBGXF:1:11201:6519:18766,NDX550213:6:HHHCCBGXF:2:12112:4327:15674,NDX550213:6:HHHCCBGXF:3:12407:2414:6309,NDX550213:6:HHHCCBGXF:4:23403:21325:6847;BQ=33.65;BSC=2;BSCQ=33.65;BUM=0;BUMQ=0;BVF=1;CAS=0;CASQ=0;CQ=120.52;EVENT=gridss314fb_2462;IC=4;IHOMPOS=0,0;IQ=120.52;MATEID=gridss314fb_2462o;MQ=32.25;MQN=24;MQX=40;RAS=0;RASQ=0;REF=19;REFPAIR=2;RP=0;RPQ=0;SB=0.666667;SC=1X111M;SR=0;SRQ=0;SVTYPE=BND;VF=4;GCF=0.39604    GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF:DHFC:DHFFC:DHBFC  .:0:0:0:0:0:0:0:0:0:0:33.65:2:33.65:0:0:1:0:4:120.52:120.52:0:19:2:0:0:0:0:4:1.11765:1.58333:1

Piped to bgzip:

chrUn_GL000218v1    54466   gridss314fb_2462h   G   ]chrUn_GL000218v1:54465]GTGCACACGTGTGTG 120.52  LOW_QUAL;NO_ASSEMBLY    AS=0;ASC=1X;ASQ=0;ASRP=0;ASSR=0;BA=0;BANRP=0;BANRPQ=0;BANSR=0;BANSRQ=0;BAQ=0;BASRP=0;BASSR=0;BENAMES=NDX550213:6:HHHCCBGXF:1:21305:11427:10651;BMQ=47;BMQN=47;BMQX=47;BPNAMES=NDX550213:6:HHHCCBGXF:1:11201:6519:18766,NDX550213:6:HHHCCBGXF:2:12112:4327:15674,NDX550213:6:HHHCCBGXF:3:12407:2414:6309,NDX550213:6:HHHCCBGXF:4:23403:21325:6847;BQ=33.65;BSC=2;BSCQ=33.65;BUM=0;BUMQ=0;BVF=1;CAS=0;CASQ=0;CQ=120.52;EVENT=gridss314fb_2462;IC=4;IHOMPOS=0,0;IQ=120.52;MATEID=gridss314fb_2462o;MQ=32.25;MQN=24;MQX=40;RAS=0;RASQ=0;REF=19;REFPAIR=2;RP=0;RPQ=0;SB=0.666667;SC=1X111M;SR=0;SRQ=0;SVTYPE=BND;VF=4;GCF=0.39604    GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RA

Piping into bgzip in the first place was a documentation issue for me; I didn't realize duphold could compress files directly before finding this PR, so perhaps this is what could be improved instead.

brentp commented 4 years ago

I am happy to accept a PR to improve docs.