Open jjfarrell opened 5 years ago
did those exit with 0? do you see: "[duphold] finished" in the stderr of those jobs? did you allocate enough time, memory for the job to finish on the nodes?
make sure you're using the latest duphold. I don't recall any changes relevant here, but it's always good to have the most recent.
I am getting a [duphold] finished at the end of the run. I gave it plenty of time. The most recent version (version: 0.1.4) is being used.
Below is the job output. The tabix command runs after the duphold program finishes which gets the error seen.
This are the two command lines I am using:
duphold -f /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -v adsp5k.scalpel.genes.indel.norm.ann.vcf.gz -d -b $CRAM|bgzip -c > output/$SAMPLE.vcf.gz
tabix -p vcf output/$SAMPLE.vcf.gz
The output is:
more duphold.sh.o8967855
[duphold] finished
[E::hts_idx_push] Unsorted positions on sequence #22: 50797069 followed by 5079748
tbx_index_build failed: output/A-WCAP-WC000752-BL-COL-51868BL1.vcf.gz
duphold works fine for the 4600 other crams.
is the error intermittent? or does it occur every time for these same cram files?
I had rerun them and had the same issue. I will try increasing the memory available on the nodes and see it that helps.
I doubt it's the memory, but I can't think of anything else. can you share the vcf+bam for one of the failing samples?
and you are running all samples with -v adsp5k.scalpel.genes.indel.norm.ann.vcf.gz
?
Yes, the same script is being used with the same vcf containing all the variants in the 5k samples.
I can not share these files but I will see if I can create a smaller test cram and vcf that recreates the issue.
I reran one on a node with plenty of memory writing it directly to a vcf. Still an error but not exactly at the same spot but close. So as you suggested not a memory issue.
tail output/test,vcf|awk '{print NF "\t" $0}'
10 chr22 50794791 . CT C . LowAltCnt AVGCOV=3;MINCOV=3;ALTCOV=21;ZYG=het;COVRATIO=0.12;CHI2=13.5;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3863delT||||||;GCF=0.431373 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:1.03922:1.2619:1.01923:3
10 chr22 50794884 . AC A . PASS AVGCOV=10;MINCOV=10;ALTCOV=22;ZYG=het;COVRATIO=0.31;CHI2=4.5;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3770delC||||||;GCF=0.558824 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:1.09804:1.30233:1.16667:3
10 chr22 50795110 . AT A . LowAltCnt;HighChi2score AVGCOV=3;MINCOV=3;ALTCOV=32;ZYG=het;COVRATIO=0.09;CHI2=24.03;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3544delT||||||;GCF=0.45098 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.745098:0.863636:0.730769:3
10 chr22 50795909 . AT A . PASS AVGCOV=17;MINCOV=17;ALTCOV=21;ZYG=het;COVRATIO=0.45;CHI2=0.42;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2741delT||||||INFO_REALIGN_3_PRIME;GCF=0.254902 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:1.01961:1.33333:0.83871:0
10 chr22 50795925 . AAAGT A . PASS AVGCOV=19;MINCOV=19;ALTCOV=27;ZYG=het;COVRATIO=0.41;CHI2=1.39;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2726_333-2723delTAAG||||||INFO_REALIGN_3_PRIME;GCF=0.247619 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:1.01961:1.33333:0.83871:0
10 chr22 50796118 . A ACATACTAT . PASS AVGCOV=20;MINCOV=20;ALTCOV=22;ZYG=het;COVRATIO=0.48;CHI2=0.1;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=ACATACTAT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2534_333-2527dupTACTATCA||||||INFO_REALIGN_3_PRIME;GCF=0.336634 GT:AD:DP:DHFC:DHFFC:DHBFC 0/0:.:.:0.843137:1.16216:0.741379
10 chr22 50796247 . GA G . PASS AVGCOV=10;MINCOV=10;ALTCOV=13;ZYG=het;COVRATIO=0.43;CHI2=0.39;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2406delA||||||INFO_REALIGN_3_PRIME;GCF=0.470588 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:1.01961:1.44444:1:0
10 chr22 50796312 . GT G . PASS AVGCOV=3;MINCOV=3;ALTCOV=11;ZYG=het;COVRATIO=0.21;CHI2=4.57;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2342delT||||||;GCF=0.460784 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.607843:0.837838:0.596154:0
10 chr22 50796312 . GTC G . PASS AVGCOV=6;MINCOV=6;ALTCOV=16;ZYG=het;COVRATIO=0.27;CHI2=4.55;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2342_333-2341delTC||||||;GCF=0.456311 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.607843:0.837838:0.596154:0
8 chr22 50796312 . GTCCC G . LowAltCnt AVGCOV=4;MINCOV=4;ALTCOV=11;ZYG=het;COVRATIO=0.27;CHI2=3.27;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00
instead of piping to bgzip, can you instead just use -o $sample.vcf.gz
? that will be faster anyway and rule out piping issues.
That fixed it on the test sample! Must be some issue with closing the pipe. Thanks for pin pointing the issue so quickly.
I check a few of the samples that were not getting an error with tabix. Not all were getting the last few variants. They probably just happened to end with an EOL. So I will re-run it on all the samples.
tabix output/ADNI_002_S_4229.vcf.gz chr22:50799200|tail|awk '{print NF "\t" $0}'
10 chr22 50799200 . CAACTT C . PASS AVGCOV=9;MINCOV=9;ALTCOV=9;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.881_885delCTTAA||||||INFO_REALIGN_3_PRIME;GCF=0.216981 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.196078:0.277778:0.16129:1
10 chr22 50799573 . CAT C . LowAltCnt AVGCOV=3;MINCOV=3;ALTCOV=15;ZYG=het;COVRATIO=0.17;CHI2=8;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.1253_1254delTA||||||INFO_REALIGN_3_PRIME;GCF=0.291262 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.627451:0.842105:0.52459:0
[farrell@scc-hadoop duphold]$ zcat output/ADNI_002_S_4229.vcf.gz|tail|awk '{print NF "\t" $0}'
10 chr22 50797069 rs201907533 GA G . PASS AVGCOV=7;MINCOV=7;ALTCOV=0;ZYG=hom;COVRATIO=1;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1579delA||||||INFO_REALIGN_3_PRIME;CAF=[0.9876,0.0124];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201907533;RSPOS=51235498;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137;GCF=0.254902 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.431373:0.814815:0.354839:0
10 chr22 50797489 . CAG C . PASS AVGCOV=8;MINCOV=8;ALTCOV=16;ZYG=het;COVRATIO=0.33;CHI2=2.67;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1165_333-1164delAG||||||;GCF=0.378641 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.529412:1.17391:0.490909:2
10 chr22 50797585 rs200507571 A AT . PASS AVGCOV=12;MINCOV=12;ALTCOV=15;ZYG=het;COVRATIO=0.44;CHI2=0.33;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=AT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1064dupT||||||INFO_REALIGN_3_PRIME;CAF=[0.8962,0.1038];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;NOC;RS=200507571;RSPOS=51236013;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000210;WGT=1;dbSNPBuildID=137;GCF=0.306931 GT:AD:DP:DHFC:DHFFC:DHBFC 0/0:.:.:0.627451:1.3913:0.52459
10 chr22 50797605 . C CATTT . LowAltCnt;HighChi2score AVGCOV=3;MINCOV=3;ALTCOV=83;ZYG=het;COVRATIO=0.03;CHI2=74.42;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=CATTT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1032_333-1029dupTTTA||||||INFO_REALIGN_3_PRIME;GCF=0.366337 GT:AD:DP:DHFC:DHFFC:DHBFC 0/0:.:.:0.627451:1.3913:0.551724
10 chr22 50797606 . ATTTAT A . PASS AVGCOV=16;MINCOV=16;ALTCOV=20;ZYG=het;COVRATIO=0.44;CHI2=0.44;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1046_333-1042delTATTT||||||INFO_REALIGN_3_PRIME;GCF=0.377358 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.627451:1.3913:0.581818:2
10 chr22 50798225 . ATACT A . PASS AVGCOV=8;MINCOV=8;ALTCOV=8;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-427_333-424delCTTA||||||INFO_REALIGN_3_PRIME;GCF=0.2 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.176471:0.36:0.147541:0
10 chr22 50798764 . AAAG A . PASS AVGCOV=5;MINCOV=5;ALTCOV=22;ZYG=het;COVRATIO=0.19;CHI2=10.7;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.447_449delAGA||||||INFO_REALIGN_3_PRIME;GCF=0.538462 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.607843:1.34783:0.645833:0
10 chr22 50798783 . CT C . PASS AVGCOV=17;MINCOV=17;ALTCOV=22;ZYG=het;COVRATIO=0.44;CHI2=0.64;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.462delT||||||;GCF=0.568627 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.607843:1.34783:0.645833:1
10 chr22 50799200 . CAACTT C . PASS AVGCOV=9;MINCOV=9;ALTCOV=9;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.881_885delCTTAA||||||INFO_REALIGN_3_PRIME;GCF=0.216981 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP 0/0:.:.:0.196078:0.277778:0.16129:1
10 chr22 50799573 . CAT C . LowAltCnt AVGCOV=3;MINCOV=3;ALTCOV=15;ZYG=het;COVRATIO=0.17;CHI2=8;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;A
I still don't know why/how this happened, but, if you weren't already, always run any script with pipes with the set -o pipefail
option in bash to make sure you catch the actual error.
There is no error triggered with that option.
I took a look at some of the code. I think it can be traced to this code.
https://github.com/brentp/hts-nim/blob/master/src/hts/vcf.nim
proc close*(v:VCF) =
if v.fname != "-" and v.fname != "/dev/stdin" and v.hts != nil:
if hts_close(v.hts) != 0:
when defined(debug):
stderr.write_line "[hts-nim] error closing vcf"
v.hts = nil
Probably just needs a flushFile(stdout) if v.fname='-'. .
yes. that's it. nicely spotted! I think it should just hts_close("/dev/stdin")
. I'll get this in the next release but you have a work-around for now. thanks for finding the solution.
here's a binary with that fixed. duphold.gz
this is fixed in latest release. thanks for reporting and testing.
I see a similar/same issue with release 0.2.1; changing from piping to bgzip to using -o sample.vcf.gz
fixed it for me as well. Happy to help debug this.
I don't see a way to fix this as it is explicitly closing the file and flushing in case of stdout. I did try adding an additional flush if you want to try the attached binary.
if this doesn't work, i'll probably add a check to ensure that stdout is not used.
Thanks for the reply; I'm seeing the same problem with the binary you posted. What happens is that in the pipe case, the last VCF record is cut off. Original and dev binary produce the same VCF in the pipe case and the -o
case respectively.
Last record with -o
:
chrUn_GL000218v1 54466 gridss314fb_2462h G ]chrUn_GL000218v1:54465]GTGCACACGTGTGTG 120.52 LOW_QUAL;NO_ASSEMBLY AS=0;ASC=1X;ASQ=0;ASRP=0;ASSR=0;BA=0;BANRP=0;BANRPQ=0;BANSR=0;BANSRQ=0;BAQ=0;BASRP=0;BASSR=0;BENAMES=NDX550213:6:HHHCCBGXF:1:21305:11427:10651;BMQ=47;BMQN=47;BMQX=47;BPNAMES=NDX550213:6:HHHCCBGXF:1:11201:6519:18766,NDX550213:6:HHHCCBGXF:2:12112:4327:15674,NDX550213:6:HHHCCBGXF:3:12407:2414:6309,NDX550213:6:HHHCCBGXF:4:23403:21325:6847;BQ=33.65;BSC=2;BSCQ=33.65;BUM=0;BUMQ=0;BVF=1;CAS=0;CASQ=0;CQ=120.52;EVENT=gridss314fb_2462;IC=4;IHOMPOS=0,0;IQ=120.52;MATEID=gridss314fb_2462o;MQ=32.25;MQN=24;MQX=40;RAS=0;RASQ=0;REF=19;REFPAIR=2;RP=0;RPQ=0;SB=0.666667;SC=1X111M;SR=0;SRQ=0;SVTYPE=BND;VF=4;GCF=0.39604 GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF:DHFC:DHFFC:DHBFC .:0:0:0:0:0:0:0:0:0:0:33.65:2:33.65:0:0:1:0:4:120.52:120.52:0:19:2:0:0:0:0:4:1.11765:1.58333:1
Piped to bgzip:
chrUn_GL000218v1 54466 gridss314fb_2462h G ]chrUn_GL000218v1:54465]GTGCACACGTGTGTG 120.52 LOW_QUAL;NO_ASSEMBLY AS=0;ASC=1X;ASQ=0;ASRP=0;ASSR=0;BA=0;BANRP=0;BANRPQ=0;BANSR=0;BANSRQ=0;BAQ=0;BASRP=0;BASSR=0;BENAMES=NDX550213:6:HHHCCBGXF:1:21305:11427:10651;BMQ=47;BMQN=47;BMQX=47;BPNAMES=NDX550213:6:HHHCCBGXF:1:11201:6519:18766,NDX550213:6:HHHCCBGXF:2:12112:4327:15674,NDX550213:6:HHHCCBGXF:3:12407:2414:6309,NDX550213:6:HHHCCBGXF:4:23403:21325:6847;BQ=33.65;BSC=2;BSCQ=33.65;BUM=0;BUMQ=0;BVF=1;CAS=0;CASQ=0;CQ=120.52;EVENT=gridss314fb_2462;IC=4;IHOMPOS=0,0;IQ=120.52;MATEID=gridss314fb_2462o;MQ=32.25;MQN=24;MQX=40;RAS=0;RASQ=0;REF=19;REFPAIR=2;RP=0;RPQ=0;SB=0.666667;SC=1X111M;SR=0;SRQ=0;SVTYPE=BND;VF=4;GCF=0.39604 GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RA
Piping into bgzip in the first place was a documentation issue for me; I didn't realize duphold could compress files directly before finding this PR, so perhaps this is what could be improved instead.
I am happy to accept a PR to improve docs.
I am running duphold on about 5000 crams. About 120 or so are finishing but the output vcf is not complete. It ends prematurely near the end of chromosome 22. There are only a few lines left to process of the imput vcf.
The last full vcf line output varies but all near the end. 1 50796757 24 50797069 24 50797489 7 50797585 7 50797605 4 50797606 11 50798225 14 50798764 25 50798783 20 50799200
here is an example of the end of one file...
Any suggestions?