freebayes stuck - Githubissues

pengxiao78 commented 7 years ago

Hi Brad, I found one sample in my batch stuck during the freebayes variant calling in the cancer somatic variant calling pipeline. I deleted this sample in the yaml file, the remaining pairs all finished the analyses successfully. Then I just ran this sample individually but still found the running just stuck at the same point. The paralleled nodes' status is running but the status of the log just stuck there forever. Previously, I canceled all the nodes and re-ran the job, the bcbio will also stuck at this sample at freebayes. This time, I increased java, gatk, freebayes max memory into 16g but the running still got stuck at freebayes. Can you help me with the debugging? Thanks!

Initial stuck: As follows is the last part of the log stuck there [2017-02-08T02:15Z] ica3: tabix index 200-17_31018570_81195210.vcf.gz [2017-02-08T02:15Z] ica3: GATK: VariantAnnotator [2017-02-08T02:15Z] ica3: INFO 20:15:06,822 HelpFormatter - --------------------------------------------------------------------------------- [2017-02-08T02:15Z] ica3: INFO 20:15:06,824 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 [2017-02-08T02:15Z] ica3: INFO 20:15:06,824 HelpFormatter - Copyright (c) 2010 The Broad Institute [2017-02-08T02:15Z] ica3: INFO 20:15:06,824 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk [2017-02-08T02:15Z] ica3: INFO 20:15:06,827 HelpFormatter - Program Args: -T VariantAnnotator -R /.../bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --variant /.../200/work/freebayes/17/200-17_31018570_81195210.vcf.gz --out /.../200/work/bcbiotx/tmpzk2fin/200-17_31018570_81195210-gatkann.vcf.gz -L /.../200/work/freebayes/17/200-17_31018570_81195210.vcf.gz --dbsnp /.../bcbio/genomes/Hsapiens/GRCh37/variation/dbsnp-147.vcf.gz -I /.../200/work/bamprep/200N/17/200N-sort-17_31018570_81195210-prep.bam -I /.../200/work/bamprep/200T/17/200T-sort-17_31018570_81195210-prep.bam -A BaseQualityRankSumTest -A FisherStrand -A GCContent -A HaplotypeScore -A HomopolymerRun -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth -A ReadPosRankSumTest -A RMSMappingQuality --allow_potentially_misencoded_quality_scores -U ALL --read_filter BadCigar --read_filter NotPrimaryAlignment [2017-02-08T02:15Z] ica3: INFO 20:15:06,829 HelpFormatter - Executing as peng.xiao@ica3 on Linux 3.10.0-327.36.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_121-mockbuild_2016_11_21_19_08-b00. [2017-02-08T02:15Z] ica3: INFO 20:15:06,829 HelpFormatter - Date/Time: 2017/02/07 20:15:06 [2017-02-08T02:15Z] ica3: INFO 20:15:06,830 HelpFormatter - --------------------------------------------------------------------------------- [2017-02-08T02:15Z] ica3: INFO 20:15:06,830 HelpFormatter - --------------------------------------------------------------------------------- [2017-02-08T02:15Z] ica3: INFO 20:15:06,901 GenomeAnalysisEngine - Strictness is SILENT [2017-02-08T02:15Z] ica3: INFO 20:15:07,086 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 [2017-02-08T02:15Z] ica3: INFO 20:15:07,095 SAMDataSource$SAMReaders - Initializing SAMRecords in serial [2017-02-08T02:15Z] ica3: INFO 20:15:07,186 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.09 [2017-02-08T02:15Z] ica3: INFO 20:15:08,708 IntervalUtils - Processing 38862 bp from intervals [2017-02-08T02:15Z] ica3: WARN 20:15:08,715 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation [2017-02-08T02:15Z] ica3: WARN 20:15:08,716 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation [2017-02-08T02:15Z] ica3: INFO 20:15:08,760 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files [2017-02-08T02:15Z] ica3: INFO 20:15:09,102 GenomeAnalysisEngine - Done preparing for traversal [2017-02-08T02:15Z] ica3: INFO 20:15:09,103 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] [2017-02-08T02:15Z] ica3: INFO 20:15:09,103 ProgressMeter - | processed | time | per 1M | | total | remaining [2017-02-08T02:15Z] ica3: INFO 20:15:09,103 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime [2017-02-08T02:15Z] ica3: INFO 20:15:09,147 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. [2017-02-08T02:15Z] ica3: WARN 20:15:09,415 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator [2017-02-08T02:15Z] ica3: INFO 20:15:39,112 ProgressMeter - 17:43838754 104244.0 30.0 s 4.8 m 26.9% 111.0 s 81.0 s [2017-02-08T02:16Z] ica3: INFO 20:16:09,120 ProgressMeter - 17:66074685 231803.0 60.0 s 4.3 m 64.4% 93.0 s 33.0 s [2017-02-08T02:16Z] ica3: INFO 20:16:39,124 ProgressMeter - 17:80043960 358535.0 90.0 s 4.2 m 96.3% 93.0 s 3.0 s [2017-02-08T02:16Z] ica3: INFO 20:16:43,071 VariantAnnotator - Processed 36441 loci. [2017-02-08T02:16Z] ica3: [2017-02-08T02:16Z] ica3: INFO 20:16:43,097 ProgressMeter - done 378924.0 93.0 s 4.1 m 100.0% 93.0 s 0.0 s [2017-02-08T02:16Z] ica3: INFO 20:16:43,097 ProgressMeter - Total runtime 93.99 secs, 1.57 min, 0.03 hours [2017-02-08T02:16Z] ica3: INFO 20:16:43,097 MicroScheduler - 199001 reads were filtered out during the traversal out of approximately 980809 total reads (20.29%) [2017-02-08T02:16Z] ica3: INFO 20:16:43,097 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter [2017-02-08T02:16Z] ica3: INFO 20:16:43,098 MicroScheduler - -> 198993 reads (20.29% of total) failing DuplicateReadFilter [2017-02-08T02:16Z] ica3: INFO 20:16:43,098 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter [2017-02-08T02:16Z] ica3: INFO 20:16:43,098 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter [2017-02-08T02:16Z] ica3: INFO 20:16:43,098 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter [2017-02-08T02:16Z] ica3: INFO 20:16:43,098 MicroScheduler - -> 8 reads (0.00% of total) failing UnmappedReadFilter [2017-02-08T02:16Z] ica3: INFO 20:16:48,904 GATKRunReport - Uploaded run statistics report to AWS S3

After increasing the memory into 16g, still got stuck as follows,

[2017-02-08T17:44Z] ica4: Aligning lane 2_2017-02-07_200_s-merged with bwa aligner [2017-02-08T17:44Z] ica1: ipython: delayed_bam_merge [2017-02-08T17:44Z] ica1: ipython: merge_split_alignments [2017-02-08T17:44Z] ica1: Timing: callable regions [2017-02-08T17:44Z] ica1: ipython: prep_samples [2017-02-08T17:44Z] ica1: ipython: postprocess_alignment [2017-02-08T17:44Z] ica2: 200N: Assigned coverage as 'regional' with 3.8% genome coverage and 0.0% offtarget coverage [2017-02-08T17:44Z] ica2: Recalibrating 200N with GATK [2017-02-08T17:44Z] ica3: 200T: Assigned coverage as 'regional' with 6.0% genome coverage and 0.0% offtarget coverage [2017-02-08T17:44Z] ica3: Recalibrating 200T with GATK [2017-02-08T17:44Z] ica1: ipython: combine_sample_regions [2017-02-08T17:44Z] ica4: Identified 164 parallel analysis blocks Block sizes: min: 4262 5%: 23301.9 25%: 157158.75 median: 31018063.0 75%: 31019240.75 95%: 32956143.35 99%: 62206026.13 max: 93056784 Between block sizes: min: 255 5%: 271.0 25%: 346.0 median: 734.0 75%: 1335.0 95%: 6141.0 99%: 10000000.0 max: 10000000

[2017-02-08T17:44Z] ica1: Timing: hla typing [2017-02-08T17:44Z] ica1: ipython: call_hla [2017-02-08T17:44Z] ica1: Resource requests: freebayes, gatk, mutect, picard, vardict, varscan; memory: 16.00, 16.00, 16.00, 16.00, 16.00; cores: 16, 1, 1, 16, 16, 16 [2017-02-08T17:44Z] ica1: Configuring 60 jobs to run, using 1 cores each with 16.00g of memory reserved for each job [2017-02-08T17:44Z] ica1: Timing: alignment post-processing [2017-02-08T17:44Z] ica1: ipython: piped_bamprep [2017-02-08T17:44Z] ica1: Timing: variant calling [2017-02-08T17:44Z] ica1: ipython: variantcall_sample [2017-02-08T17:44Z] ica4: Genotyping paired variants with FreeBayes

chapmanb commented 7 years ago

Thanks for the report and sorry about the issue. My guess is that FreeBayes is taking a long time on a single region/sample so what you're seeing is that job being launched and taking a long time to run. If you ssh into ica4 and run top, is there are FreeBayes process running on the machine?

If so, I'd guess there's something about this sample/region that's causing it to be very slow. It's hard to debug this remotely but your best bet would be to look at log/bcbio-nextgen-commands.log to see what region is running and then determine if you can isolate to a smaller region that shows the problem. Hope this helps some.

pengxiao78 commented 7 years ago

Hi Brad, When I ssh ica4 and ran top, I did not find freebayes process running on the machine. I had four variant callers mutect, freebayes, vardict, and varscan and used ensemble numpass 2. In the debugging run, I deleted freebayes and kept the other three and found the pipeline finished with no error. Is this the problem of freebayes although the problem just happened on this sample? Thanks!

chapmanb commented 7 years ago

Thanks for the additional debugging. It sounds like there is something problematic with freebayes and this sample, but we'd need to isolate the failing command to help debug further. If you're able to keep investigating the next step would be to find the last freebayes command run that hangs/dies and try to run it outside of bcbio to see if we can get any other useful indication about what is going on. Then we could try to come up with a reproducible example and explore upstream with the FreeBayes team. Thanks much.

bcbio / bcbio-nextgen

freebayes stuck #1809