Closed amizeranschi closed 6 years ago
Glad the CWL tests are now passing and thanks for the work debugging. The files you'd need for reproducing would be the gVCF input files. These are present (or at least linked to) in the inputs directory of different steps:
/cromwell-executions/main-testingVC-merged.cwl/20f07690-6792-4bf5-80af-52b937112b35/call-jointcall/shard-0/wf-jointcall.cwl/680304fd-104e-4ab4-ae35-a692ffd75652/call-run_jointvc/shard-7/inputs
The transactional directories get removed when tasks end so you don't expect those to have files. We'd need a couple of inputs gVCFs that are failing and then could run the gvcfgenotyper command outside of bcbio to try and reproduce. Thanks again for the help.
Thanks for clarifying the location of the input files. I've just sent them by e-mail. I hope they will be useful.
Thanks for passing these along; I appreciate the help debugging. Are these the input to one of the failing steps using gvcfgenotyper? The only input VCF there is not a gVCF but rather a combined and called VCF file with two samples, so I'm confused. I wouldn't expect you to have these if the gvcfgenotyper step failed earlier. Are these the input files from the path I referenced or a different one? Also, could you share your configuration for the actual run? Maybe I'm missing something in reproducing and understanding the issue. Thanks again for all the help.
Yes, those are the files from the directory you mentioned in your previous post (/home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged__failed-strelka2/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/20f07690-6792-4bf5-80af-52b937112b35/call-jointcall/shard-0/wf-jointcall.cwl/680304fd-104e-4ab4-ae35-a692ffd75652/call-run_jointvc/shard-7/inputs/
).
This is the YAML template:
# Template for whole genome Illumina variant calling with GATK pipeline
---
details:
- analysis: variant2
genome_build: sacCer3
algorithm:
aligner: bwa
mark_duplicates: true
recalibrate: false
realign: false
effects: false
tools_off: gemini
variantcaller: strelka2
jointcaller: strelka2-joint
svcaller: [manta]
ploidy: 1
variant_regions: ../config/variant_regions.bed
This is the CSV file:
samplename,description,batch
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample1-R1-L001.fastq.gz,sample1,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample1-R1-L002.fastq.gz,sample1,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample1-R2-L001.fastq.gz,sample1,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample1-R2-L002.fastq.gz,sample1,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample2-R1-L001.fastq.gz,sample2,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample2-R1-L002.fastq.gz,sample2,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample2-R2-L001.fastq.gz,sample2,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample2-R2-L002.fastq.gz,sample2,testingVC
And this is the YAML config file:
details:
- algorithm:
aligner: bwa
effects: false
jointcaller: strelka2-joint
mark_duplicates: true
ploidy: 1
realign: false
recalibrate: false
svcaller:
- manta
tools_off: gemini
variant_regions: ../config/variant_regions.bed
variantcaller: strelka2
analysis: variant2
description: sample1
files:
- /home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged/sample1_R1.fastq.gz
- /home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged/sample1_R2.fastq.gz
genome_build: sacCer3
metadata:
batch: testingVC
- algorithm:
aligner: bwa
effects: false
jointcaller: strelka2-joint
mark_duplicates: true
ploidy: 1
realign: false
recalibrate: false
svcaller:
- manta
tools_off: gemini
variant_regions: ../config/variant_regions.bed
variantcaller: strelka2
analysis: variant2
description: sample2
files:
- /home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged/sample2_R1.fastq.gz
- /home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged/sample2_R2.fastq.gz
genome_build: sacCer3
metadata:
batch: testingVC
fc_name: testingVC-merged
upload:
dir: ../final
I tried running the same test (Yeast data + CWL with Cromwell), but I've replaced Strelka2 with GATK's HaplotypeCaller. It's been running for almost 7 hours now and I'm not sure if it's still working or just stuck somehow.
The last lines of the output so far are:
# return exit code
exit $rc
[2018-11-30 12:00:56,75] [info] WorkflowExecutionActor-254691bd-cc5e-4c38-beeb-2f3f5f8b9a4f [254691bd]: Starting normalize_sv_coverage
[2018-11-30 12:00:57,23] [info] BackgroundConfigAsyncJobExecutionActor [30fc7c81variantcall_batch_region:0:1]: job id: 10055
[2018-11-30 12:00:57,24] [info] BackgroundConfigAsyncJobExecutionActor [30fc7c81variantcall_batch_region:0:1]: Status change from - to WaitingForReturnCodeFile
Is there a way to check if this is still running or just stuck?
For comparison, the same workflow (with Strelka2) took 20 minutes when running on a local Bcbio_nextgen copy, and around an hour when running with Bcbio_vm and Docker (no CWL). Strelka2 has run aprox. 2X faster than GATK's HaplotypeCaller for me, locally, in Bcbio_nextgen (tested a couple of weeks ago).
Is it expected for Bcbio_vm+Docker+CWL to run significantly slower than Bcbio_nextgen or the older Bcbio_vm+Docker workflow (now CWL)?
Thanks much for the example, this was a big help to identify the issue. If you add tools_on: [gvcf]
to your configuration, it should now do the right thing. In CWL you can drop the jointcaller
input approach and just add this for do gVCF based joint calling. This seemed like a cleaner approach as the dual variantcaller and jointcaller approach ended up being confusing.
However, the way you specified should work and not fail, so I've also pushed fixes to make that happen. Those should be in place by tomorrow but in the short term just removing jointcaller
and adding tools_on: [gvcf]
should proceed as you expect.
For the GATK issue, I'm not sure if this is a symptom of the same underlying problem or if something else is happening. You should be able to use top to check if there is anything running, which is the most straightforward way. In general, we don't expect the CWL approach to be significantly slower than the older approach so taking this much longer doesn't seem right. For smaller runs on a single machine CWL might be slower depending on the core allocation, but should be relatively similar.
Hope this all helps and gets your analysis finally finished. Thanks again for all the work and patience.
I've been checking htop
and I can see lines showing that stuff is happening (e.g. /usr/bin/containerd
or java lines running cromwell.jar
). I'm just surprised at the rather long runtime for this.
Just to check, should the jointcaller
option only be taken out for CWL, or is it still useful for Bcbio_nextgen and Bcbio_vm+Docker?
I'll test this tomorrow and report back.
Adding the option tools_on: [gvcf]
got Strelka2 to run fine for me, both with and without the jointcaller
option. Furthermore, the wall time was approx. the same as the one I had with Bcbio_nextgen, which is great news (around 20 minutes). GATK's HaplotypeCaller also ran fine (only tested with tools_on: gvcf
and also jointcaller: gatk-haplotype-joint
) and its wall time was around 30 minutes (50% slower than Strelka2).
Is there any good reason to still use HaplotypeCaller instead of Strelka2?
One thing I noticed was that the upload directory seems to have changed when using CWL, compared with what I've expected from Bcbio_nextgen. This is what the directory structure looks like for my test:
├── testingVC-merged
│ ├── config
│ ├── final
│ ├── work
│ │ ├── cromwell_work
│ │ │ ├── cromwell-executions
│ │ │ ├── cromwell-workflow-logs
│ │ │ ├── final
│ │ │ │ ├── project
│ │ │ │ ├── sample1
│ │ │ │ ├── sample2
│ │ │ ├── persist
│ │ ├── testingVC-merged-workflow
To run both Bcbio_nextgen and Bcbio_vm+CWL, I cd into testingVC-merged/work
and run either
bcbio_nextgen.py ../config/<config>.yaml -n <cores>
or
bcbio_vm.py cwlrun cromwell testingVC-merged-workflow
Bcbio_nextgen seems to set the upload directory to testingVC-merged/final
. However, Bcbio_vm+CWL copies files to testingVC-merged/work/cromwell_work/final
.
In addition, the name of the directory where the joint VCF file and multiqc results are copied differs between Bcbio_nextgen and Bcbio_vm+CWL. With the former, that directory was called something like <run_date>_<exp_name>
, e.g. 2018-11-12_testingVC-merged
in my example. Meanwhile, for Bcbio_vm+CWL, this is simply called project
.
I also noticed a file called testingVC-manta.vcf.gz
inside testingVC-merged/work/cromwell_work/final/project
. Is this meant to be a multi-sample a joint-VCF file for sv calls, similarly to what's done for small variants called by Strelka2? The file was empty for my test case, (as I never bothered to set up a realistic test case for large variants) and I only wanted to see if Manta actually ran successfully.
WIth Bcbio_nextgen, I expected Manta VCFs to be in the testingVC-merged/final/sample1
and testingVC-merged/final/sample1
directories. With Bcbio_vm+CWL, I see that all the VCF files are in testingVC-merged/work/cromwell_work/final/project
and only the BAM files and other metadata are in testingVC-merged/work/cromwell_work/final/sample1
and testingVC-merged/work/cromwell_work/final/sample2
.
What's the plan with the two workflows (Bcbio_nextgen and Bcbio_vm+CWL)? Are they going to diverge in terms of directory set-up and output file locations? Is Bcbio_vm+CWL going to replace Bcbio_nextgen in the future?
It's brilliant to hear this is now working for you. Thanks for all the patience getting it running and for the usage report. In terms of GATK HaplotypeCaller and strelka2, both are good choices for joint variant calling. They do have different intermediate gVCF formats so you might also make a decision based on other platforms or analyses you want to integrate with.
Thanks also for the detailed investigation of the output directory. It's expected to be non-identical for the CWL case. Replicating some of what we did previously was difficult since we get the results in a different manner from the CWL runner so can't allocate to sample and project directories in the exact same way. We were also hoping to clean up some of the historical cruft and confusion in the upload directory. Please let us know if anything is problematic and we can investigate.
For manta, you're right that is should be a merged VCF of the batched samples. It's not joint called, but rather called together within a group. Hope that works as expected when you're able to put together a realistic case for it.
Thanks again for all the feedback and work on getting this going.
OK, great. Thanks for all your help!
Hello,
After upgrading to the latest development version, the logs and joint VCF file generation don't seem to work properly anymore. Debug messages don't get printed anymore (neither on stdout, nor in the log file), and the
bcbio-nextgen-debug.log
file is pretty much identical withbcbio-nextgen.log
. The only difference is the resource requests messages which appear in the debug log:The multi-sample
<batch>-gatk-haplotype-joint-annotated.vcf.gz
did not get generated, even though the sample-specific VCF files are where they should be.Furthermore,
bcbio-nextgen-commands.log
is completely empty.To test all of this, I've run a simple variant calling job that worked flawlessly a few days ago, before upgrading Bcbio-nextgen.