Closed gabeng closed 6 years ago
I am currently playing around with bcbio-1.1.1a from yesterday and noticed that there are no log messages from GATK4 in the log files, neither in the bcbio log nor in the commands log. GATK4 is being executed, though.
Ben;
Thanks for the report. This is definitely unexpected and I'm not able to replicate it with the standard bcbio tests that use GATK4. Are you finding that all other commands get logged but the gatk commands do not? At this point I'm not sure how to debug or provide suggestions; we don't do anything special with GATK commands in terms of logging and reporting. Maybe some additional information about what you're seeing would help identify the underlying issue.
Brad, I went back to bcbio 1.1.0 stable to be sure that this is not a temporary glitch.
I am used to the following debug output from GATK4 in bcbio-nextgen-debug.log (this is from bcbio 1.0.9 with gatk 4.0.3.0):
[2018-10-16T23:44Z] multiprocessing: calculate_sv_bins
[2018-10-16T23:44Z] multiprocessing: calculate_sv_coverage
[2018-10-16T23:44Z] multiprocessing: normalize_sv_coverage
[2018-10-16T23:44Z] Timing: hla typing
[2018-10-16T23:44Z] Resource requests: gatk, gatk-haplotype, picard; memory: 3.50, 3.00, 3.00; cores: 1, 32, 32
[2018-10-16T23:44Z] Configuring 64 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2018-10-16T23:44Z] Timing: alignment post-processing
[2018-10-16T23:44Z] multiprocessing: piped_bamprep
[2018-10-16T23:44Z] Timing: variant calling
[2018-10-16T23:44Z] multiprocessing: variantcall_sample
...
[2018-10-16T23:44Z] GATK: HaplotypeCaller
[2018-10-16T23:44Z] Using GATK jar /usr/local/share/bcbio-nextgen/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar
[2018-10-16T23:44Z] Running:
[2018-10-16T23:44Z] java -Dsamjdk.use_async_io_read_samtools=false [...]
Lots of log messages from GATK4 then typically follow for each call.
With bcbio 1.1.0 stable and gatk 4.0.7.0, however, no log messages from GATK4 appear in bcbio-nextgen-debug.log:
[2018-10-17T14:11Z] multiprocessing: calculate_sv_bins
[2018-10-17T14:11Z] multiprocessing: calculate_sv_coverage
[2018-10-17T14:11Z] multiprocessing: normalize_sv_coverage
[2018-10-17T14:11Z] Timing: hla typing
[2018-10-17T14:11Z] Resource requests: gatk, gatk-haplotype, picard; memory: 3.50, 3.00, 3.00; cores: 1, 16, 16
[2018-10-17T14:11Z] Configuring 16 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2018-10-17T14:11Z] Timing: alignment post-processing
[2018-10-17T14:11Z] multiprocessing: piped_bamprep
[2018-10-17T14:11Z] Timing: variant calling
[2018-10-17T14:11Z] multiprocessing: variantcall_sample
[2018-10-17T14:15Z] multiprocessing: concat_variant_files
[2018-10-17T14:15Z] Resource requests: bcbio_variation, fastqc, gatk, gemini, kraken, preseq, qsignature, sambamba, samtools; memory: 3.00, 3.00, 3.50, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00; cores: 16, 16, 1, 16, 16, 16, 16, 16, 16
[2018-10-17T14:15Z] Configuring 1 jobs to run, using 16 cores each with 56.1g of memory reserved for each job
[2018-10-17T14:15Z] Timing: joint squaring off/backfilling
There are no log messages at all from GATK4, which is a little strange. There is no entry of GATK4 being executed in bcbio-nextgen-commands.log either. Output files (including the VCF) are generated correctly, though.
This is my config file:
---
details:
- analysis: variant2
  genome_build: GRCh37
  algorithm:
    aligner: bwa
    coverage_interval: regional
    mark_duplicates: true
    recalibrate: false
    realign: false
    align_split_size: false
    effects: false
    variantcaller: [gatk-haplotype]
    tools_on: [bwa-mem, coverage_perbase]
    tools_off: [gemini]
    mixup_check: qsignature_full
    save_diskspace: true
Ben; Sorry about the continued issues. Is it possible that your recent tests already have pre-existing files in the gatk-haplotype directory from an earlier run? I have no idea why only the GATK logs would go away, but bcbio would skip calling the commands, and thus skip both the command and debug logs, if the output files were already present.
If it's a clean/new directory and you're seeing the commands run when you watch top but not logged, I'm at a total loss as to how that would change. Hope the pre-existing checkpoint files idea might resolve what's going on.
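To make what I mean concrete, here is a rough sketch of the skip-if-present behaviour (illustrative names only, not bcbio's actual helpers): when the expected output file already exists, the command is never issued, so nothing reaches either log even though the run finishes with the right outputs.
# Illustration only -- not bcbio code. Shows why pre-existing outputs would
# remove entries from both bcbio-nextgen-commands.log and the debug log:
# the command is simply never run, so there is nothing to write.
import os
import subprocess

def run_if_missing(cmd, out_file):
    """Run cmd only when out_file is absent or empty."""
    if os.path.exists(out_file) and os.path.getsize(out_file) > 0:
        return out_file                      # skipped: no log entry, no subprocess
    subprocess.check_call(cmd, shell=True)   # in bcbio, the command is logged right before this point
    return out_file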
Brad, unfortunately, no. I am always testing entirely fresh runs starting from pulling the docker image etc. Is there any additional debug information that I could provide that may be helpful in getting to the bottom of this?
Ben;
Sorry about not having a clue on this. I'm not sure why I can't reproduce this or what in your environment causes just GATK to fail logging. I guess the next best step would be to try the bcbio tests to see if they also show the problem. If you clone the main repo, change into the test suite (https://github.com/bcbio/bcbio-nextgen/tree/master/tests) and run ./run_tests.sh devel, does that small example run have GATK logging?
Your latest update also mentions Docker, so maybe there is something in the way that you're running bcbio through the container that inadvertently triggers something with the logging and GATK. Could you provide more details about how you're running the analysis?
Thanks again for the help debugging.
Brad, we've been digging a little bit. The issue must have been created between these two versions: https://github.com/bcbio/bcbio-nextgen/compare/a26f5f9..v1.1.0 As I said before, we are using bcbio on docker on a single machine.
As I understand it, Spark is used for gatk4 by default: https://github.com/bcbio/bcbio-nextgen/blob/612d9a56d18d523ef3084c84a3c4eebee4308229/bcbio/variation/gatk.py#L103
The parameters for Spark have changed, so maybe this is related: https://github.com/bcbio/bcbio-nextgen/compare/a26f5f9..v1.1.0#diff-937a9675ab52f30313637c82de3e0507R123
Logging of the gatk4 command happens here, as far as I can see: https://github.com/bcbio/bcbio-nextgen/blob/e6cc8c0db2cf0da1a940fa6d9c8a0333fdacec3f/bcbio/provenance/do.py#L22 If we print the command to stdout at that point, we see the correct log message, so the command string does reach the logging call; the logging itself does not work, though.
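To illustrate the point (a rough sketch of my reading of do.py, not the actual bcbio code): the command string is handed to a logbook logger right before the subprocess is started, so if that logger's records never reach the main process, the command vanishes from bcbio-nextgen-commands.log even though the command itself still runs.
# Rough sketch only; logger names here are made up for illustration.
import subprocess
import logbook

cmd_log = logbook.Logger("bcbio-commands")
debug_log = logbook.Logger("bcbio-debug")

def run(cmd, descr=None):
    if descr:
        debug_log.info(descr)                   # e.g. "GATK: HaplotypeCaller"
    cmd_str = cmd if isinstance(cmd, str) else " ".join(cmd)
    cmd_log.debug(cmd_str)                      # this is the record that goes missing for us
    subprocess.check_call(cmd_str, shell=True)  # the command itself still runs fine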
Ben; Thanks for the debugging work. You're right on with all your exploration; that is where logging happens. However, we haven't had any changes to that framework over the commit range you reference, so I'm not seeing what could trigger the issue. We do use GATK Spark, but this shouldn't affect writing the commands run to the logs, so maybe it's worth first focusing on why logging fails only for GATK commands while all other run commands get logged.
How are you running bcbio inside of Docker? Are you using bcbio_vm.py, CWL, or directly running from inside the container in a custom way? Maybe we changed something in the Docker container or version of Docker, or how we're wrapping the runs?
Sorry to not have a better clue here but hope this helps some for exploring.
Brad, We are running bcbio_nextgen.py inside the docker container. No bcbio_vm.py or CWL.
Ben; Thanks for this. A couple of follow-up questions to help me try and reproduce:
I'm trying to get at a minimal test case so I can try and see the problem on our runs and dig more into it. Thanks again for the help debugging.
Brad,
I checked with the latest bcbio docker image: if you start the bcbio/bcbio-vc:latest image interactively (shell as entrypoint) and create a germline variant calling job, GATK4 log messages are missing here, too.
Note that we are not using bcbio_vm.py to orchestrate docker containers, but rather using bcbio inside docker as a standalone. This seems to be the difference. Not sure if you intend to support this use case.
I have not been able to get bcbio_vm.py to start yet. I would like to avoid downloading the entire data installation again, but bcbio_vm.py does not seem to be fine with my symlinks to existing data installations.
I was able to narrow it down a little further: https://github.com/bcbio/bcbio-nextgen/compare/3ddc267..a347377
Another note: log entries reappear when using single-core (-n 1). As soon as I switch to multi-core with -n 4, log messages are no longer written to stdout and the command is not written to the command log. Are tests performed with single-core settings?
Brad,
I have now found the problem: it is caused by switching to a newer version of joblib, which has changed its default parallel execution backend from multiprocessing to loky from version 0.12 onwards:
This backend relies on a robust implementation of concurrent.futures.ProcessPoolExecutor with spawned processes that can be reused across the Parallel calls.
(from the 0.12 release notes, p. 55 of the current docs)
Unfortunately, this new backend does not play well with the way the logbook log handlers currently use a queue for logging in a multi-core setting.
Of the other available joblib backends, only multiprocessing and sequential seem to work properly, and sequential is used automatically for single-core runs (this is why some commands have been logged properly, but not the BWA or GATK commands once bcbio is configured to use multiple cores, e.g. via bcbio_nextgen.py config/bcbio.yaml -n 4).
Neither threading nor the new default, loky, works for the logging.
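Here is a minimal, self-contained sketch of the effect, using the stdlib logging module instead of logbook purely for illustration: fork-based multiprocessing workers (the default on Linux) inherit the logging configuration of the parent process, while loky spawns fresh workers that do not inherit it, so their log records are dropped.
# Illustration only: stdlib logging instead of logbook, no bcbio code involved.
import logging
import joblib

logging.basicConfig(level=logging.INFO, format="%(processName)s %(message)s")

def work(i):
    # Emitted under backend="multiprocessing" (forked children inherit the
    # parent's logging configuration); dropped under the default "loky"
    # backend, whose freshly spawned workers do not inherit it.
    logging.getLogger("demo").info("working on %s", i)
    return i

if __name__ == "__main__":
    joblib.Parallel(n_jobs=2, backend="multiprocessing")(
        joblib.delayed(work)(i) for i in range(4))  # log lines appear
    joblib.Parallel(n_jobs=2)(
        joblib.delayed(work)(i) for i in range(4))  # loky default: log lines missing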
It seems the easiest way to fix this for now would be to manually select the old default backend when calling joblib.Parallel, which is supported:
“multiprocessing”: previous process-based backend based on multiprocessing.Pool. Less robust than loky.
This would mean changing this line to:
for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(x) for x in items):
I have tested this and the logging now seems to work as before.
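For completeness, the same pin could also be expressed with joblib's parallel_backend context manager rather than the backend= argument; this is just an equivalent way of writing the same idea, not what I am proposing for the PR:
# Equivalent alternative to passing backend= at the call site (illustration only).
import joblib

def run_parallel(fn, items, num_jobs):
    with joblib.parallel_backend("multiprocessing"):
        return joblib.Parallel(num_jobs, batch_size=1)(
            joblib.delayed(fn)(x) for x in items)
Either form keeps the fork-based workers, and with them the existing logbook queue handlers, working.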
Should I open a PR for this?
Ben;
Thanks so much for this detective work. Wow. I tried to replicate the issue you were seeing with the latest joblib (0.13) and the simple bcbio test we've been using (with -n 2) and don't see the issue, at least running outside of the docker container. Can you replicate it both inside and outside of the container, or only inside? Practically, I'd be happy to merge a PR for this if it fixes it for you, as that's our main concern. Thanks again for all this incredible work debugging the issue.
Brad, we were able to replicate this outside of docker. We will put together a simple germline calling job to reproduce it.
Ben; Thanks for this. If using multiprocessing fixes the issues in your hands I'm happy to accept a PR, roll that fix in and move forward. I don't want you to spend more time having to dig into it. I was mostly curious as I don't understand why I can't reproduce, but pinning to the recent versions of joblib and defaulting back to multiprocessing works totally fine. Thanks again.
Hi Brad,
Thank you for your help. I've just opened #2569 with our fix. It's really strange that you can't reproduce this locally. We've checked the joblib issues regarding logging, but these refer to other backends and don't seem really related. Maybe we can simply revert this once the loky backend is more mature.
Roland and Ben; Thank you again for all the incredible work tracking this down and the fix. I'm really happy we managed to get this one resolved, and hopefully runs and logging are cleaner now for you going forward.
Hi Brad, Thank you so much for your help in tracking this down, and for merging this!