Closed gabeng closed 6 years ago
I am currently playing around with bcbio-1.1.1a from yesterday and noticed that there are no log messages from GATK4 in the log files, neither in the bcbio log nor in the commands log. GATK4 is being executed, though.
Ben;
Thanks for the report. This is definitely unexpected and I'm not able to replicate it with the standard bcbio tests that use GATK4. Are you finding that all other commands get logged but the gatk commands do not? At this point I'm not sure how to debug or provide suggestions; we don't do anything special with GATK commands in terms of logging and reporting. Maybe some additional information about what you're seeing would help identify the underlying issue.
Brad, I went back to bcbio 1.1.0 stable to be sure that this is not a temporary glitch.
I am used to the following debug output from GATK4 in bcbio-nextgen-debug.log (this is from bcbio 1.0.9 with gatk 4.0.3.0):
[2018-10-16T23:44Z] multiprocessing: calculate_sv_bins
[2018-10-16T23:44Z] multiprocessing: calculate_sv_coverage
[2018-10-16T23:44Z] multiprocessing: normalize_sv_coverage
[2018-10-16T23:44Z] Timing: hla typing
[2018-10-16T23:44Z] Resource requests: gatk, gatk-haplotype, picard; memory: 3.50, 3.00, 3.00; cores: 1, 32, 32
[2018-10-16T23:44Z] Configuring 64 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2018-10-16T23:44Z] Timing: alignment post-processing
[2018-10-16T23:44Z] multiprocessing: piped_bamprep
[2018-10-16T23:44Z] Timing: variant calling
[2018-10-16T23:44Z] multiprocessing: variantcall_sample
...
[2018-10-16T23:44Z] GATK: HaplotypeCaller
[2018-10-16T23:44Z] Using GATK jar /usr/local/share/bcbio-nextgen/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar
[2018-10-16T23:44Z] Running:
[2018-10-16T23:44Z] java -Dsamjdk.use_async_io_read_samtools=false [...]
Lots of log messages from GATK4 then typically follow for each call.
With bcbio 1.1.0 stable and gatk 4.0.7.0, however, no log messages from GATK4 appear in bcbio-nextgen-debug.log:
[2018-10-17T14:11Z] multiprocessing: calculate_sv_bins
[2018-10-17T14:11Z] multiprocessing: calculate_sv_coverage
[2018-10-17T14:11Z] multiprocessing: normalize_sv_coverage
[2018-10-17T14:11Z] Timing: hla typing
[2018-10-17T14:11Z] Resource requests: gatk, gatk-haplotype, picard; memory: 3.50, 3.00, 3.00; cores: 1, 16, 16
[2018-10-17T14:11Z] Configuring 16 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2018-10-17T14:11Z] Timing: alignment post-processing
[2018-10-17T14:11Z] multiprocessing: piped_bamprep
[2018-10-17T14:11Z] Timing: variant calling
[2018-10-17T14:11Z] multiprocessing: variantcall_sample
[2018-10-17T14:15Z] multiprocessing: concat_variant_files
[2018-10-17T14:15Z] Resource requests: bcbio_variation, fastqc, gatk, gemini, kraken, preseq, qsignature, sambamba, samtools; memory: 3.00, 3.00, 3.50, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00; cores: 16, 16, 1, 16, 16, 16, 16, 16, 16
[2018-10-17T14:15Z] Configuring 1 jobs to run, using 16 cores each with 56.1g of memory reserved for each job
[2018-10-17T14:15Z] Timing: joint squaring off/backfilling
There are no log messages at all from GATK4, which is a little strange. There is no entry of GATK4 being executed in bcbio-nextgen-commands.log either. Output files (including the VCF) are generated correctly, though.
This is my config file:
---
details:
- analysis: variant2
  genome_build: GRCh37
  algorithm:
    aligner: bwa
    coverage_interval: regional
    mark_duplicates: true
    recalibrate: false
    realign: false
    align_split_size: false
    effects: false
    variantcaller: [gatk-haplotype]
    tools_on: [bwa-mem, coverage_perbase]
    tools_off: [gemini]
    mixup_check: qsignature_full
    save_diskspace: true
Ben; Sorry about the continued issues. Is it possible that your recent tests already have pre-existing files in the gatk-haplotype directory from an earlier run? I have no idea why only the GATK logs would go away, but bcbio would skip calling the commands, and thus skip both the command and debug logs, if the output files were already present.
If it's a clean/new directory and you're seeing the commands run when you watch top but not logged, I'm at a total loss as to how that would change. Hope the pre-existing checkpoint files idea might resolve what's going on.
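To make what I mean concrete, here is a rough sketch of the skip-if-present behaviour (illustrative names only, not bcbio's actual helpers): when the expected output file already exists, the command is never issued, so nothing reaches either log even though the run finishes with the right outputs.
# Illustration only -- not bcbio code. Shows why pre-existing outputs would
# remove entries from both bcbio-nextgen-commands.log and the debug log:
# the command is simply never run, so there is nothing to write.
import os
import subprocess

def run_if_missing(cmd, out_file):
    """Run cmd only when out_file is absent or empty."""
    if os.path.exists(out_file) and os.path.getsize(out_file) > 0:
        return out_file                      # skipped: no log entry, no subprocess
    subprocess.check_call(cmd, shell=True)   # in bcbio, the command is logged right before this point
    return out_file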
Brad, unfortunately, no. I am always testing entirely fresh runs starting from pulling the docker image etc. Is there any additional debug information that I could provide that may be helpful in getting to the bottom of this?
Ben;
Sorry about not having a clue on this. I'm not sure why I can't reproduce this or what in your environment causes just GATK to fail logging. I guess the next best step would be to try the bcbio tests to see if they also show the problem. If you clone the main repo, change into the test suite (https://github.com/bcbio/bcbio-nextgen/tree/master/tests) and run ./run_tests.sh devel, does that small example run have GATK logging?
Your latest update also mentions Docker, so maybe there is something in the way that you're running bcbio through the container that inadvertently triggers something with the logging and GATK. Could you provide more details about how you're running the analysis?
Thanks again for the help debugging.
Brad, we've been digging a little bit. The issue must have been created between these two versions: https://github.com/bcbio/bcbio-nextgen/compare/a26f5f9..v1.1.0 As I said before, we are using bcbio on docker on a single machine.
As I understand it, Spark is used for gatk4 by default: https://github.com/bcbio/bcbio-nextgen/blob/612d9a56d18d523ef3084c84a3c4eebee4308229/bcbio/variation/gatk.py#L103
The parameters for Spark have changed, so maybe this is related: https://github.com/bcbio/bcbio-nextgen/compare/a26f5f9..v1.1.0#diff-937a9675ab52f30313637c82de3e0507R123
Logging of the gatk4 command happens here, as far as I can see: https://github.com/bcbio/bcbio-nextgen/blob/e6cc8c0db2cf0da1a940fa6d9c8a0333fdacec3f/bcbio/provenance/do.py#L22 If we print the command to stdout at that point, we see the correct log message, so the command string does reach the logging call; the logging itself does not work, though.
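To illustrate the point (a rough sketch of my reading of do.py, not the actual bcbio code): the command string is handed to a logbook logger right before the subprocess is started, so if that logger's records never reach the main process, the command vanishes from bcbio-nextgen-commands.log even though the command itself still runs.
# Rough sketch only; logger names here are made up for illustration.
import subprocess
import logbook

cmd_log = logbook.Logger("bcbio-commands")
debug_log = logbook.Logger("bcbio-debug")

def run(cmd, descr=None):
    if descr:
        debug_log.info(descr)                   # e.g. "GATK: HaplotypeCaller"
    cmd_str = cmd if isinstance(cmd, str) else " ".join(cmd)
    cmd_log.debug(cmd_str)                      # this is the record that goes missing for us
    subprocess.check_call(cmd_str, shell=True)  # the command itself still runs fine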
Ben; Thanks for the debugging work. You're right on with all your exploration; that is where logging happens. However, we haven't had any changes to that framework over the commit range you reference, so I'm not seeing what could trigger the issue. We do use GATK Spark, but this shouldn't affect writing the commands run to the logs, so maybe it's worth first focusing on why logging fails only for GATK commands while all other run commands get logged.
How are you running bcbio inside of Docker? Are you using bcbio_vm.py, CWL, or directly running from inside the container in a custom way? Maybe we changed something in the Docker container or version of Docker, or how we're wrapping the runs?
Sorry to not have a better clue here but hope this helps some for exploring.
Brad, We are running bcbio_nextgen.py inside the docker container. No bcbio_vm.py or CWL.
Ben; Thanks for this. A couple of follow-up questions to help me try and reproduce:
I'm trying to get at a minimal test case so I can try and see the problem on our runs and dig more into it. Thanks again for the help debugging.
Brad,
I checked with the latest bcbio docker image: if you start the bcbio/bcbio-vc:latest image interactively (shell as entrypoint) and create a germline variant calling job, GATK4 log messages are missing here, too.
Note that we are not using bcbio_vm.py to orchestrate docker containers, but rather using bcbio inside docker as a standalone. This seems to be the difference. Not sure if you intend to support this use case.
I have not been able to get bcbio_vm.py to start yet. I would like to avoid downloading the entire data installation again, but bcbio_vm.py does not seem to be fine with my symlinks to existing data installations.
I was able to narrow it down a little further: https://github.com/bcbio/bcbio-nextgen/compare/3ddc267..a347377
Another note: log entries reappear when using single-core (-n 1). As soon as I switch to multi-core with -n 4, log messages are no longer written to stdout and the command is not written to the command log. Are tests performed with single-core settings?
Brad,
I have now found the problem: it is caused by switching to a newer version of joblib, which has changed its default parallel execution backend from multiprocessing to loky from version 0.12 onwards:
This backend relies on a robust implementation of concurrent.futures.ProcessPoolExecutor with spawned processes that can be reused across the Parallel calls.
(from the 0.12 release notes, p. 55 of the current docs)
Unfortunately, this new backend does not play well with the way the logbook log handlers currently use a queue for logging in a multi-core setting.
Of the other available joblib backends, only multiprocessing and sequential seem to work properly, and sequential is used automatically for single-core runs (this is why some commands have been logged properly, but not the BWA or GATK commands once bcbio is configured to use multiple cores, e.g. via bcbio_nextgen.py config/bcbio.yaml -n 4).
Neither threading nor the new default, loky, works for the logging.
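Here is a minimal, self-contained sketch of the effect, using the stdlib logging module instead of logbook purely for illustration: fork-based multiprocessing workers (the default on Linux) inherit the logging configuration of the parent process, while loky spawns fresh workers that do not inherit it, so their log records are dropped.
# Illustration only: stdlib logging instead of logbook, no bcbio code involved.
import logging
import joblib

logging.basicConfig(level=logging.INFO, format="%(processName)s %(message)s")

def work(i):
    # Emitted under backend="multiprocessing" (forked children inherit the
    # parent's logging configuration); dropped under the default "loky"
    # backend, whose freshly spawned workers do not inherit it.
    logging.getLogger("demo").info("working on %s", i)
    return i

if __name__ == "__main__":
    joblib.Parallel(n_jobs=2, backend="multiprocessing")(
        joblib.delayed(work)(i) for i in range(4))  # log lines appear
    joblib.Parallel(n_jobs=2)(
        joblib.delayed(work)(i) for i in range(4))  # loky default: log lines missing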
It seems the easiest way to fix this for now would be to manually select the old default backend when calling joblib.Parallel, which is supported:
“multiprocessing”: previous process-based backend based on multiprocessing.Pool. Less robust than loky.
This would mean changing this line to:
for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(x) for x in items):
I have tested this and the logging now seems to work as before.
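For completeness, the same pin could also be expressed with joblib's parallel_backend context manager rather than the backend= argument; this is just an equivalent way of writing the same idea, not what I am proposing for the PR:
# Equivalent alternative to passing backend= at the call site (illustration only).
import joblib

def run_parallel(fn, items, num_jobs):
    with joblib.parallel_backend("multiprocessing"):
        return joblib.Parallel(num_jobs, batch_size=1)(
            joblib.delayed(fn)(x) for x in items)
Either form keeps the fork-based workers, and with them the existing logbook queue handlers, working.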
Should I open a PR for this?
Ben;
Thanks so much for this detective work. Wow. I tried to replicate the issue you were seeing with the latest joblib (0.13) and the simple bcbio test we've been using (with -n 2) and don't see the issue, at least running outside of the docker container. Can you replicate it both inside and outside of the container, or only inside? Practically, I'd be happy to merge a PR for this if it fixes it for you, as that's our main concern. Thanks again for all this incredible work debugging the issue.
Brad, we were able to replicate this outside of docker. We will put together a simple germline calling job to reproduce it.
Ben; Thanks for this. If using multiprocessing fixes the issues in your hands I'm happy to accept a PR, roll that fix in and move forward. I don't want you to spend more time having to dig into it. I was mostly curious as I don't understand why I can't reproduce, but pinning to the recent versions of joblib and defaulting back to multiprocessing works totally fine. Thanks again.
Hi Brad,
Thank you for your help. I've just opened #2569 with our fix. It's really strange that you can't reproduce this locally. We've checked the joblib issues regarding logging, but these refer to other backends and don't seem really related. Maybe we can simply revert this once the loky backend is more mature.
Roland and Ben; Thank you again for all the incredible work tracking this down and the fix. I'm really happy we managed to get this one resolved, and hopefully runs and logging are cleaner now for you going forward.
Hi Brad, Thank you so much for your help in tracking this down, and for merging this!