bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 353 forks source link

Problems with logs and joint VCF file generation in latest dev build #2473

Closed amizeranschi closed 6 years ago

amizeranschi commented 6 years ago

Hello,

After upgrading to the latest development version, the logs and joint VCF file generation don't seem to work properly anymore. Debug messages don't get printed anymore (neither on stdout, nor in the log file), and the bcbio-nextgen-debug.log file is pretty much identical with bcbio-nextgen.log. The only difference is the resource requests messages which appear in the debug log:

[2018-08-02T10:10Z] Resource requests: bwa, sambamba, samtools; memory: 3.00, 3.00, 3.00; cores: 16, 16, 16
[2018-08-02T10:10Z] Configuring 2 jobs to run, using 16 cores each with 48.1g of memory reserved for each job
[2018-08-02T10:10Z] Resource requests: gatk, gatk-haplotype, picard; memory: 3.50, 3.00, 3.00; cores: 1, 16, 16
[2018-08-02T10:10Z] Configuring 32 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2018-08-02T10:18Z] Resource requests: bcbio_variation, fastqc, gatk, gatk-vqsr, gemini, kraken, preseq, qsignature, sambamba, samtools; memory: 3.00, 3.00, 3.50, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00; cores: 16, 16, 1, 16, 16, 16, 16, 16, 16, 16
[2018-08-02T10:18Z] Configuring 2 jobs to run, using 16 cores each with 56.1g of memory reserved for each job

The multi-sample <batch>-gatk-haplotype-joint-annotated.vcf.gz did not get generated, even though the sample-specific VCF files are where they should be.

Furthermore, bcbio-nextgen-commands.log is completely empty.

To test all of this, I've run a simple variant calling job that worked flawlessly a few days ago, before upgrading Bcbio-nextgen.

chapmanb commented 6 years ago

Glad the CWL tests are now passing and thanks for the work debugging. The files you'd need for reproducing would be the gVCF input files. These are present (or at least linked to) in the inputs directory of different steps:

/cromwell-executions/main-testingVC-merged.cwl/20f07690-6792-4bf5-80af-52b937112b35/call-jointcall/shard-0/wf-jointcall.cwl/680304fd-104e-4ab4-ae35-a692ffd75652/call-run_jointvc/shard-7/inputs

The transactional directories get removed when tasks end so you don't expect those to have files. We'd need a couple of inputs gVCFs that are failing and then could run the gvcfgenotyper command outside of bcbio to try and reproduce. Thanks again for the help.

amizeranschi commented 6 years ago

Thanks for clarifying the location of the input files. I've just sent them by e-mail. I hope they will be useful.

chapmanb commented 6 years ago

Thanks for passing these along; I appreciate the help debugging. Are these the input to one of the failing steps using gvcfgenotyper? The only input VCF there is not a gVCF but rather a combined and called VCF file with two samples, so I'm confused. I wouldn't expect you to have these if the gvcfgenotyper step failed earlier. Are these the input files from the path I referenced or a different one? Also, could you share your configuration for the actual run? Maybe I'm missing something in reproducing and understanding the issue. Thanks again for all the help.

amizeranschi commented 6 years ago

Yes, those are the files from the directory you mentioned in your previous post (/home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged__failed-strelka2/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/20f07690-6792-4bf5-80af-52b937112b35/call-jointcall/shard-0/wf-jointcall.cwl/680304fd-104e-4ab4-ae35-a692ffd75652/call-run_jointvc/shard-7/inputs/).

This is the YAML template:

# Template for whole genome Illumina variant calling with GATK pipeline
---
details:
  - analysis: variant2
    genome_build: sacCer3
    algorithm:
      aligner: bwa
      mark_duplicates: true
      recalibrate: false
      realign: false
      effects: false
      tools_off: gemini
      variantcaller: strelka2
      jointcaller: strelka2-joint
      svcaller: [manta]
      ploidy: 1
      variant_regions: ../config/variant_regions.bed

This is the CSV file:

samplename,description,batch
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample1-R1-L001.fastq.gz,sample1,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample1-R1-L002.fastq.gz,sample1,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample1-R2-L001.fastq.gz,sample1,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample1-R2-L002.fastq.gz,sample1,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample2-R1-L001.fastq.gz,sample2,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample2-R1-L002.fastq.gz,sample2,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample2-R2-L001.fastq.gz,sample2,testingVC
/home/amizeranschi/bcbio-stuff/automated-VC-test/fastq-files/sample2-R2-L002.fastq.gz,sample2,testingVC

And this is the YAML config file:

details:
- algorithm:
    aligner: bwa
    effects: false
    jointcaller: strelka2-joint
    mark_duplicates: true
    ploidy: 1
    realign: false
    recalibrate: false
    svcaller:
    - manta
    tools_off: gemini
    variant_regions: ../config/variant_regions.bed
    variantcaller: strelka2
  analysis: variant2
  description: sample1
  files:
  - /home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged/sample1_R1.fastq.gz
  - /home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged/sample1_R2.fastq.gz
  genome_build: sacCer3
  metadata:
    batch: testingVC
- algorithm:
    aligner: bwa
    effects: false
    jointcaller: strelka2-joint
    mark_duplicates: true
    ploidy: 1
    realign: false
    recalibrate: false
    svcaller:
    - manta
    tools_off: gemini
    variant_regions: ../config/variant_regions.bed
    variantcaller: strelka2
  analysis: variant2
  description: sample2
  files:
  - /home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged/sample2_R1.fastq.gz
  - /home/amizeranschi/bcbio-stuff/automated-VC-test/testingVC-merged/sample2_R2.fastq.gz
  genome_build: sacCer3
  metadata:
    batch: testingVC
fc_name: testingVC-merged
upload:
  dir: ../final
amizeranschi commented 6 years ago

I tried running the same test (Yeast data + CWL with Cromwell), but I've replaced Strelka2 with GATK's HaplotypeCaller. It's been running for almost 7 hours now and I'm not sure if it's still working or just stuck somehow.

The last lines of the output so far are:

# return exit code
exit $rc
[2018-11-30 12:00:56,75] [info] WorkflowExecutionActor-254691bd-cc5e-4c38-beeb-2f3f5f8b9a4f [254691bd]: Starting normalize_sv_coverage
[2018-11-30 12:00:57,23] [info] BackgroundConfigAsyncJobExecutionActor [30fc7c81variantcall_batch_region:0:1]: job id: 10055
[2018-11-30 12:00:57,24] [info] BackgroundConfigAsyncJobExecutionActor [30fc7c81variantcall_batch_region:0:1]: Status change from - to WaitingForReturnCodeFile

Is there a way to check if this is still running or just stuck?

For comparison, the same workflow (with Strelka2) took 20 minutes when running on a local Bcbio_nextgen copy, and around an hour when running with Bcbio_vm and Docker (no CWL). Strelka2 has run aprox. 2X faster than GATK's HaplotypeCaller for me, locally, in Bcbio_nextgen (tested a couple of weeks ago).

Is it expected for Bcbio_vm+Docker+CWL to run significantly slower than Bcbio_nextgen or the older Bcbio_vm+Docker workflow (now CWL)?

chapmanb commented 6 years ago

Thanks much for the example, this was a big help to identify the issue. If you add tools_on: [gvcf] to your configuration, it should now do the right thing. In CWL you can drop the jointcaller input approach and just add this for do gVCF based joint calling. This seemed like a cleaner approach as the dual variantcaller and jointcaller approach ended up being confusing.

However, the way you specified should work and not fail, so I've also pushed fixes to make that happen. Those should be in place by tomorrow but in the short term just removing jointcaller and adding tools_on: [gvcf] should proceed as you expect.

For the GATK issue, I'm not sure if this is a symptom of the same underlying problem or if something else is happening. You should be able to use top to check if there is anything running, which is the most straightforward way. In general, we don't expect the CWL approach to be significantly slower than the older approach so taking this much longer doesn't seem right. For smaller runs on a single machine CWL might be slower depending on the core allocation, but should be relatively similar.

Hope this all helps and gets your analysis finally finished. Thanks again for all the work and patience.

amizeranschi commented 6 years ago

I've been checking htop and I can see lines showing that stuff is happening (e.g. /usr/bin/containerd or java lines running cromwell.jar). I'm just surprised at the rather long runtime for this.

Just to check, should the jointcaller option only be taken out for CWL, or is it still useful for Bcbio_nextgen and Bcbio_vm+Docker?

I'll test this tomorrow and report back.

amizeranschi commented 6 years ago

Adding the option tools_on: [gvcf] got Strelka2 to run fine for me, both with and without the jointcaller option. Furthermore, the wall time was approx. the same as the one I had with Bcbio_nextgen, which is great news (around 20 minutes). GATK's HaplotypeCaller also ran fine (only tested with tools_on: gvcf and also jointcaller: gatk-haplotype-joint) and its wall time was around 30 minutes (50% slower than Strelka2).

Is there any good reason to still use HaplotypeCaller instead of Strelka2?

One thing I noticed was that the upload directory seems to have changed when using CWL, compared with what I've expected from Bcbio_nextgen. This is what the directory structure looks like for my test:

├── testingVC-merged
│   ├── config
│   ├── final
│   ├── work
│   │   ├── cromwell_work
│   │   │   ├── cromwell-executions
│   │   │   ├── cromwell-workflow-logs
│   │   │   ├── final
│   │   │   │   ├── project
│   │   │   │   ├── sample1
│   │   │   │   ├── sample2
│   │   │   ├── persist
│   │   ├── testingVC-merged-workflow

To run both Bcbio_nextgen and Bcbio_vm+CWL, I cd into testingVC-merged/work and run either

bcbio_nextgen.py ../config/<config>.yaml -n <cores>

or

bcbio_vm.py cwlrun cromwell testingVC-merged-workflow

Bcbio_nextgen seems to set the upload directory to testingVC-merged/final. However, Bcbio_vm+CWL copies files to testingVC-merged/work/cromwell_work/final.

In addition, the name of the directory where the joint VCF file and multiqc results are copied differs between Bcbio_nextgen and Bcbio_vm+CWL. With the former, that directory was called something like <run_date>_<exp_name>, e.g. 2018-11-12_testingVC-merged in my example. Meanwhile, for Bcbio_vm+CWL, this is simply called project.

I also noticed a file called testingVC-manta.vcf.gz inside testingVC-merged/work/cromwell_work/final/project. Is this meant to be a multi-sample a joint-VCF file for sv calls, similarly to what's done for small variants called by Strelka2? The file was empty for my test case, (as I never bothered to set up a realistic test case for large variants) and I only wanted to see if Manta actually ran successfully.

WIth Bcbio_nextgen, I expected Manta VCFs to be in the testingVC-merged/final/sample1 and testingVC-merged/final/sample1 directories. With Bcbio_vm+CWL, I see that all the VCF files are in testingVC-merged/work/cromwell_work/final/project and only the BAM files and other metadata are in testingVC-merged/work/cromwell_work/final/sample1 and testingVC-merged/work/cromwell_work/final/sample2.

What's the plan with the two workflows (Bcbio_nextgen and Bcbio_vm+CWL)? Are they going to diverge in terms of directory set-up and output file locations? Is Bcbio_vm+CWL going to replace Bcbio_nextgen in the future?

chapmanb commented 5 years ago

It's brilliant to hear this is now working for you. Thanks for all the patience getting it running and for the usage report. In terms of GATK HaplotypeCaller and strelka2, both are good choices for joint variant calling. They do have different intermediate gVCF formats so you might also make a decision based on other platforms or analyses you want to integrate with.

Thanks also for the detailed investigation of the output directory. It's expected to be non-identical for the CWL case. Replicating some of what we did previously was difficult since we get the results in a different manner from the CWL runner so can't allocate to sample and project directories in the exact same way. We were also hoping to clean up some of the historical cruft and confusion in the upload directory. Please let us know if anything is problematic and we can investigate.

For manta, you're right that is should be a merged VCF of the batched samples. It's not joint called, but rather called together within a group. Hope that works as expected when you're able to put together a realistic case for it.

Thanks again for all the feedback and work on getting this going.

amizeranschi commented 5 years ago

OK, great. Thanks for all your help!