bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Possible memory leak in bcbio_nextgen.py #1918

Closed matthdsm closed 7 years ago

matthdsm commented 7 years ago

Hi Brad,

We're experiencing a somewhat weird issue. We run bcbio with IPython on a Torque cluster: the bcbio_nextgen.py command runs as a single-core job, which then spawns the IPython worker jobs (roughly as sketched below). This "master" job seems to use an excessive amount of memory, which keeps rising until the worker node runs out of memory and the job is killed.
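For context, the submission looks roughly like the sketch below (the queue name, core count and config path are placeholders rather than our exact values):

#!/bin/bash
#PBS -N bcbio-master
#PBS -l nodes=1:ppn=1        # the "master" process itself is a single-core job
#PBS -q batch                # placeholder queue name
cd $PBS_O_WORKDIR

# bcbio then submits the IPython worker jobs back to Torque itself
bcbio_nextgen.py ../config/project.yaml \
    -t ipython -s torque -q batch -n 120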

Any idea what could be causing this? Right now the memory usage has stabilized at about 8 GB, which seems like a tad too much to me.

Thanks M

roryk commented 7 years ago

Hi Matthias,

Which pipeline are you seeing this behavior with?

matthdsm commented 7 years ago

Just the default "variant2" pipeline. We're using v1.0.2 with the following config:

#bcbio-nextgen v1.0.2
---
#include an experiment name here
fc_name:
upload:
  dir: ../final
globals:
  analysis_regions: RefSeq_allexons_20bp.sorted.merged.bed
resources:
  tmp:
    dir: /tmp/bcbio
details:
  - analysis: variant2
    genome_build: hg38
    description:
    metadata:
      batch:
    algorithm:
      aligner: bwa
      save_diskspace: true
      coverage_interval: regional
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: gatk-haplotype
      variant_regions: analysis_regions
      jointcaller: gatk-haplotype-joint
      effects: vep
      effects_transcripts: all
      vcfanno: [gemini,../config/eog.conf,../config/jpopgen.conf]
      tools_on:
        - vep_splicesite_annotations
      # add the path to your files here
      files:

Thanks for looking into this. M

chapmanb commented 7 years ago

Matthias; Thanks for reporting the issue and for the details. How many samples are you running concurrently? Memory usage will be dependent on that since bcbio builds record objects to pass for parallelization. During highly parallel steps like variant calling this can be a lot of objects and the memory usage can get high. Could this explain what you're seeing?
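If you want to confirm it's the master process rather than the workers, a rough way to watch its resident memory on the node, assuming the process can be found by its script name, is something like the snippet below; you'd expect the RSS to jump during the highly parallel steps:

# print PID, resident memory (kB) and elapsed time of the oldest matching
# bcbio_nextgen.py process every five minutes
while true; do
    ps -o pid,rss,etime,args -p "$(pgrep -of bcbio_nextgen.py)"
    sleep 300
done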

Apologies, I know this isn't ideal for continuing to scale up. This is one of the motivations for moving to CWL where we can use more scalable infrastructures for handling these sorts of issues.

matthdsm commented 7 years ago

Hi Brad, This is a run containing 120 exomes (the first time we've done a run this big). During our previous runs (about 48 samples) we had no issues, so it could very well be the number of samples that's throwing things off.

It's not much of a problem now, but it's good to know for future runs. If we have bigger sample sets, we can batch them (roughly as sketched below). I don't know if this limitation is mentioned anywhere in the docs, but it might be worth adding.
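Purely as a sketch of what I mean by batching (sample_sheet.csv, the chunk size of 48, and the template YAML name are placeholders, not files from this run): split the sample sheet, set up one bcbio project per chunk with the template workflow, and run the projects one after another.

# split a large sample sheet into chunks of 48 samples, keeping the header
tail -n +2 sample_sheet.csv | split -l 48 - chunk_
for c in chunk_*; do
    ( head -n 1 sample_sheet.csv; cat "$c" ) > "${c}.csv"
    # one bcbio project per chunk, based on our existing template YAML
    bcbio_nextgen.py -w template ../config/template.yaml "${c}.csv" ../fastq/*.fastq.gz
done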

Thanks for looking into it! M

matthdsm commented 7 years ago

Hi Brad,

Just a little follow-up on this. Suppose we were looking into transitioning to CWL, which runner would you advise? We have a cluster running Torque/PBS, so Arvados is out of the question, and Toil doesn't support Torque. Any other suggestions?

Thanks M

chapmanb commented 7 years ago

Matthias; We're currently working on Torque/PBS support for Toil, but it's not quite there yet. Apologies, the CWL work is still under active development and isn't ready for production use right now. Thanks for checking in on it, and we'll keep moving things forward.

matthdsm commented 7 years ago

Hi Brad, no problem, I just wanted to know how things stand as of now. I'll close this issue, since the problem was clearly caused by the number of samples.

Cheers, M