bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

How to troubleshoot slowed down run times? #1403

Closed sxv closed 8 years ago

sxv commented 8 years ago

I've been running tumor/normal paired exome cases through our bcbio pipeline, and a couple of weeks ago, without intentionally changing any settings, the runs went from taking ~4 hours to ~37 hours! The alignment post-processing is taking the most time (~12 hrs), but all of the steps are taking longer. I've restarted the machine, I'm using the same command to start the runs, the config files are nearly identical, and the output files show no significant differences (a similar number of reads and variants called).

The only guesses I have are:

* cores are not being utilized correctly
* one of my hard drives is failing
* some other system issue
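
For what it's worth, these are the quick checks I'm planning to run to test those guesses (standard Linux tools, nothing bcbio-specific; the device name is just an example):

# per-core CPU utilization while a run is active (press 1 inside top for the per-core view)
top
# free space on every mounted filesystem
df -h
# basic SMART health check on a suspect drive
sudo smartctl -H /dev/sda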

I wanted to post some information from logs here to see if there were any hints that someone might find to help troubleshoot. Any advice greatly appreciated, thanks!

Command to start run: bcbio_nextgen.py -n 40 /media/NPC.HK.11QM0009T-DNA.11QM0009/config/NPC.HK.11QM0009T-DNA.11QM0009.yaml

BEFORE (completion in 4 hours): First line of bcbio-nextgen-commands.log:

[2016-04-21T01:32Z] gunzip -c /media/NPC_WES/11QM0003-T_1.fastq.gz | /media/SSD-Data/bcbio/bin/pbgzip -n 16  -c /dev/stdin > /media/SSD-Data/NPC.HK/NPC.HK.11QM0003-T.11QM0003/tmp/bcbiotx/f74e1dbd-0799-4843-8a52-c564546e9537/tmpLVWIkP/11QM0003-T_1.fastq.gz

bcbio-nextgen.log:

[2016-04-21T01:32Z] Timing: organize samples
[2016-04-21T01:32Z] Timing: alignment preparation
[2016-04-21T01:42Z] Timing: alignment
[2016-04-21T02:06Z] Timing: callable regions
[2016-04-21T02:56Z] Timing: structural variation initial
[2016-04-21T02:56Z] Timing: hla typing
[2016-04-21T02:56Z] Timing: alignment post-processing
[2016-04-21T03:31Z] Timing: variant calling
[2016-04-21T05:04Z] Timing: joint squaring off/backfilling
[2016-04-21T05:04Z] Timing: variant post-processing
[2016-04-21T05:12Z] Timing: prepped BAM merging
[2016-04-21T05:17Z] Timing: validation
[2016-04-21T05:17Z] Timing: ensemble calling
[2016-04-21T05:17Z] Timing: validation summary
[2016-04-21T05:17Z] Timing: structural variation final
[2016-04-21T05:17Z] Timing: structural variation ensemble
[2016-04-21T05:17Z] Timing: structural variation validation
[2016-04-21T05:17Z] Timing: heterogeneity
[2016-04-21T05:17Z] Timing: population database
[2016-04-21T05:23Z] Timing: quality control
[2016-04-21T05:31Z] Timing: archive
[2016-04-21T05:31Z] Timing: upload

AFTER (completion in 37 hours): First line of bcbio-nextgen-commands.log:

[2016-04-26T01:03Z] gunzip -c /media/NPC_WES/11QM0031T_1.fastq.gz | /media/SSD-Data/bcbio/bin/pbgzip -n 16  -c /dev/stdin > /media//SSD-Data/NPC.HK/NPC.HK.11QM0031T.11QM0031/tmp/bcbiotx/21fd43a2-727b-4f3d-a2a8-0adbfee10e8c/tmpHxSw72/11QM0031T_1.fastq.gz

bcbio-nextgen.log:

[2016-04-26T01:03Z] Timing: organize samples
[2016-04-26T01:03Z] Timing: alignment preparation
[2016-04-26T01:12Z] Timing: alignment
[2016-04-26T02:06Z] Timing: callable regions
[2016-04-26T03:41Z] Timing: structural variation initial
[2016-04-26T03:41Z] Timing: hla typing
[2016-04-26T03:41Z] Timing: alignment post-processing
[2016-04-26T15:18Z] Timing: variant calling
[2016-04-27T13:49Z] Timing: joint squaring off/backfilling
[2016-04-27T13:49Z] Timing: variant post-processing
[2016-04-27T13:56Z] Timing: prepped BAM merging
[2016-04-27T14:03Z] Timing: validation
[2016-04-27T14:03Z] Timing: ensemble calling
[2016-04-27T14:03Z] Timing: validation summary
[2016-04-27T14:03Z] Timing: structural variation final
[2016-04-27T14:03Z] Timing: structural variation ensemble
[2016-04-27T14:03Z] Timing: structural variation validation
[2016-04-27T14:03Z] Timing: heterogeneity
[2016-04-27T14:03Z] Timing: population database
[2016-04-27T14:09Z] Timing: quality control
[2016-04-27T14:19Z] Timing: archive
[2016-04-27T14:19Z] Timing: upload

Config file (identical in both cases other than input file paths):

--- 
fc_date: 20160425
fc_name: NPC.HK.11QM0031T.11QM0031
upload:
  dir: /media/drive2/exome.complete/
resources:
  tmp:
    dir: /media/SSD-Data//NPC.HK/NPC.HK.11QM0031T.11QM0031/tmp
details:
  - analysis: variant2
    genome_build: hg19
    metadata:
       batch: NPC.HK.11QM0031T.11QM0031
       phenotype: tumor
    description: 11QM0031T.tumor
    files:
      - /media/NPC_WES/11QM0031T_1.fastq.gz
      - /media/NPC_WES/11QM0031T_2.fastq.gz
    algorithm:
      aligner: bwa
      #coverage_interval: exome
      mark_duplicates: true
      recalibrate: True
      realign: true
      platform: illumina
      tools_off: [gemini]
      variantcaller: [mutect, varscan, freebayes, vardict-java]
      ensemble:
        numpass: 2
      #svcaller: [lumpy, cnvkit, manta, metasv]
  - analysis: variant2
    genome_build: hg19
    metadata:
       batch: NPC.HK.11QM0031T.11QM0031
       phenotype: normal
    description: 11QM0031.normal
    files:
      - /media/NPC_WES/11QM0031_1.fastq.gz
      - /media/NPC_WES/11QM0031_2.fastq.gz
    algorithm:
      aligner: bwa
      #coverage_interval: exome
      mark_duplicates: true
      recalibrate: True
      realign: true
      platform: illumina
      tools_off: [gemini]
      variantcaller: [mutect, varscan, freebayes, vardict-java]
      ensemble:
        numpass: 2
      #svcaller: [lumpy, cnvkit, manta, metasv]
ohofmann commented 8 years ago

Just wondering... is this just with this specific set of input files? Are the previous input files still around, and could you use them for a quick sanity check? Before trying to troubleshoot the system it might make sense to confirm that there isn't something off with the new samples, e.g., a number of regions with extremely high coverage that bcbio might be struggling with.

The only other thing that comes to mind is a network issue (assuming /media is mounted via the network).
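
If you're not sure how it is mounted, a quick way to check whether /media is a network mount (a sketch using standard Linux tools):

# show the filesystem type backing /media; nfs or cifs would indicate a network mount
df -hT /media
# or check the mount table directly
mount | grep ' /media'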

chapmanb commented 8 years ago

Sorry about the problems; that is unexpected for sure. In addition to Oliver's questions about inputs, is this with the same version of bcbio, or have you done updates between the two runs? It would be helpful to understand the differences between the two input files that are causing the different run times.

From the timings and command lines posted I don't see anything obviously wrong; it just looks like the second run takes longer at most steps, except alignment (which is strange). Is it possible the first run had poor mapping so subsequent steps were quick, while the second has more reads/better mapping? Alternatively, it could be that the filesystem was busy or under load during the second run, since the worst slowdowns appear to happen during IO-intensive steps like realignment and variant calling (VarScan uses a lot of disk).
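
One way to check whether disk IO is the bottleneck while the slow run is going would be to watch the drives directly (a sketch, assuming the sysstat and iotop packages are available):

# per-device utilization and wait times, refreshed every 5 seconds
iostat -x 5
# per-process IO, useful for spotting which step is hammering the disk
sudo iotop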

Hope this helps some with debugging.

sxv commented 8 years ago

This problem is not due to the specific set of input data: I've re-run the exact same files that performed fine before, and they are now running ~8x slower. So it seems this must be a system or disk issue. I don't believe bcbio was updated during this period either (programs.txt shows no differences).

One question: I typically add the option "-n 40" or "-n 36" when running bcbio_nextgen.py to specify using 40 or 36 cores, but I'm not actually seeing these numbers reflected in bcbio-nextgen-commands.log. Instead, as you can see in the example above, I'm seeing "-n 16" included in several commands, like pbgzip. Does this indicate that "-n 40" is not working properly?

chapmanb commented 8 years ago

If the exact same bcbio and samples are running much slower, it does point to some kind of system issue. Sorry not to be able to pinpoint it exactly, but hopefully you can figure out what is going on and get things running quickly again.

The -n 40 versus 16 core usage is the difference between total cores (40) and those assigned per sample (16). bcbio allows you to specify these separately to handle multiple samples running in parallel, where it's likely better to, say, have 4 samples each using 10 cores than 1 sample at a time with 40 cores. If you're running single samples you probably want to maximize per-sample usage, and can set it to 40 cores in your bcbio_system.yaml. There are more details in the documentation here:

http://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html#tuning-core-and-memory-usage
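
As a rough sketch, the relevant section of bcbio_system.yaml would look something like the following (the memory and jvm_opts values here are placeholders; use what your machine actually has):

resources:
  default:
    cores: 40
    memory: 3G
    jvm_opts: ["-Xms750m", "-Xmx3500m"]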

Hope this helps get things running quickly for you.

sxv commented 8 years ago

Thanks for the information. I have been running some test cases on different drives / servers, and I noticed that the server in question produces this output when starting:

[2016-05-26T21:21Z] Configuring 2 jobs to run, using 16 cores each with 32.1g of memory reserved for each job
/media/SSD-Data/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py:86: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
  for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):

Could this be related to the issues I'm experiencing?

sxv commented 8 years ago

The more I look into it, the more it seems that the warning in the previous comment is in fact causing the slowdown. Now I need to figure out why this is happening on one server and not the other, when run with the same command and config. Any ideas? The warning does not appear in the logs, only in stdout, which is why I couldn't find it when debugging; perhaps these warnings should be included in bcbio-nextgen.log?

Thanks.

chapmanb commented 8 years ago

This is a limitation of running on single machines: Python doesn't allow nesting of multicore jobs, so you can't run a top-level multicore job (two 16-core jobs) and then parallelize inside of that (in this case, likely calculating callable regions). This isn't machine specific and should be unrelated to the runtime differences you're seeing across machines, assuming you're running them in the same way on both. This is one of the motivations for moving to a flattened parallelization structure in the Common Workflow Language.
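
As a minimal standalone sketch of the underlying joblib behavior (not bcbio's actual code, just an illustration of why a nested pool degrades to serial execution):

from joblib import Parallel, delayed

def inner(x):
    # A nested pool: multiprocessing workers are daemonic and cannot spawn
    # their own children, so joblib falls back to n_jobs=1 here (with a
    # "cannot be nested" warning) and this loop runs serially.
    return Parallel(n_jobs=4, backend="multiprocessing")(
        delayed(pow)(x, i) for i in range(4))

if __name__ == "__main__":
    # The outer pool, loosely analogous to bcbio's two 16-core sample jobs.
    print(Parallel(n_jobs=2, backend="multiprocessing")(
        delayed(inner)(x) for x in range(2)))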

Sorry this is not a helpful step in debugging your current problem. Are you running jobs in the same way on both servers?

sxv commented 8 years ago

Yes, the jobs are run in the same way on both servers. The warning only appears on the problematic server, and it is displayed at the beginning of the run:

[2016-05-27T22:23Z] System YAML configuration: /media/SSD-Data/bcbio/galaxy/bcbio_system.yaml
[2016-05-27T22:23Z] Resource requests: bwa, sambamba, samtools; memory: 2.00, 2.00; cores: 16, 16, 16
/media/SSD-Data/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py:86: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
  for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
[2016-05-27T22:23Z] Configuring 2 jobs to run, using 16 cores each with 32.1g of memory reserved for each job
[2016-05-27T22:23Z] Timing: organize samples
[2016-05-27T22:23Z] multiprocessing: organize_samples
sxv commented 8 years ago

Issue resolved. Disk space on the boot drive was running low, and even though that disk was not used for bcbio I/O, it was causing system lag.
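
For anyone who hits the same thing, a quick way to spot it (standard Linux tools):

# check free space on every mounted filesystem, not just the bcbio working drives;
# a nearly full root/boot filesystem can slow the whole system down
df -h
# then find what is using the space on the root filesystem
sudo du -xh --max-depth=1 / | sort -h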

roryk commented 8 years ago

Thanks for following up, Sujay; it's super helpful to know what the problem was in case someone else encounters it.