resources assignment when perform parallel jobs

Version info

bcbio version (bcbio_nextgen.py --version): 1.2.9
OS name and version (lsb_release -ds): Ubuntu 20.04.5 LTS

To Reproduce

Exact bcbio command you have used:

 nohup bcbio_nextgen.py ../config/my_project.yaml  -n 64 &

Your yaml configuration file:

- algorithm:
    align_split_size: false
    aligner: bwa
    coverage_interval: regional
    ensemble:
      numpass: 2
    exclude_regions: lcr
    svcaller:
    - manta
    - cnvkit
    variant_regions: /home/data/bcbio/genomes/Hsapiens/hg38/coverage/capture_regions/Exome-Agilent_V6.bed
    variantcaller:
      germline:
      - freebayes
      - gatk-haplotype
      - strelka2
      somatic:
      - vardict
      - mutect2
      - strelka2
  analysis: variant2
  description: sample8
  files:
  - /home/data/bcbio/projects/input/S10_L4_518_R1.fastq.gz
  - /home/data/bcbio/projects/input/S10_L4_518_R2.fastq.gz
  genome_build: hg38
  metadata:
    batch: MatchWith_sample8
    phenotype: tumor
    prep_method: 300x
    tissue: tissue
resources:
    bamsormadup:
      cores: 8
      memory: 2G
    bwa:
      cores: 8
      memory: 2G
    gatk:
      jvm_opts:
      - -Xms2g
      - -Xmx4g
    genome:
      dir: /home/data/bcbio/genomes/Hsapiens/hg38
    samtools:
      cores: 16
      memory: 2G

With -n 64 as the total number of available cores and the resources section of my yaml file shown above, I expected each job to use only 8 cores for bwa mem. However, both the debug log and the command log show that the resources were not allocated as I specified. In addition, the pipeline repeatedly fails with "Segmentation fault (core dumped)", as shown below. I do not know what causes this or how to fix it. Could you please help me with this problem? Thanks~

Log files (could be found in work/log)

debug-log

[2023-11-15T06:04Z] System YAML configuration: /home/data/bcbio/galaxy/bcbio_system.yaml.
[2023-11-15T06:04Z] Locale set to C.UTF-8.
[2023-11-15T06:04Z] Resource requests: bwa, sambamba, samtools; memory: 2.00, 6.00, 2.00; cores: 8, 32, 16
[2023-11-15T06:04Z] Configuring 1 jobs to run, using 32 cores each with 192.1g of memory reserved for each job
[2023-11-15T06:04Z] Timing: organize samples
[2023-11-15T06:04Z] multiprocessing: organize_samples

command-log

[2023-11-15T06:05Z] unset JAVA_HOME && /home/data/bcbio/galaxy/../anaconda/bin/bwa mem   -c 250 -M -t 32  -R '@RG\tID: sample8\tPL:illumina\tPU:sample8\tSM:sample8' -v 1 /home/data/bcbio/genomes/Hsapiens/hg38/bwa/hg38.fa /home/data/bcbio/projects/work/align_prep/sample8_S38_L3_543_R1.fastq.gz /home/data/bcbio/projects/work/align_prep/sample8_S38_L3_543_R2.fastq.gz  | /home/data/bcbio/galaxy/../anaconda/bin/bamsormadup inputformat=sam threads=24 tmpfile=/home/data/bcbio/projects/work/bcbiotx/tmpeva3dfj4/sample8-sort-sorttmp-markdup SO=coordinate indexfilename=/home/data/bcbio/projects/twin_somatic/twin_somatic/work/bcbiotx/tmpeva3dfj4/sample8-sort.bam.bai > /home/data/bcbio/projects/work/bcbiotx/tmpeva3dfj4/sample8-sort.bam

Segmentation fault error

     2397570 Segmentation fault      (core dumped) | /home/data/bcbio/galaxy/../anaconda/bin/bamsormadup inputformat=sam threads=12 tmpfile=/home/data/bcbio/projects/work/bcbiotx/tmp0716fu54/sample8-sort-sorttmp-markdup SO=coordinate indexfilename=/home/data/bcbio/projects/work/bcbiotx/tmp0716fu54/sample8-sort.bam.bai > /home/data/bcbio/projects/work/bcbiotx/tmp0716fu54/sample8-sort.bam

Hi @wangpenhok !

I suspect that you have an indentation issue here: you have 4 spaces instead of 2 after resources, so your specifications have not been parsed.

For a one-node, non-distributed run, bcbio's logic for allocating resources with -n 64 is:

try to run all tools with 64 cores, if the memory spec allows for that;
for example, the default spec says 4G per core, so 64 cores would need 64 x 4G = 256G RAM. If your server does not have this amount of RAM, bcbio tries to decrease the number of cores; the next step would be 32 cores x 4G = 128G RAM.
https://github.com/bcbio/bcbio-nextgen/blob/master/config/bcbio_system.yaml
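
The arithmetic above can be sketched in a few lines of Python. This is only a simplified illustration of the reasoning, not bcbio's actual implementation: the function name fit_cores is hypothetical, and halving the core count at each step is an assumption based on the 64 -> 32 example.

```python
def fit_cores(requested_cores, mem_per_core_gb, total_ram_gb):
    """Return the largest core count, reached by repeated halving,
    whose total memory request fits into the available RAM.

    Hypothetical helper for illustration only; bcbio's real logic
    takes more factors into account.
    """
    cores = requested_cores
    # Halve the core count while cores x memory-per-core exceeds the RAM.
    while cores > 1 and cores * mem_per_core_gb > total_ram_gb:
        cores //= 2
    return cores


if __name__ == "__main__":
    # 64 cores x 4G = 256G does not fit into a 200G server, 32 x 4G = 128G does.
    print(fit_cores(64, 4, 200))
```

With 4G per core, a server needs at least 256G RAM to keep all 64 cores in this model; anything below that drops the run to 32 cores or fewer.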

After these calculations, bcbio configured 1 job using 32 cores with 192.1g of memory reserved, as your debug log shows.

When bcbio runs a pipe, it accounts for the fact that every command in the pipe consumes RAM, so it has to decrease the cores to fit into the available RAM. That is what happened in this command:

bwa mem -t 32 | bamsormadup threads=24

Still, these values are very high for this server. Memory is also consumed by the IO buffers. Try running bcbio with -n 7 or -n 10, and at most -n 20.

Large values of -n only make sense in distributed bcbio runs, where the cores are requested across many servers.

bcbio / bcbio-nextgen

resources assignment when perform parallel jobs #3727