bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
MIT License

Resource assignment when running parallel jobs #3727

Open wangpenhok opened 1 year ago

wangpenhok commented 1 year ago

Version info

To Reproduce

Exact bcbio command you have used:

 nohup bcbio_nextgen.py ../config/my_project.yaml  -n 64 &

Your yaml configuration file:

- algorithm:
    align_split_size: false
    aligner: bwa
    coverage_interval: regional
      numpass: 2
    exclude_regions: lcr
    - manta
    - cnvkit
    variant_regions: /home/data/bcbio/genomes/Hsapiens/hg38/coverage/capture_regions/Exome-Agilent_V6.bed
      - freebayes
      - gatk-haplotype
      - strelka2
      - vardict
      - mutect2
      - strelka2
  analysis: variant2
  description: sample8
  - /home/data/bcbio/projects/input/S10_L4_518_R1.fastq.gz
  - /home/data/bcbio/projects/input/S10_L4_518_R2.fastq.gz
  genome_build: hg38
    batch: MatchWith_sample8
    phenotype: tumor
    prep_method: 300x
    tissue: tissue
      cores: 8
      memory: 2G
      cores: 8
      memory: 2G
      - -Xms2g
      - -Xmx4g
      dir: /home/data/bcbio/genomes/Hsapiens/hg38
      cores: 16
      memory: 2G

Supposedly, with the total number of available cores set to -n 64 and the resources in my yaml file set as shown above, each job should use only 8 cores to run bwa mem. However, both the debug log and the commands log show that the resources were not allocated as I intended. In addition, the pipeline repeatedly threw a "Segmentation fault (core dumped)" error, as shown below. I have no idea how this happened or what I should do to fix it. Could you please help me with this problem? Thanks~

Log files (could be found in work/log)


[2023-11-15T06:04Z] System YAML configuration: /home/data/bcbio/galaxy/bcbio_system.yaml.
[2023-11-15T06:04Z] Locale set to C.UTF-8.
[2023-11-15T06:04Z] Resource requests: bwa, sambamba, samtools; memory: 2.00, 6.00, 2.00; cores: 8, 32, 16
[2023-11-15T06:04Z] Configuring 1 jobs to run, using 32 cores each with 192.1g of memory reserved for each job
[2023-11-15T06:04Z] Timing: organize samples
[2023-11-15T06:04Z] multiprocessing: organize_samples


[2023-11-15T06:05Z] unset JAVA_HOME && /home/data/bcbio/galaxy/../anaconda/bin/bwa mem   -c 250 -M -t 32  -R '@RG\tID: sample8\tPL:illumina\tPU:sample8\tSM:sample8' -v 1 /home/data/bcbio/genomes/Hsapiens/hg38/bwa/hg38.fa /home/data/bcbio/projects/work/align_prep/sample8_S38_L3_543_R1.fastq.gz /home/data/bcbio/projects/work/align_prep/sample8_S38_L3_543_R2.fastq.gz  | /home/data/bcbio/galaxy/../anaconda/bin/bamsormadup inputformat=sam threads=24 tmpfile=/home/data/bcbio/projects/work/bcbiotx/tmpeva3dfj4/sample8-sort-sorttmp-markdup SO=coordinate indexfilename=/home/data/bcbio/projects/twin_somatic/twin_somatic/work/bcbiotx/tmpeva3dfj4/sample8-sort.bam.bai > /home/data/bcbio/projects/work/bcbiotx/tmpeva3dfj4/sample8-sort.bam

Segmentation fault error

     2397570 Segmentation fault      (core dumped) | /home/data/bcbio/galaxy/../anaconda/bin/bamsormadup inputformat=sam threads=12 tmpfile=/home/data/bcbio/projects/work/bcbiotx/tmp0716fu54/sample8-sort-sorttmp-markdup SO=coordinate indexfilename=/home/data/bcbio/projects/work/bcbiotx/tmp0716fu54/sample8-sort.bam.bai > /home/data/bcbio/projects/work/bcbiotx/tmp0716fu54/sample8-sort.bam
naumenko-sa commented 11 months ago

Hi @wangpenhok !

I suspect that you have an indentation issue here: you have 4 spaces instead of 2 after resources, so your specifications have not been parsed.

For a one-node non-distributed run, bcbio's logic in allocating resources with (-n 64) is

After these calculations, bcbio uses: 32 cores each with 192.1g
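The numbers in the log above are consistent with a simple rule of thumb: take the largest core request among the programs, cap it at the `-n` value, and multiply by the largest per-core memory request. The sketch below illustrates that arithmetic only. It is not bcbio's actual code; the function name `estimate_job_resources` is hypothetical, and the assumption that memory is specified per core is an interpretation of the log line `Resource requests: bwa, sambamba, samtools; memory: 2.00, 6.00, 2.00; cores: 8, 32, 16`.

```python
def estimate_job_resources(requests, total_cores):
    """Rough per-job estimate (illustrative, not bcbio's implementation).

    requests: mapping of program name -> (cores, memory in GB per core).
    total_cores: the value passed with -n.
    """
    # Assumption: the largest core request wins, capped at the -n value.
    cores_per_job = min(max(c for c, _ in requests.values()), total_cores)
    # Assumption: the largest per-core memory request is applied to every core.
    mem_per_core = max(m for _, m in requests.values())
    return cores_per_job, cores_per_job * mem_per_core


# Values copied from the "Resource requests" log line in this issue.
requests = {"bwa": (8, 2.0), "sambamba": (32, 6.0), "samtools": (16, 2.0)}
cores, mem_gb = estimate_job_resources(requests, total_cores=64)
print(f"{cores} cores per job, about {mem_gb:.0f}g per job")
```

Under these assumptions the result is 32 cores and roughly 192g, close to the reported `32 cores each with 192.1g`. It would also explain why bwa ran with `-t 32` rather than 8: the 32-core request listed for sambamba sets the per-job core count.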

When bcbio runs a pipe, it accounts for the fact that every command in the pipe consumes RAM, so it has to decrease the number of cores to fit into the available RAM. That is what happened in this command:

bwa mem -t 32 | bamsormadup threads=24

Still, these values are very high for this server. Memory is also consumed by the IO buffers. Try running bcbio with -n 7 or -n 10, and at most -n 20.

Large -n core counts only make sense in distributed bcbio runs, where the cores are requested across many servers.