Open cjfields opened 2 years ago
I just tried a new job with a 24-core allocation via SLURM and lower parallelism settings:
nanodisco difference \
-nj 2 -nc 1 -p 2 -f 1 -l 4 \
...
It's running two R processes in parallel via `-nj 2`, but then each one (when it reaches the normalize step) splits into two forked processes (`-p 2`) and starts eating up cores again. Here is a snapshot:
EDIT: also note the load avg is very high; other users are on this node.
I've been doing a bit of follow-up on this, and I think I've found a workaround in case others run into the same issue in the meantime. It does look like several steps in `nanodisco difference` could possibly be optimized further (the `bwa` and `nanopolish` steps), but the main bottleneck seems to be the normalization step.
Basically, it looks like (at least for the normalization step) `registerDoMC` is called with the thread-count setting (`-p`). This spawns `-p` forks, and each fork can apparently be multi-threaded itself (it's not obvious where this is occurring). I suspect each fork requests whatever threads are available on the node; I would need to check this interactively to confirm. Our node has 72 cores, with 24 allocated for this job by SLURM.
I think this could be pinned down internally within R, but one workaround on Linux is to set OpenMP environment variables that limit the total and per-process thread counts. When I set these (note that the product of `OMP_NUM_THREADS` and `-p` equals the allocated number of threads, 24):
export OMP_THREAD_LIMIT=$SLURM_NPROCS # 24 allocated
export OMP_NUM_THREADS=4
nanodisco difference -nj 1 -nc 2 -p 6 -f 1 -l 2 \
...
The load and core usage are much more manageable.
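The arithmetic above can be made explicit so the thread cap always matches the allocation. A minimal sketch, assuming the job runs under SLURM (the fallback value 24 is only for running outside a scheduler, and `P=6` mirrors the `-p` value):

```shell
# Keep (forks started by -p) x (OpenMP threads per fork) equal to the
# SLURM allocation so the job never oversubscribes its cores.
NPROCS=${SLURM_NPROCS:-24}        # cores allocated by SLURM (24 here)
P=6                               # forks passed to nanodisco via -p
export OMP_THREAD_LIMIT=$NPROCS   # hard cap on total OpenMP threads
export OMP_NUM_THREADS=$((NPROCS / P))  # 24 / 6 = 4 threads per fork
```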
I also noticed there is at least one more step ('Computing stats by genomic position.') that spawns forks, but it doesn't seem to have the same issue as the above.
Hi @cjfields,
Thank you very much for the detailed reports and for digging into the issue further.
If I understand correctly, this can become problematic when `nanodisco` is run outside a job scheduler and eats up too many resources. This is indeed annoying, but I don't see an obvious explanation, so I'll need to investigate. I'll add it to my todo list for the next update, and hopefully I can find and fix the root cause. I'll leave the issue open until then.
Alan
@touala Partly; the above was run in an interactive SLURM job, where I believe the number of cores is pinned to the interactive session. When I managed to get a wrapper that runs `nanodisco difference` as a batch script using a job array, I needed to ensure that SLURM's CPUs-per-task was set for the OpenMP environment settings to work as expected.
It would be interesting to see if this is something that others have run into.
@touala Just a note that we did manage to get (nice!) results from this run. On SLURM we ended up splitting the job into chunks and running them with `-p 6`, giving the job one extra core for the parent R task. We used the settings below (lots of memory was necessary):
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -J nd_diff
#SBATCH --mem=120g
#SBATCH --ntasks=1
#SBATCH -N 1 #nodes
#SBATCH --cpus-per-task=7
#SBATCH -p hpcbio #queue
# ----------------Load Modules--------------------
module load nanodisco
export OMP_THREAD_LIMIT=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=1
...
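The chunked array submission described above could look roughly like the following. This is a hypothetical sketch, not our exact command: the array range, the `-nc` value, and the remaining `nanodisco` arguments are illustrative only (the command is echoed rather than executed, dry-run style):

```shell
# Submit one array task per genome chunk; each task gets the 7 cores
# requested above (-p 6 forks plus 1 for the parent R process):
#   sbatch --array=1-10 nd_diff.sbatch
# Inside the batch script, each task handles only its own chunk via
# the -f/-l (first/last chunk) options:
CHUNK=${SLURM_ARRAY_TASK_ID:-1}   # defaults to 1 outside an array job
echo nanodisco difference -nj 1 -nc 1 -p 6 -f "$CHUNK" -l "$CHUNK"
```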
Hi, I wanted to report an issue when running `nanodisco difference`, which appears related to #18. We have a large compute node on a SLURM cluster with 72 cores and 2 TB of memory. We allocate 24 cores and run `nanodisco difference` using Singularity as follows, at which point the following output is generated:
At the Normalization step, I am seeing four R processes running in parallel (which appears correct), but each process seems to compete for all 24 cores in the SLURM allocation, which drives the server load up significantly and appears to slow computation dramatically. I suspect the issue is in `normalize.data.parallel`, but nothing in particular stands out. We could possibly debug this using a `singularity build` environment, but do you have any suggestions for a workaround?

I should add that the chunk does eventually finish, and the other processes seem fine with respect to CPU utilization (memory isn't an issue); however, on most HPC systems I work on these jobs would be killed by cluster admins. Any help would be greatly appreciated.