Open cjfields opened 2 years ago
I just tried a new job with a 24-core allocation via SLURM and lower parallelism settings:
nanodisco difference \
-nj 2 -nc 1 -p 2 -f 1 -l 4 \
...
It's running two R processes in parallel via `-nj 2`, but then each one (when it reaches the normalize step) splits into two forked processes (`-p 2`) and starts eating up cores again. Here is a snapshot:
EDIT: also note the load avg is very high; other users are on this node.
I've been doing a bit of follow-up on this, and I think I've found a workaround in case others run into the same issue in the meantime. It does look like several steps in `nanodisco difference` could possibly be optimized further (the `bwa` and `nanopolish` steps), but the main bottleneck seems to be the normalization step.
Basically, it looks like (at least for the normalization step) `registerDoMC` is called with the thread-count setting (`-p`). This spawns `-p` forks, and each fork can apparently be multi-threaded itself (it's not obvious where this is occurring). I suspect each fork requests whatever threads are available on the node; I would need to check this interactively to confirm. Our node has 72 cores, with 24 allocated for this job by SLURM.
I think this could be pinned down internally within R, but one workaround on Linux is to set OpenMP environment variables that limit the total and per-process thread counts. When I set these (note that the product of `OMP_NUM_THREADS` and `-p` equals the allocated number of threads, 24):
export OMP_THREAD_LIMIT=$SLURM_NPROCS # 24 allocated
export OMP_NUM_THREADS=4
nanodisco difference -nj 1 -nc 2 -p 6 -f 1 -l 2 \
...
The load and core usage are much more manageable.
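The arithmetic above can be made explicit so the thread cap always matches the allocation. A minimal sketch, assuming the job runs under SLURM (the fallback value 24 is only for running outside a scheduler, and `P=6` mirrors the `-p` value):

```shell
# Keep (forks started by -p) x (OpenMP threads per fork) equal to the
# SLURM allocation so the job never oversubscribes its cores.
NPROCS=${SLURM_NPROCS:-24}        # cores allocated by SLURM (24 here)
P=6                               # forks passed to nanodisco via -p
export OMP_THREAD_LIMIT=$NPROCS   # hard cap on total OpenMP threads
export OMP_NUM_THREADS=$((NPROCS / P))  # 24 / 6 = 4 threads per fork
```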
I also noticed there is at least one more step ('Computing stats by genomic position.') that spawns forks, but it doesn't seem to have the same issue as the above.
Hi @cjfields,
Thank you very much for the detailed reports and for digging into the issue further.
If I understand correctly, this can become problematic when `nanodisco` is run outside a job scheduler and eats up too many resources. This is indeed annoying, but I don't see an obvious explanation, so I'll need to investigate. I'll add it to my todo list for the next update, and hopefully I can find and fix the root cause. I'll leave the issue open until then.
Alan
@touala Partly; the above was run in an interactive SLURM job, where I believe the number of cores is pinned to the interactive session. When I managed to get a wrapper that runs `nanodisco difference` as a batch script using a job array, I needed to ensure that SLURM's CPUs-per-task was set for the OpenMP environment settings to work as expected.
It would be interesting to see if this is something that others have run into.
@touala Just a note that we did manage to get (nice!) results from this run. On SLURM we ended up splitting the job into chunks and running them with `-p 6`, giving the job one extra core for the parent R task. We used the settings below (lots of memory was necessary):
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -J nd_diff
#SBATCH --mem=120g
#SBATCH --ntasks=1
#SBATCH -N 1 #nodes
#SBATCH --cpus-per-task=7
#SBATCH -p hpcbio #queue
# ----------------Load Modules--------------------
module load nanodisco
export OMP_THREAD_LIMIT=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=1
...
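The chunked array submission described above could look roughly like the following. This is a hypothetical sketch, not our exact command: the array range, the `-nc` value, and the remaining `nanodisco` arguments are illustrative only (the command is echoed rather than executed, dry-run style):

```shell
# Submit one array task per genome chunk; each task gets the 7 cores
# requested above (-p 6 forks plus 1 for the parent R process):
#   sbatch --array=1-10 nd_diff.sbatch
# Inside the batch script, each task handles only its own chunk via
# the -f/-l (first/last chunk) options:
CHUNK=${SLURM_ARRAY_TASK_ID:-1}   # defaults to 1 outside an array job
echo nanodisco difference -nj 1 -nc 1 -p 6 -f "$CHUNK" -l "$CHUNK"
```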
Hi, I wanted to report an issue when running `nanodisco difference`, which appears related to #18. We have a large compute node on a SLURM cluster with 72 cores and 2 TB of memory. We allocate 24 cores and run `nanodisco difference` using Singularity as follows, at which point the following output is generated:
At the Normalization step, I am seeing four R processes running in parallel (which appears correct), but each process seems to compete for all 24 cores in the SLURM allocation, which drives the server load up significantly and appears to slow computation dramatically. I suspect the issue is in `normalize.data.parallel`, but nothing in particular stands out. We could possibly debug this using a `singularity build` environment, but do you have any suggestions for a workaround?

I should add that the chunk does eventually finish, and the other processes seem fine with respect to CPU utilization (memory isn't an issue); however, on most HPC systems I work on these jobs would be killed by cluster admins. Any help would be greatly appreciated.