ai4d-iasc / trixie

Scripts and documentation about trixie hpc

UCX Errors related to ulimit soft memlock set to 64K #81

Closed nrcfieldsa closed 2 years ago

nrcfieldsa commented 2 years ago

Certain MPI jobs run on trixie are now getting memory allocation errors during MPI_Init related to the UCX/InfiniBand drivers and a very small maximum locked memory limit being imposed artificially on the system following the upgrade. The system-wide settings in /etc/security/limits.conf are as follows (but are being ignored by systemd):

```
* soft memlock unlimited
* hard memlock unlimited
```

It was observed that jobs hitting these UCX errors do not currently start unless a line such as the following is added at the top of the job file: ulimit -l unlimited (or similar).

Adding ulimit -S -a at the top of the job file also shows that, even though the system is configured with unlimited max memlock, the soft memlock limit is set to 64 KB when Slurm creates the batch job shell process. Further troubleshooting shows that the job runs fine when this value is forced to unlimited. Similar reports online show other OpenMPI users hitting this issue, and the recommended fix is to set memlock to unlimited so that the InfiniBand drivers can allocate their completion queues.
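For reference, a minimal sketch of a job script using this workaround (the #SBATCH options and program name are placeholders, not taken from the failing jobs):

```bash
#!/bin/bash
#SBATCH --job-name=ucx-memlock-test   # hypothetical job name
#SBATCH --nodes=2                     # hypothetical resource request

# Raise the soft locked-memory limit before MPI initializes;
# otherwise the 64 KB limit inherited from slurmd triggers UCX allocation errors.
ulimit -l unlimited

# Optional: print the soft limits actually in effect, for debugging.
ulimit -S -a

mpirun ./my_mpi_program               # placeholder program name
```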

nrcfieldsa commented 2 years ago

A bit of investigation of this problem in OpenMPI on recent Linux hosts:

- /etc/security/limits.conf is ignored by systemd on purpose
    - values listed there do not take effect in services started by systemd
    - systemd also seems to ignore the service-specific LimitMEMLOCK=infinity in /usr/lib/systemd/system/slurmd.service (unclear whether that is a bug or whether the value is simply overridden later)
    - by the time the job script starts, it already has ulimit -S -l 64 in effect, even though the system-wide settings would grant ulimit -l unlimited
    - something (not yet identified) is setting the 64 KB value in the system before slurmd inherits it, or perhaps after slurmd starts the batch shell child process
- user workaround #1: ulimit -l unlimited in the job file
- user workaround #2: start the job file with #!/bin/bash -l (a login shell, which sources the system/user profile and picks up any ulimit lines there)
- user workaround #3: use a wrapper script that first calls ulimit -l unlimited before starting the program command, which is launched on each compute node by srun or mpirun (see the sketch after this list)
- propose.sol#1: update /etc/systemd/system.conf to include the line DefaultLimitMEMLOCK=infinity, then reboot the host
- propose.sol#2: update /etc/slurm/slurm.conf to include the line PropagateResourceLimitsExcept=MEMLOCK
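A sketch of workaround #3 and the two proposed configuration changes, assuming standard file locations (the wrapper script name and the program command are hypothetical):

```bash
#!/bin/bash
# run_with_memlock.sh -- hypothetical wrapper launched on each compute node,
# e.g.  srun -n <tasks> ./run_with_memlock.sh ./my_mpi_program args
ulimit -l unlimited   # raise the soft memlock limit in the task's own shell
exec "$@"             # then replace the wrapper with the real program command
```

```ini
# propose.sol#1: /etc/systemd/system.conf (takes effect after a reboot)
[Manager]
DefaultLimitMEMLOCK=infinity
```

```ini
# propose.sol#2: /etc/slurm/slurm.conf
PropagateResourceLimitsExcept=MEMLOCK
```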

It may be desirable to set a different MEMLOCK limit for root/system users than for regular users - or to set a very high limit for regular users, below the RAM available to userspace jobs - in order to prevent a runaway process from impacting node performance.
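As a rough illustration, such a split could be expressed in /etc/security/limits.conf along these lines (though, as noted above, systemd-started services do not read this file; the 64 GB figure is purely illustrative, not a recommendation from this thread):

```
# /etc/security/limits.conf -- illustrative values only
root  soft  memlock  unlimited
root  hard  memlock  unlimited
*     soft  memlock  67108864   # 64 GB expressed in KB, kept below node RAM
*     hard  memlock  67108864
```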

nrcfieldsa commented 2 years ago

It is also mentioned in a Compute Canada slide deck that one can set the ulimit prior to srun as follows (since Slurm will inherit the environment from the calling shell):

```bash
ulimit -l unlimited
srun -n nodes command args
```

That may eliminate any need for a wrapper script.

nrcfieldsa commented 2 years ago

Resolving this issue as it is not specific to the trixie cluster, but appears upstream in recent releases of RedHat/CentOS Linux with OpenMPI. Solution: ulimit -l unlimited