slurmstepd: error: Detected 1 oom_kill event in StepId=440575.0. Some of the step tasks have been OOM Killed.
srun: error: nid001277: tasks 1-3: Out Of Memory
srun: Terminating StepId=440575.0
slurmstepd: error: STEP 440575.0 ON nid001277 CANCELLED AT 2023-12-05T07:16:01
Logging on to the node and running:
nid001310:~$ free -lm
              total        used        free      shared  buff/cache   available
Mem:         257088      131148      135934        1120        2517      125939
Low:         257088      121153      135934
High:             0           0           0
Swap:             0
shows that memory use is increasing. I have not seen this behavior on other systems.
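A single `free` snapshot does not show a trend, so one way to confirm the growth is to sample memory at intervals while the job runs. Below is a minimal sketch that reads `/proc/meminfo`. The helper names (`parse_meminfo`, `used_mib`, `watch`) and the 10-second interval are my own illustrative choices, not part of Slurm or any site tooling. It also assumes "used" means `MemTotal - MemAvailable`, which may differ from the `used` column that `free` reports.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        # Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def used_mib(info):
    """Memory in use in MiB, taken here as MemTotal - MemAvailable (an assumption)."""
    return (info["MemTotal"] - info["MemAvailable"]) // 1024


def watch(interval=10, count=6, path="/proc/meminfo"):
    """Print `count` samples, one every `interval` seconds (hypothetical helper)."""
    for _ in range(count):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print(time.strftime("%H:%M:%S"), used_mib(info), "MiB used")
        time.sleep(interval)
```

Calling `watch()` on the compute node while the step is running should show whether the used figure climbs steadily until the OOM kill.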