chrisknewman / tusas

out of memory errors on rocinante #149

Open chrisknewman opened 11 months ago

chrisknewman commented 11 months ago

slurmstepd: error: Detected 1 oom_kill event in StepId=440575.0. Some of the step tasks have been OOM Killed.
srun: error: nid001277: tasks 1-3: Out Of Memory
srun: Terminating StepId=440575.0
slurmstepd: error: STEP 440575.0 ON nid001277 CANCELLED AT 2023-12-05T07:16:01

Logging on to the node and running free -lm:

nid001310:~$ free -lm
              total        used        free      shared  buff/cache   available
Mem:         257088      131148      135934        1120        2517      125939
Low:         257088      121153      135934
High:             0           0           0
Swap:             0

shows memory use is increasing. I have not seen this behavior on other systems.
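
For anyone trying to reproduce this, a rough way to confirm the growth is to sample the "used" column of free repeatedly while the job runs. The sketch below is only an illustration and is not part of tusas: the helper names (parse_used_mib, sample_used_mib, is_steadily_increasing), the sample count, and the interval are all hypothetical choices, and it assumes a Linux node with the procps free command available.

```python
import subprocess
import time


def parse_used_mib(free_output: str) -> int:
    """Return the 'used' value (MiB) from the 'Mem:' row of `free -m` style output."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()  # e.g. ['total', 'used', 'free', ...]
    used_index = header.index("used")
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "Mem:":
            # fields[0] is the row label, so data columns are shifted by one.
            return int(fields[1 + used_index])
    raise ValueError("no 'Mem:' row found in free output")


def sample_used_mib(samples: int = 5, interval_s: float = 10.0) -> list[int]:
    """Run `free -lm` several times and collect the used-memory readings.

    The sample count and interval are arbitrary defaults (assumptions).
    """
    readings = []
    for i in range(samples):
        result = subprocess.run(
            ["free", "-lm"], capture_output=True, text=True, check=True
        )
        readings.append(parse_used_mib(result.stdout))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


def is_steadily_increasing(readings: list[int]) -> bool:
    """True if every reading is strictly larger than the one before it."""
    return len(readings) >= 2 and all(
        later > earlier for earlier, later in zip(readings, readings[1:])
    )


if __name__ == "__main__":
    used = sample_used_mib()
    print("used MiB per sample:", used)
    print("steadily increasing:", is_steadily_increasing(used))
```

Running this on the compute node alongside the job prints one reading per sample. A strictly increasing sequence is consistent with a leak but does not prove one, since page cache and other processes also affect the numbers.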

chrisknewman commented 11 months ago

Similar fixes need to be applied in other branches.