aimarz opened 1 month ago
@aimarz can you post a reproducer and the whole log?
Yes, this seems similar to what I've observed with 15B on 24.07. Locally, checkpointing a 15B/TP4 model (checkpoint size: 205 GB) reserves 70 GB of process memory plus 290 GB of CPU/buffered disk I/O memory (360 GB total). When the run continues, I sometimes see an OOM/SIGKILL.
After running the following (outside Docker), most of that memory gets released:

```shell
free && sync && echo 3 > /proc/sys/vm/drop_caches && free
```
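For what it's worth, you can also see where that memory is sitting before dropping the caches: buffered checkpoint writes show up under `Buffers`/`Cached`/`Dirty` in `/proc/meminfo` rather than as process RSS. A quick check, assuming a standard Linux kernel:

```shell
# Show reclaimable page cache vs. truly free memory (values in kB);
# large Cached/Dirty figures right after a checkpoint are usually
# buffered disk I/O, not memory leaked by the training processes.
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|Dirty):' /proc/meminfo
```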
In my case it happens even when TP_size=1 and PP_size=1, that is, using only DP.
@mikolajblaz here I attach the full logs as well as the configuration file used in the experiment (Llama 2 7B continual pretraining using `NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py`).
So far, no more OOMs/SIGKILLs with the free-memory target set to 64 GB, though I was only seeing the crash sporadically anyway:

```shell
sysctl -w vm.min_free_kbytes=$((64 * 1024 * 1024))
```
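In case it helps others: `vm.min_free_kbytes` is specified in kB, so the 64 GB target works out to 67108864. A small sketch for double-checking the value and persisting it; the drop-in file name below is my own choice, not a NeMo or Slurm convention:

```shell
# vm.min_free_kbytes is in kB, so a 64 GB target is:
target_kb=$((64 * 1024 * 1024))
echo "vm.min_free_kbytes target: ${target_kb} kB"   # 67108864 kB

# Current active value (readable without root):
cat /proc/sys/vm/min_free_kbytes

# To persist across reboots (root required), one option is a drop-in file:
#   echo "vm.min_free_kbytes = ${target_kb}" | tee /etc/sysctl.d/90-min-free.conf
#   sysctl --system
```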
Hello, I'm facing a similar issue. @dchichkov would you mind sharing the output of:

```shell
scontrol show node <node-name> | grep CfgTRES
```
In our case it's:

```
CfgTRES=cpu=192,mem=1000000M,billing=192,gres/gpu=8
```
However, when I check the total amount of memory available on that node:

```shell
cat /proc/meminfo | grep MemTotal
```

it shows:

```
MemTotal: 1056270564 kB
```
I suspect this is simply a mismatch between what's configured in Slurm and the total amount of memory on the node, and that by setting the free-memory target to 64 GB you force the OS to keep the allocated memory below what's configured in Slurm.
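Plugging in the figures from this thread, the kernel does report about 31 GB more memory than Slurm accounts for; page cache from checkpoint writes can silently eat that headroom. The numbers below are copied from the posts above:

```shell
# Figures from this thread: Slurm's CfgTRES mem= value and the
# kernel's MemTotal (the latter is reported in kB).
slurm_mb=1000000          # mem=1000000M from scontrol show node
total_kb=1056270564       # MemTotal from /proc/meminfo

total_mb=$((total_kb / 1024))
gap_mb=$((total_mb - slurm_mb))
echo "MemTotal: ${total_mb} MB, Slurm limit: ${slurm_mb} MB, gap: ${gap_mb} MB"
```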
I'm also using the 24.07 NeMo image. Setting the free-memory target to 64 GB also solved the problem for me.
I have tried training Llama 2, Llama 3.1, and Mistral models using the new 24.07 container, but the process errors out after 200 or 300 steps. It always happens while a checkpoint is being created. I have made sure to set `cpu_offloading: false` and `use_cpu_initialization: false`. Training works fine with the 24.05 version and the exact same configuration, so I don't understand why this is happening. I would appreciate some help on this matter.
These are the system specifications (I use 8 nodes for training):
And this is the error I get in SLURM:

```
srun: error: as01r3b17: task 3: Out Of Memory
srun: Terminating StepId=4777403.0
slurmstepd: error: STEP 4777403.0 ON as01r3b17 CANCELLED AT 2024-08-16T19:36:54
slurmstepd: error: Detected 1 oom_kill event in StepId=4777403.0. Some of the step tasks have been OOM Killed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 23 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 23 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
slurmstepd: error: Detected 1 oom_kill event in StepId=4777403.0. Some of the step tasks have been OOM Killed.
```