NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Out of RAM using 24.07 container #10192

Open · aimarz opened this issue 1 month ago

aimarz commented 1 month ago

I have tried training Llama 2, Llama 3.1, and Mistral models using the new 24.07 container, but the process errors out after 200 or 300 steps. It always happens while a checkpoint is being written. I have made sure to set cpu_offloading: false and use_cpu_initialization: false.

Training works fine with the 24.05 container and the exact same configuration, so I don't understand why this is happening. I would appreciate some help with this.

These are the system specifications (I use 8 nodes for training):

And this is the error I get in SLURM:

srun: error: as01r3b17: task 3: Out Of Memory
srun: Terminating StepId=4777403.0
slurmstepd: error: STEP 4777403.0 ON as01r3b17 CANCELLED AT 2024-08-16T19:36:54
slurmstepd: error: Detected 1 oom_kill event in StepId=4777403.0. Some of the step tasks have been OOM Killed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 23 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 23 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
slurmstepd: error: Detected 1 oom_kill event in StepId=4777403.0. Some of the step tasks have been OOM Killed.

mikolajblaz commented 1 month ago

@aimarz can you post a reproducer and the whole log?

dchichkov commented 1 month ago

Yes, this looks similar to what I've observed with 15B / 24.07. Locally, I've noticed that checkpointing a 15B/TP4 model (checkpoint size is 205 GB) reserves 70 GB of process memory and 290 GB of CPU/buffered disk I/O memory (360 GB total), and when the run continues I sometimes see an OOM/SIGKILL.

After running the following (outside Docker), most of that memory gets released:

free && sync && echo 3 > /proc/sys/vm/drop_caches && free
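
Not from the thread, but if you want to watch the buffered-I/O memory build up while a checkpoint is being written, something along these lines on the affected node should show it (the 5-second interval and the fields picked are arbitrary choices):

# Poll page-cache and dirty-page growth during checkpointing.
# "Dirty"/"Writeback" is data still buffered for disk I/O; "Cached" is the page cache.
watch -n 5 "grep -E 'MemFree|Cached|Dirty|Writeback' /proc/meminfo && free -h"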

aimarz commented 1 month ago

In my case it happens even with TP_size=1 and PP_size=1, i.e., using only data parallelism (DP).

@mikolajblaz I'm attaching the full logs as well as the configuration file used in the experiment (Llama 2 7B continual pretraining with NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py).

issue_config.yaml.txt issue_log_ERR.txt issue_log_OUT.txt

dchichkov commented 1 month ago

So far, no more OOMs/SIGKILLs with the free memory target set to 64 GB, although I was only seeing the crash sporadically:

sysctl -w vm.min_free_kbytes=$((64 * 1024 * 1024))
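
If this does turn out to be the fix, one way to make it survive reboots (assuming root access on the compute nodes; the file name below is just an example) is to drop it into sysctl.d:

# Persist the 64 GB free-memory target across reboots (value is in kB: 64 * 1024 * 1024).
echo 'vm.min_free_kbytes = 67108864' | sudo tee /etc/sysctl.d/90-min-free.conf
sudo sysctl --system   # reload all sysctl configuration files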

evellasques commented 3 weeks ago

Hello, I'm facing a similar issue. @dchichkov would you mind sharing the output of:

scontrol show node <node-name> | grep CfgTRES

In our case it's:

CfgTRES=cpu=192,mem=1000000M,billing=192,gres/gpu=8

However, when I check the total amount of memory on that node with:

cat /proc/meminfo | grep MemTotal

it shows:

MemTotal:       1056270564 kB

I suspect this is simply a mismatch between what's configured in Slurm and the total amount of memory on the node, and that by setting the free memory target to 64 GB you just force the OS to keep the allocated memory below what's configured in Slurm.
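
One way to sanity-check that theory (the Slurm fields below are standard, but <node-name> and <jobid> are placeholders for your own node and failed job) is to compare the node's Slurm-configured memory against its physical memory and look at the peak memory Slurm accounted for the OOM-killed step:

# Slurm's configured memory for the node vs. what the kernel reports.
scontrol show node <node-name> | grep -E 'RealMemory|CfgTRES'
grep MemTotal /proc/meminfo

# Peak resident memory Slurm recorded for the failed step.
sacct -j <jobid> --format=JobID,State,MaxRSS,MaxVMSize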

I'm also using the 24.07 NeMo image.

Setting the free memory target to 64 GB also solved the problem for me.