Closed: cuonglp1713 closed this issue 1 week ago
I noticed that CPU memory leaks at every training step, but I don't know why. I tried changing batch_size, num_workers, and log_every_n_steps, but nothing works. Does anyone have an idea? @dimapihtar
Hi, could you try adding these params to your run?
model.use_cpu_initialization=False
model.cpu_offloading=False
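If it is easier than passing overrides on the command line, here is a minimal sketch of the same change applied directly to the Hydra/OmegaConf config inside the training script (the field names are just the two above; verify they exist in your config version):

```python
from omegaconf import open_dict

def apply_memory_overrides(cfg):
    # open_dict allows setting the keys even if the config schema is struct-locked.
    with open_dict(cfg):
        cfg.model.use_cpu_initialization = False  # build weights on GPU instead of host RAM
        cfg.model.cpu_offloading = False          # keep params/activations off host RAM
    return cfg
```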
I added those two params to my script, but the behavior is still the same. @dimapihtar
I tried to run a job with exactly the same config as the one you attached, but I don't see this issue. Could you try the newer 24.05 container?
nvcr.io/nvidia/nemo:24.05
That's weird! I tried two NeMo versions, nvcr.io/nvidia/nemo:24.03.framework and nvcr.io/nvidia/nemo:24.05.01, and the issue remains the same in both. I will try nvcr.io/nvidia/nemo:24.05, but I don't think it will make a difference.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Did you run with torchrun? Try commenting out mp.set_start_method("spawn", force=True) in megatron_gpt_finetuning.py.
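For reference, a minimal sketch of where that change would go (the exact surrounding code depends on your NeMo version, so treat this as illustrative only):

```python
# Near the top of megatron_gpt_finetuning.py:
import multiprocessing as mp

# Forcing the "spawn" start method makes every DataLoader worker a fresh process
# that re-imports the training modules, which can inflate host memory.
# Commenting it out falls back to the platform default ("fork" on Linux), where
# workers share the parent's memory pages copy-on-write.
#
# mp.set_start_method("spawn", force=True)
```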
Describe the bug
I'm training Llama3-8B with the NeMo framework on 4 A100 80GB GPUs. During training, my CPU RAM keeps increasing at every step, and the growth shows up as buff/cache memory. My server has 2 TB of CPU RAM, and it reached 100% after just 20 steps (my epoch has 900 steps in total).
Edit: Both my shared and cache memory increase continuously. Soon my training job will crash.
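A simple way to confirm the per-step growth is to log host memory after every batch. Below is a minimal, hypothetical sketch (the callback name is mine, psutil must be installed, and the cached/buffers fields are Linux-specific); since NeMo trainers are PyTorch Lightning trainers, a standard callback can be attached:

```python
import psutil
from pytorch_lightning import Callback

class HostMemoryLogger(Callback):
    """Log process RSS and system buff/cache after each training batch."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        vm = psutil.virtual_memory()
        rss_gb = psutil.Process().memory_info().rss / 1e9
        cache_gb = (vm.cached + vm.buffers) / 1e9  # Linux-only attributes
        print(
            f"step={trainer.global_step} rss={rss_gb:.1f} GB "
            f"used={vm.used / 1e9:.1f} GB buff/cache={cache_gb:.1f} GB"
        )

# Usage sketch: trainer.callbacks.append(HostMemoryLogger()) before trainer.fit(model)
```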
Steps/Code to reproduce bug
Here is my training script (https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html):
Environment overview
Environment location: NeMo framework container nvcr.io/nvidia/nemo:24.03.01.framework