NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

CPU memory keeps increasing at every step when training an LLM with the NeMo framework? #9727

Closed: cuonglp1713 closed this issue 1 week ago

cuonglp1713 commented 1 month ago

Describe the bug

I'm training Llama3-8b with the NeMo framework on 4 A100 80GB GPUs. During training, my CPU RAM keeps increasing at every step, and the memory is consumed as buff/cache. My server has 2TB of CPU RAM, and it reached 100% after just 20 steps (my epoch has 900 steps in total).

Edit: Both my shared and cache memory increase continuously (see the attached screenshot). Soon my training task will collapse.
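
For anyone trying to reproduce the measurement, below is a minimal sketch of how the per-step host memory growth could be logged. This is an illustration only, not part of the script used above; it assumes psutil is installed and that a callback can be attached to the underlying PyTorch Lightning Trainer that NeMo uses.

# Sketch only: log host memory after every training batch to confirm per-step growth.
# Assumes psutil is available and a standard PyTorch Lightning Trainer; wiring this
# into the NeMo example script is not shown here.
import psutil
from pytorch_lightning import Callback


class HostMemoryLogger(Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        vm = psutil.virtual_memory()                       # system-wide memory stats
        rss_gb = psutil.Process().memory_info().rss / 1e9  # this process's resident set
        print(
            f"step {trainer.global_step}: "
            f"process RSS {rss_gb:.1f} GB, "
            f"system used {vm.used / 1e9:.1f} GB, "
            f"cached {getattr(vm, 'cached', 0) / 1e9:.1f} GB"
        )

If the process RSS stays roughly flat while only the cached number climbs, the growth is likely reclaimable page cache from data/checkpoint I/O rather than a true leak.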

Steps/Code to reproduce bug

Here is my training script (https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html):

torchrun --nproc_per_node=4 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
   trainer.devices=4 \
   trainer.num_nodes=1 \
   trainer.val_check_interval=50 \
   trainer.max_epochs=1 \
   trainer.max_steps=900 \
   model.restore_from_path="my_model.nemo" \
   model.micro_batch_size=1 \
   model.global_batch_size=256 \
   model.tensor_model_parallel_size=4 \
   model.pipeline_model_parallel_size=1 \
   model.megatron_amp_O2=True \
   model.sequence_parallel=True \
   model.activations_checkpoint_granularity=selective \
   model.activations_checkpoint_method=uniform \
   model.optim.name=distributed_fused_adam \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   model.peft.peft_scheme=none \
   model.data.train_ds.file_names="my_train_data.jsonl" \
   model.data.validation_ds.file_names="my_val_data.jsonl" \
   model.data.test_ds.file_names="my_test_data.jsonl" \
   model.data.train_ds.concat_sampling_probabilities="[1]" \
   model.data.train_ds.max_seq_length=8192 \
   model.data.validation_ds.max_seq_length=8192 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=256 \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=32 \
   model.data.test_ds.micro_batch_size=1 \
   model.data.test_ds.global_batch_size=32 \
   +model.data.num_workers=4 \
   model.data.validation_ds.metric.name=loss \
   model.data.test_ds.metric.name=loss \
   exp_manager.create_wandb_logger=False \
   exp_manager.explicit_log_dir=results_v2 \
   exp_manager.resume_if_exists=False \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=step \
   exp_manager.checkpoint_callback_params.mode='max' \
   exp_manager.checkpoint_callback_params.save_top_k=4 \
   exp_manager.checkpoint_callback_params.save_best_model=True \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True

Environment overview

Environment location: NeMo framework container nvcr.io/nvidia/nemo:24.03.01.framework

cuonglp1713 commented 1 month ago

I realize that CPU memory leaks at every training step, but I don't know why. I tried changing batch_size, num_workers, and log_every_n_steps, but nothing works. Does anyone have an idea? @dimapihtar

dimapihtar commented 1 month ago

Hi, could you try adding these params to your run?

model.use_cpu_initialization=False
model.cpu_offloading=False

cuonglp1713 commented 1 month ago

I added those two params to my script, but the behavior is still the same. @dimapihtar

dimapihtar commented 1 month ago

> I added those two params to my script, but the behavior is still the same. @dimapihtar

I tried to run a job with exactly the same config as you attached, but I don't have this issue. Could you try the newer 24.05 container? nvcr.io/nvidia/nemo:24.05

cuonglp1713 commented 1 month ago

> I tried to run a job with exactly the same config as you attached, but I don't have this issue. Could you try the newer 24.05 container? nvcr.io/nvidia/nemo:24.05

That's weird! I tried two NeMo versions, nvcr.io/nvidia/nemo:24.03.framework and nvcr.io/nvidia/nemo:24.05.01, but the issue remains the same. I will try nvcr.io/nvidia/nemo:24.05, but I don't think it will be any different.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

GHGmc2 commented 50 minutes ago

Did you run with torchrun? Try commenting out mp.set_start_method("spawn", force=True) in megatron_gpt_finetuning.py.
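
For reference, a sketch of the suggested edit, assuming mp in that script refers to torch.multiprocessing; the surrounding lines are illustrative, not copied from the NeMo source:

# In /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py
# (assumes mp is torch.multiprocessing; context is illustrative).
import torch.multiprocessing as mp

# When launching with torchrun, forcing the "spawn" start method makes each
# dataloader worker re-import modules and copy state instead of sharing pages
# via fork, which can inflate host memory. Commenting it out falls back to the
# platform default start method.
# mp.set_start_method("spawn", force=True)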