QwenLM / CodeQwen1.5

CodeQwen1.5 is the code version of Qwen, the large language model series developed by Qwen team, Alibaba Cloud.

Continued pretraining of codeqwen1.5-7B uses abnormally large GPU memory, and OOM occurs after training for a while #76

Closed Cucunnber closed 1 month ago

Cucunnber commented 1 month ago

Problem description

When doing continued pretraining of codeqwen1.5-7B, GPU memory usage is abnormally high, and an OOM error occurs after training for a while.

ib125:     return F.cross_entropy(input, target, weight=self.weight,
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
ib125:     loss = loss_fct(shift_logits, shift_labels)
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ib125:     return self._call_impl(*args, **kwargs)
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ib125:     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ib125: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.50 GiB. GPU 3 has a total capacty of 79.15 GiB of which 7.69 GiB is free. Including non-PyTorch memory, this process 4 GiB memory in use. Of the allocated memory 47.39 GiB is allocated by PyTorch, and 23.23 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ib125:     return forward_call(*args, **kwargs)
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1179, in forward
ib125:     return F.cross_entropy(input, target, weight=self.weight,
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
ib125:     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ib125: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.50 GiB. GPU 1 has a total capacty of 79.15 GiB of which 7.39 GiB is free. Including non-PyTorch memory, this process 4 GiB memory in use. Of the allocated memory 47.40 GiB is allocated by PyTorch, and 23.52 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ib125: Traceback (most recent call last):
ib125:   File "/var/mntpkg/LLaMA-Factory-0.7.0/src/train.py", line 14, in <module>
ib125:     main()
ib125:   File "/var/mntpkg/LLaMA-Factory-0.7.0/src/train.py", line 5, in main
ib125:     run_exp()
ib125:   File "/var/mntpkg/LLaMA-Factory-0.7.0/src/llmtuner/train/tuner.py", line 31, in run_exp
ib125:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
ib125:   File "/var/mntpkg/LLaMA-Factory-0.7.0/src/llmtuner/train/pt/workflow.py", line 47, in run_pt
ib125:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
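
The allocator hint at the end of the OOM message ("reserved by PyTorch but unallocated ... try setting max_split_size_mb") only addresses fragmentation, not the size of the failing allocation itself, but it is cheap to try. A minimal sketch, assuming the variable is set before the first CUDA allocation (for example at the very top of the training entry point, or exported in the launch shell); the value shown is illustrative, not tuned:

    # Sketch only: apply the allocator hint from the OOM message above.
    # PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
    # so set it before importing torch (or export it in the launch shell).
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")  # 512 is a guess

    import torch  # noqa: E402  (imported after the env var is set)

    # After a few training steps, the gap between "reserved" and "allocated"
    # in this summary shows how much fragmentation remains.
    if torch.cuda.is_available():
        print(torch.cuda.memory_summary())

This does not shrink the ~22.5 GiB logits allocation that actually fails; it only reduces how much of each 80 GB card is lost to the caching allocator.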

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:1F:00.0 Off |                    0 |
| N/A   50C    P0             116W / 400W |  74831MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:25:00.0 Off |                    0 |
| N/A   65C    P0             148W / 400W |  69291MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off | 00000000:50:00.0 Off |                    0 |
| N/A   66C    P0             125W / 400W |  60269MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off | 00000000:55:00.0 Off |                    0 |
| N/A   52C    P0             125W / 400W |  36859MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          Off | 00000000:90:00.0 Off |                    0 |
| N/A   52C    P0             147W / 400W |  36783MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          Off | 00000000:95:00.0 Off |                    0 |
| N/A   66C    P0             163W / 400W |  36961MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          Off | 00000000:CB:00.0 Off |                    0 |
| N/A   64C    P0             123W / 400W |  60133MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          Off | 00000000:D1:00.0 Off |                    0 |
| N/A   50C    P0             140W / 400W |  36889MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+

System environment

When the OOM first occurred, I was using 2 nodes with 16 GPUs.

- `transformers` version: 4.41.1
- Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.23.1
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- Accelerate config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: True
        - num_processes: 16
        - machine_rank: 0
        - num_machines: 2
        - main_process_ip: 2.0.0.1
        - main_process_port: 9995
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'deepspeed_config_file': 'deepspeed_z2_config_bf16.json', 'deepspeed_multinode_launcher': 'standard', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Reproduction script

The training framework is LLaMA-Factory-0.7.0.

export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth10

model_path=codeqwen1.5-7B

dataset=codeqwen_0305

outputdir=codeqwen-pt-0527-new0305dataset

gradient_accumulation_steps=2

per_device_batchsize=2

epoch_num=2

learning_rate=1.5e-05

deepspeed  --hostfile hostfile.txt --master_addr=2.0.0.1 src/train.py --model_name_or_path $model_path  --stage pt \
--dataset $dataset \
--finetuning_type  full \
--overwrite_cache  true \
--flash_attn fa2 \
--preprocessing_num_workers 64 \
--template default \
--output_dir $outputdir \
--bf16  true  \
--lr_scheduler_type  cosine \
--do_train  true  \
--do_eval true \
--packing false \
--gradient_accumulation_steps  $gradient_accumulation_steps \
--gradient_checkpointing  true \
--learning_rate  $learning_rate \
--log_level  passive \
--logging_steps  10 \
--logging_strategy  steps \
--max_steps  -1 \
--num_train_epochs $epoch_num \
--report_to tensorboard \
--weight_decay 0.01 \
--cutoff_len 8192 \
--warmup_ratio 0.02 \
--eval_steps 200 \
--val_size 0.01 \
--evaluation_strategy steps \
--overwrite_output_dir  true  \
--per_device_train_batch_size  $per_device_batchsize \
--remove_unused_columns  true \
--save_strategy epoch \
--plot_loss \
--save_total_limit 3 \
--save_safetensors  true  \
--deepspeed=ds_z3_lr_schedule.json
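
For reference, a rough sketch of what the ZeRO-3 sharded model/optimizer state alone should cost per GPU with this setup; the parameter count is approximate, and activation memory plus the vocabulary-sized logits tensor (which is what the traceback shows failing) come on top of it:

    # Back-of-envelope ZeRO-3 state estimate (assumptions: ~7.25e9 params,
    # bf16 weights + bf16 grads + fp32 Adam master/momentum/variance, 16 GPUs).
    # Activations and the [batch, seq_len, vocab] logits tensor are NOT included.
    params = 7.25e9
    bytes_per_param = 2 + 2 + (4 + 4 + 4)   # = 16 bytes per parameter
    num_gpus = 16

    total_gib = params * bytes_per_param / 2**30
    per_gpu_gib = total_gib / num_gpus       # ZeRO-3 shards all three states across ranks

    print(f"total sharded state : {total_gib:.0f} GiB")    # ~108 GiB
    print(f"per-GPU share       : {per_gpu_gib:.1f} GiB")  # ~6.8 GiB

If this arithmetic roughly holds, most of the 37-75 GB seen in nvidia-smi is activations, logits, and allocator cache rather than sharded parameter/optimizer state, which is why sequence length and (eval) batch size dominate here.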

I have run model training many times before, and normally training a 7B model with this batch size and cutoff_len does not OOM. In addition, nvidia-smi shows that GPU memory allocation is very uneven across GPUs.

For now I am not sure whether the cause lies in the training framework or in the model architecture; I would appreciate it if someone could shed light on this.

cyente commented 1 month ago

You can try version 0.5.2 of the llama factory.

Cucunnber commented 1 month ago

You can try version 0.5.2 of the llama factory.

Thanks for the reply. I will try it later and report back with the results.

Cucunnber commented 1 month ago

You can try version 0.5.2 of the llama factory.

Same problem.

a100-80gb*16

batchsize: 4*4*16

sequence length: 8192

deepspeed ZeRO3


Tried to allocate 22.50 GiB. GPU 3 has a total capacty of 79.15 GiB of which 1.95 GiB is free. Including non-PyTorch memory, this process has 77.19 GiB memory in use. Of the allocated memory 53.30 GiB is allocated by PyTorch, and 23.07 GiB is reserved by PyTorch but unallocated.
cyente commented 1 month ago

batchsize: 4*4*16, i.e. gradient_accumulation_steps=4 and per_device_batchsize=4?

We recommend setting per_device_batchsize=1 and gradient_accumulation_steps=16 to keep the global batch size at 256.
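
Both settings give the same global batch; a quick check of the arithmetic, assuming the 16-GPU setup described above:

    # global batch = per_device_batchsize * gradient_accumulation_steps * num_gpus
    num_gpus = 16
    current     = 4 * 4  * num_gpus   # per_device_batchsize=4, gradient_accumulation_steps=4
    recommended = 1 * 16 * num_gpus   # per_device_batchsize=1, gradient_accumulation_steps=16
    print(current, recommended)       # 256 256 -- same global batch, ~4x smaller activation peak per GPU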

Cucunnber commented 1 month ago

batchsize: 4*4*16, i.e. gradient_accumulation_steps=4 and per_device_batchsize=4?

We recommend setting per_device_batchsize=1 and gradient_accumulation_steps=16 to keep the global batch size at 256.

Under per_device_batchsize=1 and gradient_accumulation_steps=16, GPU memory utilization is only about 50%. Training completes 1 epoch normally, but an OOM error occurs at 1.7 epochs. I ran into a similar situation before, without changing the batch-size parameters, where OOM occurred at 1.13 epochs. Is there something special about the CodeQwen model architecture? I have never hit this many OOM errors when training other models, and never with GPU memory utilization this low.

cyente commented 1 month ago

Did you use the same model size and sequence length for the other models? It looks like a straightforward case of insufficient GPU memory. You could try methods such as model parallelism.

Cucunnber commented 1 month ago

Did you use the same model size and sequence length for the other models? It looks like a straightforward case of insufficient GPU memory. You could try methods such as model parallelism.

  1. I have run a few training jobs before. For example, when I fine-tuned deepseekcoder-6.7b with 8192 seqlen, a global batch size of 256 (gradient_accumulation_steps=4, per_device_batchsize=4, 16 GPUs), and ZeRO3, the maximum per-GPU memory usage was around 40-50 GB (see the rough comparison after this list).

  2. Could you explain why the GPU memory usage is so uneven across GPUs, as shown in the nvidia-smi output in my original post?
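
A rough comparison of the logits tensor materialized for the loss in the two runs, assuming deepseek-coder's ~32k vocabulary and CodeQwen1.5's ~92k vocabulary (both approximate values from the models' configs) and the fp32 upcast that transformers typically applies before cross-entropy:

    # logits bytes = per_device_batch * seq_len * vocab_size * 4 (fp32)
    # Vocab sizes below are approximate assumptions, not measured values.
    def logits_gib(batch, seq_len, vocab, bytes_per=4):
        return batch * seq_len * vocab * bytes_per / 2**30

    seq_len = 8192
    print(f"deepseekcoder-6.7b, vocab ~32k: {logits_gib(4, seq_len, 32_256):.1f} GiB")   # ~3.9 GiB
    print(f"codeqwen1.5-7B,     vocab ~92k: {logits_gib(4, seq_len, 92_416):.1f} GiB")   # ~11.3 GiB

If those vocabulary sizes are right, the same batch and sequence length cost roughly 3x more at loss time on CodeQwen1.5, which is consistent with it OOMing where the deepseek run did not.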

    

cyente commented 1 month ago

We have not encountered this kind of situation before. Are there any other processes running?

Cucunnber commented 1 month ago

The issue has been resolved. It turned out that the OOM errors occurred during the evaluation runs triggered in the middle of training; the fix was to set do_eval to false in LLaMA-Factory. However, the memory allocation for CodeQwen1.5 is still peculiar: for instance, six GPUs use 60 GB of memory while the remaining two use 78 GB.
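
A plausible reading of why evaluation in particular was the step that died, assuming the launch command above left per_device_eval_batch_size at the HF Trainer default of 8 and that CodeQwen1.5's vocabulary is around 92k tokens: the fp32 logits for a single eval batch alone come out close to the 22.50 GiB allocation that fails in the traceback. A sketch of the arithmetic, with a gentler alternative to disabling evaluation:

    # Why eval can OOM while training fits: the eval batch defaults to 8 when
    # per_device_eval_batch_size is not passed, vs. the training batch of 1-2.
    # vocab ~92k is an assumption based on CodeQwen1.5's config.
    eval_batch = 8
    seq_len = 8192
    vocab = 92_416

    gib = eval_batch * seq_len * vocab * 4 / 2**30
    print(f"fp32 eval logits: ~{gib:.1f} GiB")   # ~22.6 GiB, close to the failed 22.50 GiB allocation

    # Instead of --do_eval false, it may be enough to pass
    # --per_device_eval_batch_size 1 (a standard HF TrainingArguments flag,
    # assuming LLaMA-Factory forwards it unchanged), keeping periodic eval
    # while capping the size of the eval-time logits tensor.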