You can try version 0.5.2 of LLaMA-Factory.
Thanks for the reply. I will try it later and report back with the latest results.
Same problem.
A100-80GB * 16
batch size: 4*4*16
sequence length: 8192
DeepSpeed ZeRO-3
Tried to allocate 22.50 GiB. GPU 3 has a total capacity of 79.15 GiB of which 1.95 GiB is free. Including non-PyTorch memory, this process has 77.19 GiB memory in use. Of the allocated memory 53.30 GiB is allocated by PyTorch, and 23.07 GiB is reserved by PyTorch but unallocated.
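For reference, the ZeRO-3 side of this setup corresponds to a DeepSpeed config roughly like the following sketch (the keys are standard DeepSpeed options; the "auto" values and the file name are illustrative assumptions, not copied from my actual run):

```python
# Minimal DeepSpeed ZeRO-3 config sketch (illustrative values only).
# "auto" lets the HF/LLaMA-Factory trainer fill in batch sizes and precision.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_z3_config.json", "w") as f:  # hypothetical file name
    json.dump(ds_config, f, indent=2)
```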
batchsize: 4*4*16 — does that mean gradient_accumulation_steps=4 and per_device_batchsize=4?
We recommend setting per_device_batchsize=1 and gradient_accumulation_steps=16 to keep the global batch size at 256.
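For clarity, both settings give the same global batch size, assuming 16 data-parallel ranks (one per GPU):

```python
# global batch size = per_device_batchsize * gradient_accumulation_steps * num_gpus
num_gpus = 16
print(4 * 4 * num_gpus)   # original setting    -> 256
print(1 * 16 * num_gpus)  # recommended setting -> 256
```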
Under the settings of per_device_batchsize=1 and gradient_accumulation_steps=16, GPU memory utilization is only about 50%. Training completes 1 epoch normally, but an OOM error occurs at around 1.7 epochs. I ran into a similar situation before, without changing the batch size parameters, where an OOM occurred at 1.13 epochs. Is there something special about the CodeQwen model architecture? I have never run into this with other models: the GPU memory utilization is low, yet I have never seen this many OOM errors.
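One thing I can still try, since the error above reports 23.07 GiB reserved by PyTorch but unallocated (which usually points at allocator fragmentation), is the standard PyTorch allocator knob (a sketch; the value is just an example, not something from this run):

```python
# Must be set before the first CUDA allocation, e.g. at the very top of the
# training script or in the launcher's environment.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value
```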
Did you use the same model size and sequence length for the other models? It looks like a simple case of insufficient GPU memory. You could try methods like model parallelism.
I have done several training runs before. For example, when I fine-tuned deepseekcoder-6.7b with 8192 seqlen and a global batch size of 256 (gradient_accumulation_steps=4, per_device_batchsize=4, 16 GPUs) under ZeRO-3, the maximum per-GPU memory usage was around 40-50 GB.
Could you explain the reason for the very uneven GPU memory usage shown in the nvidia-smi output below?
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB Off | 00000000:1F:00.0 Off | 0 |
| N/A 50C P0 116W / 400W | 74831MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB Off | 00000000:25:00.0 Off | 0 |
| N/A 65C P0 148W / 400W | 69291MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB Off | 00000000:50:00.0 Off | 0 |
| N/A 66C P0 125W / 400W | 60269MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB Off | 00000000:55:00.0 Off | 0 |
| N/A 52C P0 125W / 400W | 36859MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB Off | 00000000:90:00.0 Off | 0 |
| N/A 52C P0 147W / 400W | 36783MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB Off | 00000000:95:00.0 Off | 0 |
| N/A 66C P0 163W / 400W | 36961MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB Off | 00000000:CB:00.0 Off | 0 |
| N/A 64C P0 123W / 400W | 60133MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB Off | 00000000:D1:00.0 Off | 0 |
| N/A 50C P0 140W / 400W | 36889MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
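As a rough back-of-envelope check (my own estimate, assuming roughly 7B parameters, bf16 weights and gradients, fp32 Adam states, and even ZeRO-3 sharding across 16 ranks), the sharded model states should only be a few GiB per GPU, so most of what nvidia-smi shows above has to be activations, temporary all-gather buffers, and allocator overhead:

```python
# Back-of-envelope ZeRO-3 memory estimate for a ~7B model on 16 GPUs.
# Assumptions: bf16 weights/gradients, fp32 Adam states, even sharding.
params = 7.25e9   # approximate parameter count (assumed, not measured)
ranks = 16

bytes_weights = 2 * params    # bf16 weights
bytes_grads   = 2 * params    # bf16 gradients
bytes_adam    = 12 * params   # fp32 master weights + exp_avg + exp_avg_sq

sharded_gib = (bytes_weights + bytes_grads + bytes_adam) / ranks / 2**30
print(f"sharded model states per GPU: ~{sharded_gib:.1f} GiB")  # roughly 7 GiB
```

Activation memory grows with per-device batch size and sequence length and is not sharded by ZeRO-3, which is presumably why lowering per_device_batchsize helps even though the model states are already partitioned.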
We have not encountered this kind of situation before. Are there any other processes running?
The issue has been resolved. It turned out that the OOM (out of memory) errors occurred when evaluation was performed during training. The solution was to set do_eval to false in LLaMA-Factory. However, the memory allocation for CodeQwen1.5 is still peculiar: for instance, six GPUs use 60 GB of memory, while the remaining two use 78 GB.
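For anyone hitting the same thing: do_eval is one of the standard Hugging Face TrainingArguments fields that LLaMA-Factory builds on, so the fix amounts to something like this sketch (values other than do_eval are placeholders, not my full config):

```python
from transformers import TrainingArguments

# Sketch only: disable the in-training evaluation that triggered the OOM.
args = TrainingArguments(
    output_dir="out",                 # placeholder
    do_eval=False,                    # no evaluation during training
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
```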
Problem description
CodeQwen1.5-7B uses an abnormally large amount of GPU memory during continued pretraining, and an OOM occurs after training for some time.
System environment
When the OOM first occurred, I was using 2 nodes with 16 GPUs.
Reproduction script
The training framework is LLaMA-Factory-0.7.0.
I have run many training jobs before; normally, training a 7B model with this batch size and cutoff_len does not cause OOM, and nvidia-smi shows that the GPU memory allocation is very uneven.
I am not yet sure whether this is caused by the training framework or by the model architecture; I hope someone can help explain.