[Bug]: Qwen2 moe out of memory #954

Model Series


What are the models used?


What is the scenario where the problem happened?

train with transformers

Is this a known issue?

Information about environment

Log output

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. GPU 1 has a total capacty of 79.33 GiB of which 11.81 MiB is free. Process 3080904 has 79.30 GiB memory in use. Of the allocated memory 77.98 GiB is allocated by PyTorch, and 212.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLO
[2024-09-24 10:50:59,051] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 35) of binary: /opt/conda/bin/python


Hi! When I use transformers to run SFT on Qwen2-57B-A14B on 32 x A100 GPUs with an input length of 2048, it hits an OOM error in the backward pass. Is there something wrong with my settings?

jklj077 commented 2 days ago

For reference, full-parameter fine-tuning of Qwen2-57B-A14B should be possible with 2 x 8 80GB GPUs at a 4K sequence length (estimated minimum). However, you should enable a mixture of tensor/expert/pipeline parallelism, e.g., pp4tp4 or pp2ep8. 4 x 8 80GB GPUs should be preferred.

It is recommended to check whether the training framework you adopted supports those kinds of configurations. Ultimately, it would be best if you could take a closer look at your own code to identify any issues.
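As a rough sanity check on the GPU counts above, here is a back-of-the-envelope sketch of the memory taken by model states alone. It assumes about 57B total parameters (read off the model name) and the common mixed-precision Adam rule of thumb of 16 bytes per parameter (bf16 weights and gradients, plus fp32 master weights and two fp32 Adam moments). It also assumes the states are sharded evenly across GPUs and ignores activations, communication buffers, and fragmentation, so real usage will be higher.

```python
GIB = 1024 ** 3

# Assumption: ~57B total parameters, taken from the model name "57B-A14B".
TOTAL_PARAMS = 57e9

# Assumption: 2 (bf16 weights) + 2 (bf16 grads) + 4 (fp32 master weights)
# + 4 + 4 (fp32 Adam first and second moments) = 16 bytes per parameter.
BYTES_PER_PARAM = 16


def model_state_gib(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Total memory for weights, gradients, and optimizer states, in GiB."""
    return num_params * bytes_per_param / GIB


def per_gpu_state_gib(num_params, num_gpus, bytes_per_param=BYTES_PER_PARAM):
    """Model-state memory per GPU if the states are sharded evenly."""
    return model_state_gib(num_params, bytes_per_param) / num_gpus


if __name__ == "__main__":
    print(f"total model states: {model_state_gib(TOTAL_PARAMS):.1f} GiB")
    for gpus in (16, 32):
        share = per_gpu_state_gib(TOTAL_PARAMS, gpus)
        print(f"{gpus} GPUs: {share:.1f} GiB of model states per 80GB GPU")
```

Under these assumptions, 16 GPUs leave only a modest margin per device for activations, which is consistent with 2 x 8 being described as an estimated minimum.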

FL77N commented 2 days ago

Thanks for your quick reply! My training framework is transformers, which only supports data parallelism and the DeepSpeed ZeRO-3 strategy. However, when I run SFT with 8 x 8 80GB GPUs and a 2K sequence length, it still hits OOM. Maybe it is best to train it with a framework like Megatron-LM.
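For readers unfamiliar with the ZeRO-3 setup mentioned above, a minimal sketch of such a configuration is shown below. This is not the poster's actual config, which is not given in the thread. The keys are standard DeepSpeed options, and the `"auto"` values assume the Hugging Face Trainer integration, which fills them in from its own training arguments. Whether this setup fits Qwen2-57B-A14B in memory is exactly what the thread leaves open.

```python
import json

# Illustrative ZeRO-3 config (an assumption, not the poster's settings).
# "auto" values are resolved by the Hugging Face Trainer integration.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,                  # shard weights, gradients, optimizer states
        "overlap_comm": True,        # overlap communication with computation
        "contiguous_gradients": True,
        # Gather full 16-bit weights when saving, so checkpoints are usable
        # without the sharded optimizer state.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Serialize to JSON; the result can be saved to a file and passed to the
# Trainer through the `deepspeed` field of `TrainingArguments`.
ds_config_json = json.dumps(ds_config, indent=2)

if __name__ == "__main__":
    print(ds_config_json)
```

ZeRO-3 shards model states across data-parallel ranks but does not shard activations, so activation memory still scales with sequence length and micro-batch size on each GPU.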