OFA-Sys / gsm8k-ScRel

Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
https://arxiv.org/abs/2308.01825

70B training fails #16

Closed: kumar-shridhar closed this issue 11 months ago

kumar-shridhar commented 11 months ago

Hi,

I am trying to run the script to finetune the 70B model and I am getting this error: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2699234) of binary:

Any idea what the issue could be? I am able to train 13B models, and I follow all the dependency versions mentioned in past issues.

Thanks.

GanjinZero commented 11 months ago

It may be an out-of-memory error on host RAM. What is your RAM size? Please give a more detailed description of your environment and the full traces.
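A quick way to check the host-RAM hypothesis is to log resident and available memory on each rank around checkpoint loading. A minimal sketch, assuming psutil is installed (an extra dependency not mentioned in this thread) and using a made-up helper name:

# Hypothetical helper: log host RAM usage per rank to spot CPU OOM while
# the 70B checkpoint is being loaded or sharded.
import os
import psutil

def log_host_memory(tag: str = "") -> None:
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3   # this rank's resident memory
    avail_gb = psutil.virtual_memory().available / 1024**3             # free RAM left on the node
    rank = os.environ.get("LOCAL_RANK", "?")
    print(f"[rank {rank}] {tag}: rss={rss_gb:.1f} GiB, available={avail_gb:.1f} GiB", flush=True)

# Example usage around model loading:
# log_host_memory("before from_pretrained")
# model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
# log_host_memory("after from_pretrained")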

kumar-shridhar commented 11 months ago

I am using 4 nodes with 8 GPUs (A100 80GB) each. RAM is 1.1 TB per node with 96 CPUs.

This is my error:

[2023-10-10 01:06:24,022] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,067] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,090] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,151] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,163] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,165] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,190] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,319] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327337 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327338 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327339 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327340 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327341 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327343 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327344 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 3327342) of binary:

My environment is as follows:

numpy==1.25.2
torch==2.0.1
torchvision==0.15.2
triton==2.0.0
deepspeed
sentencepiece
wandb
accelerate==0.21.0
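Note that the exitcode: -9 in the trace above is the negative of the terminating signal, i.e. SIGKILL, which on Linux most commonly comes from the kernel OOM killer ending a rank that exhausted host RAM, consistent with the out-of-memory suspicion above. A small sketch (hypothetical helper name) that decodes such codes:

import signal

def describe_exitcode(exitcode: int) -> str:
    # torchrun reports a rank killed by signal N as exitcode -N (the subprocess convention).
    if exitcode < 0:
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"

print(describe_exitcode(-9))  # killed by SIGKILL (typically the Linux OOM killer)
print(describe_exitcode(1))   # exited with status 1 (ordinary Python error/exit)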

Yuanhy1997 commented 11 months ago

It seems that the communication between GPUs or nodes is failing. Also, what is your transformers version?
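To isolate whether inter-node communication is really the problem, one option is a tiny all-reduce smoke test launched with the same torchrun/NCCL setup as the training script. A minimal sketch, not from this repo (the file name comm_check.py and the launch flags shown in the comment are only an illustration):

# Launch with the same multi-node torchrun command used for training, e.g.:
#   torchrun --nnodes 4 --nproc_per_node 8 --rdzv_backend c10d --rdzv_endpoint <master>:<port> comm_check.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # sums 1 from every rank; the result equals the world size
    if dist.get_rank() == 0:
        print(f"all_reduce OK, world size = {int(x.item())}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()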

kumar-shridhar commented 11 months ago

The strange part is that node communication works if I run the 13B model over multiple nodes, but it fails for 70B. My transformers version is 4.29.2.

Yuanhy1997 commented 11 months ago

Can you try 33B? Another possible reason: does 4.29.2 support 70B LLaMA 2? I suspect the transformers version is the cause.

kumar-shridhar commented 11 months ago

You were right. The transformers version was unable to recognize the 70B LLaMA 2 model. Upgrading it solved the issue. Thanks again.
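For anyone hitting the same failure: LLaMA-2-70B uses grouped-query attention, which the transformers LLaMA implementation only handles in newer releases (roughly 4.31.0 onward, an assumption based on the release notes), so 4.29.2 cannot load the 70B checkpoint correctly. A minimal version guard one could drop into the training script:

from packaging import version

import transformers

MIN_TRANSFORMERS = "4.31.0"  # assumed first release with LLaMA 2 (70B / GQA) support

if version.parse(transformers.__version__) < version.parse(MIN_TRANSFORMERS):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for LLaMA-2-70B; "
        f"upgrade with: pip install -U 'transformers>={MIN_TRANSFORMERS}'"
    )
print(f"transformers {transformers.__version__} is recent enough for LLaMA-2-70B")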