OFA-Sys / gsm8k-ScRel

Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
https://arxiv.org/abs/2308.01825

70B training fails #16

Closed: kumar-shridhar closed this issue 1 year ago

kumar-shridhar commented 1 year ago

Hi,

I am trying to run the script to finetune the 70B model and I am getting this error:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2699234) of binary:

Any idea what the issue could be? I am able to train 13B models, and I followed all the dependency versions mentioned in past issues.

Thanks.

GanjinZero commented 1 year ago

It may be running out of RAM. What is your RAM size? Please share more details about your environment and the full trace.
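
If it helps, here is a minimal sketch (standard-library Python, Linux only, nothing specific to this repo) for reporting total host RAM:

```python
import os

# Total physical RAM on a Linux host, reported in GiB; handy for checking
# whether loading full-precision 70B checkpoints into CPU memory is feasible.
total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print(f"Total RAM: {total_bytes / 1024**3:.1f} GiB")
```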

kumar-shridhar commented 1 year ago

I am using 4 nodes with 8 GPUs (A100 80GB) each. Each node has 1.1 TB of RAM and 96 CPUs.

This is my error:

[2023-10-10 01:06:24,022] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,067] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,090] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,151] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,163] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,165] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,190] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,319] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327337 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327338 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327339 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327340 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327341 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327343 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327344 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 3327342) of binary:
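
A note on the trace: torch.distributed.elastic reports a worker killed by a signal as a negative exit code, so -9 corresponds to SIGKILL, the signal the kernel OOM killer sends, which is consistent with the out-of-memory suspicion above. A minimal sketch of the mapping (plain standard-library Python, not specific to this setup):

```python
import signal

# torch.distributed.elastic reports a signal-terminated worker as the negative
# of the signal number, so exitcode -9 means the process received SIGKILL
# (the signal the kernel OOM killer uses).
print(signal.Signals(9).name)  # -> SIGKILL
```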

My environment is as follows:

numpy==1.25.2
torch==2.0.1
torchvision==0.15.2
triton==2.0.0
deepspeed
sentencepiece
wandb
accelerate==0.21.0

Yuanhy1997 commented 1 year ago

It seems that the communication between GPUs or nodes is failing. Also, what is your transformers version?

kumar-shridhar commented 1 year ago

The strange part is that inter-node communication works when I run the 13B model over multiple nodes, but it fails for 70B. My transformers version is 4.29.2.

Yuanhy1997 commented 1 year ago

Can you try 33B? Another possible reason: does 4.29.2 support 70B LLaMA-2? I suspect the transformers version is the cause.

kumar-shridhar commented 1 year ago

You were right. The transformers version was unable to recognize the 70B LLaMA-2 model. Upgrading it solved the issue. Thanks again.
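
For anyone hitting the same error: LLaMA-2 70B uses grouped-query attention, which, as far as I know, transformers only parses from the config starting around 4.31.0. A minimal sanity check along these lines (the checkpoint path is a placeholder) can confirm whether the installed version recognizes it:

```python
import transformers
from transformers import AutoConfig

# Print the installed transformers version and check that the config's
# grouped-query-attention field is recognized. The checkpoint path below is a
# placeholder for wherever the LLaMA-2 70B weights live.
print(transformers.__version__)
cfg = AutoConfig.from_pretrained("/path/to/llama-2-70b-hf")
print(getattr(cfg, "num_key_value_heads", "not recognized"))  # 70B should report 8
```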