It may be running out of host RAM. What is your RAM size? Please provide more details about your environment and the full traces.
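One minimal way to gather those details is a short version dump; this is just a sketch and assumes the packages listed further down in this thread (torch, transformers, deepspeed, accelerate) are importable:

# Prints package versions and visible GPU topology for a bug report.
# Assumes torch, transformers, deepspeed, and accelerate are installed.
import torch
import transformers
import deepspeed
import accelerate

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)
print("accelerate:", accelerate.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())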
I am using 4 nodes with 8 GPUs (A100 80GB) each. RAM is 1.1 TB per node with 96 CPUs.
This is my error:
[2023-10-10 01:06:24,022] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,067] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,090] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,151] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,163] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,165] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,190] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-10 01:06:24,319] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327337 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327338 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327339 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327340 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327341 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327343 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3327344 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 3327342) of binary:
My environment is as follows:
numpy==1.25.2
torch==2.0.1
torchvision==0.15.2
triton==2.0.0
deepspeed
sentencepiece
wandb
accelerate==0.21.0
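Exit code -9 in the trace above corresponds to SIGKILL, which on Linux often means the kernel OOM killer reclaimed host RAM while the ranks were loading weights. A rough way to watch host memory while the job starts is sketched below; psutil is an assumed extra dependency, not part of the environment above:

# Logs host RAM usage at a fixed interval so a SIGKILL (exit code -9)
# can be correlated with memory pressure. psutil is an assumed dependency.
import time
import psutil

def log_host_memory(interval_s: float = 1.0, duration_s: float = 600.0) -> None:
    # Print used and available host memory until duration_s elapses.
    end = time.time() + duration_s
    while time.time() < end:
        mem = psutil.virtual_memory()
        print(f"used={mem.used / 1e9:.1f} GB "
              f"available={mem.available / 1e9:.1f} GB "
              f"percent={mem.percent:.1f}%")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_host_memory()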
It seems that the communication between GPUs or between nodes is failing. What is your transformers version?
The strange part is that the inter-node communication works when I run the 13B model over multiple nodes, but it fails for 70B.
My transformers version is 4.29.2.
Can you try 33B? Another possible reason: does 4.29.2 support 70B Llama 2? I suspect the transformers version is the cause.
You were right. That transformers version was unable to recognize the 70B Llama 2 model; upgrading it solved the issue. Thanks again.
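A quick way to confirm that an installed transformers release recognizes the 70B checkpoint is to load only its config, without downloading any weights; a minimal sketch, assuming the gated hub ID meta-llama/Llama-2-70b-hf and a prior huggingface-cli login:

# Loads only the model config to verify that the installed transformers
# version recognizes the Llama 2 70B architecture. The hub ID and an
# authenticated Hugging Face token are assumptions.
import transformers
from transformers import AutoConfig

print("transformers version:", transformers.__version__)
config = AutoConfig.from_pretrained("meta-llama/Llama-2-70b-hf")
print("model_type:", config.model_type)
print("hidden_size:", config.hidden_size)
print("num_hidden_layers:", config.num_hidden_layers)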
Hi,
I am trying to run the script to finetune the 70B model, and I am getting this error:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2699234) of binary:
Any idea what could be the issue? I am able to train 13B models, and I follow all the dependency versions mentioned in past issues.
Thanks.