Open hudengjunai opened 1 year ago
Check the lock file in the extensions directory and remove it if it exists.
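The advice above can be sketched as a small script. This is a hedged, minimal sketch, not the repo's own code: it assumes the lock files are ninja build locks named `lock` in the PyTorch extensions cache (by default `~/.cache/torch_extensions`, or wherever `TORCH_EXTENSIONS_DIR` points), which is where DeepSpeed JIT-compiles its ops and where a stale lock from a crashed run causes the next run to hang.

```python
import os
from pathlib import Path


def remove_stale_locks(ext_dir: Path) -> list:
    """Delete every file named 'lock' under ext_dir; return the removed paths."""
    removed = []
    for lock in ext_dir.rglob("lock"):
        lock.unlink()
        removed.append(lock)
    return removed


if __name__ == "__main__":
    # Assumed default cache location; override with TORCH_EXTENSIONS_DIR if set.
    default = Path.home() / ".cache" / "torch_extensions"
    ext_dir = Path(os.environ.get("TORCH_EXTENSIONS_DIR", default))
    if ext_dir.exists():
        for lock in remove_stale_locks(ext_dir):
            print(f"removed stale lock: {lock}")
```

Only run this when no training job is actively compiling extensions, since a live lock is held on purpose.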
Thank you for your reply. I have finally succeeded in running the 7B pipeline-parallel training, and the batch size is much bigger than in the original ZeRO-DP mode.
Currently I have tried mp=4 for llama-13b with the same code, but I hit a CPU memory OOM. I am now trying 2 nodes of 4xA100-40G (8 GPUs in total) to run llama-13B.
13B with two nodes succeeded.
@hudengjunai I used 2 nodes with 8 GPUs for training, but I get this error:
RuntimeError: Timed out initializing process group in store based barrier on rank: 11, for key: store_based_barrier_key:37 (world_size=16, worker_count=8, timeout=0:30:00)
How can I solve this problem?
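The error message itself points at the likely cause: the rendezvous expects world_size=16 ranks but only worker_count=8 ever joined, i.e. one node's workers never connected to the master before the 30-minute barrier timeout. A hypothetical helper like the one below (not part of any library; the function name is mine) makes the mismatch explicit when reading such logs:

```python
import re


def diagnose_barrier_timeout(msg: str) -> str:
    """Parse torch's store-based barrier timeout message and report how many
    ranks never joined. A common cause is that a node cannot reach
    MASTER_ADDR:MASTER_PORT, or world_size != nnodes * gpus_per_node."""
    m = re.search(r"world_size=(\d+), worker_count=(\d+)", msg)
    if not m:
        return "not a store-based barrier timeout message"
    world, joined = int(m.group(1)), int(m.group(2))
    missing = world - joined
    return (f"{joined}/{world} ranks joined the rendezvous; {missing} ranks "
            f"never connected -- check that every node can reach "
            f"MASTER_ADDR:MASTER_PORT and that world_size matches "
            f"nnodes * gpus_per_node")
```

For the message above this reports 8/16 ranks joined, which suggests the second node never reached the master at all, so raising the timeout alone would not help.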
Can llama-7b run on a P40 24G GPU? My process keeps getting killed.
It looks like you are running out of memory.
Have you run into this problem too?
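When a training process is silently "killed", the Linux OOM killer acting on host RAM is the usual suspect. A minimal sketch for checking available host memory before loading a checkpoint (assuming a Linux /proc/meminfo; the rough figure of ~13-14 GB for llama-7b fp16 weights is an estimate, before any optimizer state):

```python
def available_mem_gb(meminfo_path: str = "/proc/meminfo") -> float:
    """Return MemAvailable from /proc/meminfo in GB (reported in kB)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)
    raise RuntimeError("MemAvailable not found in meminfo")
```

You can also confirm an OOM kill after the fact by looking for "Out of memory" entries in the kernel log (dmesg).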
Are you using the PCIe version with an Ethernet interconnect, or an IB (InfiniBand) network?
@hudengjunai Could you please share your machine configuration and your DeepSpeed and Transformers configurations for training llama2-7b with pipeline parallelism? I am getting CUDA OOM on 4xA100 (40G) with the default args.
Thank you for your implementation of pipeline-parallel training for the llama model. I encountered a hang when running 7B training on a 4xA40 machine. Could you provide a Dockerfile that is known to run on such a machine?