HuangLK / transpeeder

train llama on a single A100 80G node using 🤗 transformers and 🚀 Deepspeed Pipeline Parallelism
Apache License 2.0

Running 7B succeeded. Next: 30B #22

Open hudengjunai opened 1 year ago

hudengjunai commented 1 year ago

image

Thank you for your implementation of pipeline parallelism for llama model training. I encountered a hang when running 7B training on a 4xA40 machine. Can you provide a Dockerfile that can run on such a machine?

HuangLK commented 1 year ago

Check for a lock file in the extensions directory and remove it if it exists.
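For context: DeepSpeed JIT-builds its fused ops through PyTorch's C++ extension mechanism, and an interrupted build can leave a stale lock file behind that makes later runs hang at startup. A minimal sketch for clearing such locks, assuming the default `~/.cache/torch_extensions` build cache (the `TORCH_EXTENSIONS_DIR` environment variable overrides it) and the conventional file name `lock`:

```python
import os
from pathlib import Path

# Default build cache used by torch.utils.cpp_extension / DeepSpeed op builders;
# TORCH_EXTENSIONS_DIR overrides it if set.
ext_dir = Path(os.environ.get("TORCH_EXTENSIONS_DIR",
                              Path.home() / ".cache" / "torch_extensions"))

# Remove any leftover lock files from interrupted JIT builds.
if ext_dir.exists():
    for lock in ext_dir.rglob("lock"):
        print(f"removing stale lock: {lock}")
        lock.unlink()
```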

hudengjunai commented 1 year ago

> Check for a lock file in the extensions directory and remove it if it exists.

Thank you for your reply. I finally succeeded in running 7B with pipeline parallelism, and the batch size is much bigger than in the original ZeRO-DP mode.

Currently I have tried mp=4 for llama-13b with the same code, but I hit a CPU memory OOM. I am now trying 2 nodes of 4xA100-40G (8 GPUs) to run llama-13B.
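For context on the CPU OOM: if every rank materializes the full fp16 checkpoint in host memory before it is partitioned (an assumption about the loading path, not something confirmed for this repo), the required RAM scales with the number of ranks per node. A rough back-of-the-envelope sketch:

```python
def host_ram_gb(n_params_b: float, bytes_per_param: int = 2,
                ranks_per_node: int = 4, copies_per_rank: int = 1) -> float:
    """Rough host-RAM estimate if each rank loads the full checkpoint.

    n_params_b      -- model size in billions of parameters
    bytes_per_param -- 2 for fp16/bf16 weights
    ranks_per_node  -- processes loading the checkpoint on one node
    copies_per_rank -- transient extra copies made during conversion
    """
    return n_params_b * 1e9 * bytes_per_param * ranks_per_node * copies_per_rank / 1e9

# llama-13B in fp16 with mp=4 on one node: roughly 4 x 26 GB = ~104 GB of RAM
# before any Python/CUDA overhead -- enough to OOM a 64-128 GB machine.
print(f"{host_ram_gb(13):.0f} GB")
```

Spreading the model across two nodes halves the per-node figure, which is consistent with the two-node run succeeding below.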

hudengjunai commented 1 year ago

image

hudengjunai commented 1 year ago

13B with two nodes succeeded.

lw3259111 commented 1 year ago

@hudengjunai I use 2 nodes with 8 GPUs each for training, but I get this error:

RuntimeError: Timed out initializing process group in store based barrier on rank: 11, for key: store_based_barrier_key:37 (world_size=16, worker_count=8, timeout=0:30:00)

How can I solve this problem?
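That error usually means only 8 of the 16 ranks ever reached the store-based barrier, i.e. one node never joined the rendezvous; a wrong MASTER_ADDR/MASTER_PORT, a firewall, or NCCL picking the wrong network interface are common causes. A minimal connectivity check (a sketch, assuming it is launched on both nodes with torchrun so that the usual environment variables are set by the launcher):

```python
# Launch on each node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#            --master_addr=<node0-ip> --master_port=29500 check_dist.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")      # reads MASTER_ADDR/MASTER_PORT from the launcher
rank = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# If this all_reduce completes on every rank, cross-node rendezvous and NCCL traffic work.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}/{dist.get_world_size()}: all_reduce ok, sum={t.item()}")
dist.destroy_process_group()
```

If init_process_group itself times out, the nodes cannot reach the master address/port; if it succeeds but the all_reduce hangs, pinning `NCCL_SOCKET_IFNAME` to the interface that actually connects the two nodes is worth trying.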

iMountTai commented 1 year ago

Can llama-7b run on a P40 24G GPU? On my side the process keeps getting killed.

lw3259111 commented 1 year ago

> Can llama-7b run on a P40 24G GPU? On my side the process keeps getting killed.

It looks like you are running out of (host) memory.

iMountTai commented 1 year ago

> Can llama-7b run on a P40 24G GPU? On my side the process keeps getting killed.

> It looks like you are running out of (host) memory.

Have you ever run into this problem? image

lw3259111 commented 1 year ago

> Check for a lock file in the extensions directory and remove it if it exists.
>
> Thank you for your reply. I finally succeeded in running 7B with pipeline parallelism, and the batch size is much bigger than in the original ZeRO-DP mode.
>
> Currently I have tried mp=4 for llama-13b with the same code, but I hit a CPU memory OOM. I am now trying 2 nodes of 4xA100-40G (8 GPUs) to run llama-13B.

Are you using the PCIe version of the cards over Ethernet, or an InfiniBand (IB) network?

OAfzal commented 10 months ago

@hudengjunai could you please share your machine configuration and your DeepSpeed and transformers configurations for training llama2-7b with pipeline parallelism? I am getting CUDA OOM on 4xA100 (40G) with the default args.

image
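No configuration was posted in reply, but a common first step for CUDA OOM with pipeline parallelism is to shrink the per-GPU micro-batch size, run in bf16, and rely on activation checkpointing. A hypothetical DeepSpeed config fragment along those lines (these are standard DeepSpeed option names, not this repo's actual defaults; note that pipeline parallelism in DeepSpeed only combines with ZeRO stage 0/1):

```python
# Hypothetical memory-saving settings -- not transpeeder's actual defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # smallest per-GPU activation footprint
    "gradient_accumulation_steps": 32,     # keep the effective global batch size up
    "bf16": {"enabled": True},             # halve activation/weight memory vs fp32
    "zero_optimization": {"stage": 1},     # ZeRO-2/3 do not combine with pipeline parallelism
    "activation_checkpointing": {
        "partition_activations": False,
        "contiguous_memory_optimization": False,
    },
}
```

Whether activation checkpointing actually fires also depends on the model/training code enabling it; the config section above only tunes how DeepSpeed performs it.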