SparkJiao / llama-pipeline-parallel

A prototype repo for hybrid training with pipeline parallelism and distributed data parallelism, with comments on the core code snippets. Feel free to copy the code and open discussions about any problems you encounter.

Implementing Pipeline Parallelism with LLaMA Models and Utilizing deepspeed for Execution? #7

Open ShuaipengWu opened 4 months ago

ShuaipengWu commented 4 months ago

Hi,

I'm currently experimenting with fine-tuning some small LLaMA models (LLaMA2-7b) and I'm interested in using pipeline parallelism. However, there are few examples available for reference. I also looked into the ChatGLM pipeline-parallelism repository you referenced (chatglm finetuning) and tried to implement the pipeline layers in a similar way to ChatGLM3. Unfortunately, I ran into several failures.

I noticed that you have a set of wrappers in llama_ds_mp_wrap.py that resemble the reference implementation, but it seems they are not used in the end. Is this approach not feasible?

Consequently, I came back to LLaMA and tried to put together a runnable pipeline-parallel demo. Is launching trainer_base_ds_mp.py with the deepspeed command the correct way to run this code?

Thanks in advance.

SparkJiao commented 4 months ago

Hi, thanks for your interest.

As for why the wrappers are not used: the wrappers implemented in the ChatGLM repo require the original model as a parameter to initialize the pipe layers. I understand that this way you do not need to first convert the HF weights into the DeepSpeed format, but then every rank has to load the complete weights, which can cause OOM (in CPU memory: imagine loading a 70B model with 8 processes, so you end up with 8 copies). My implementation instead works like this: (1) use the pre-trained config to initialize each specific layer, and (2) load the corresponding weights from disk (this is managed by DeepSpeed), so only one complete copy of the weights is needed.
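To make this concrete, here is a rough sketch of that pattern. The wrapper class below is a hypothetical stand-in for the ones in llama_ds_mp_wrap.py, the model name, stage count, and checkpoint path are placeholders, and the embedding/LM-head stages as well as attention mask handling are omitted for brevity:

```python
import torch
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec
from transformers import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


class DecoderLayerPipe(torch.nn.Module):
    """Hypothetical minimal wrapper: the layer is built from the config alone,
    so construction never touches the pre-trained weights."""

    def __init__(self, config, layer_idx):
        super().__init__()
        # Newer transformers releases also take layer_idx; older ones take only the config.
        self.layer = LlamaDecoderLayer(config, layer_idx)

    def forward(self, hidden_states):
        # A real pipe layer must also thread the attention mask / position ids through.
        return self.layer(hidden_states)[0]


deepspeed.init_distributed()

config = LlamaConfig.from_pretrained("meta-llama/Llama-2-7b-hf")

# LayerSpec defers construction: each rank instantiates only the layers of its own
# pipeline stage, so no rank ever holds a complete copy of the model in CPU memory.
specs = [LayerSpec(DecoderLayerPipe, config, i) for i in range(config.num_hidden_layers)]
model = PipelineModule(layers=specs, num_stages=4)

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config="ds_config.json"
)
# The HF weights must first be converted into a DeepSpeed checkpoint layout; each rank
# then reads only the shard belonging to its own layers from disk.
engine.load_checkpoint("converted_llama_ds_ckpt/", load_module_only=True)
```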

For the launch, yes, simply calling deepspeed trainer_base_ds_mp.py is fine, but you need to specify the number of pipeline parallel stages in the config yourself.
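For reference, a hypothetical launch could look like the following (the GPU count is a placeholder, and the exact config option that carries the number of pipeline stages depends on this repo's own config files; it ultimately becomes the num_stages argument of the PipelineModule):

```bash
# 8 GPUs total; with 4 pipeline stages this leaves 2-way data parallelism on top.
deepspeed --num_gpus 8 trainer_base_ds_mp.py
```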

ShuaipengWu commented 4 months ago

Thank you for your response. So the ChatGLM-style wrapping is equivalent to loading all parameters first (each process loads the whole model into memory) and then allocating the parameters to the GPUs (e.g. layers 0-3 onto GPU0 and layers 4-7 onto GPU1). That approach may be suitable for smaller models, but for particularly large models (70B) your approach works better: first configure which layers need which parameters, and then have DeepSpeed load the corresponding parameters from disk directly into the corresponding layers (into GPU memory). Is my understanding correct?

SparkJiao commented 4 months ago

Yes. The two implementations should not differ much in practice.

That said, I remember also running into some problems when using that implementation; I raised an issue about it previously. I think it may be due to internal implementation differences between Llama and ChatGLM. For simplicity, you could just use my approach but change the __init__ method to load weights from the whole model, which helps with fast reproduction.
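A rough sketch of that shortcut, under the assumption that every rank can afford to hold the full 7B model in CPU memory (the class and variable names are illustrative, not the repo's actual code):

```python
import torch
from transformers import LlamaForCausalLM

# Every rank loads the complete HF model once on CPU: fine for 7B, not for 70B.
full_model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)


class DecoderLayerPipe(torch.nn.Module):
    """Same wrapper shape as before, but __init__ reuses an already-loaded layer
    instead of building from the config and loading weights from disk later."""

    def __init__(self, hf_model, layer_idx):
        super().__init__()
        self.layer = hf_model.model.layers[layer_idx]

    def forward(self, hidden_states):
        # As before, a real pipe layer must also pass the attention mask / position ids.
        return self.layer(hidden_states)[0]
```

With this, the DeepSpeed-format checkpoint conversion step can be skipped entirely, at the cost of one full CPU copy of the weights per process.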