bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Generation server using HF accelerate and DS inference #321

Closed mayank31398 closed 2 years ago

mayank31398 commented 2 years ago

This PR depends on https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/308. There are some redundant methods in some scripts that can be removed once that PR is merged into the main branch. This PR adds scripts for creating a generation server using both HF accelerate and DeepSpeed inference.

mayank31398 commented 2 years ago

@stas00 @RezaYazdaniAminabadi Currently, the batch size is fixed and equal to 1. The HF accelerate script is working correctly. Working on DS inference.

pai4451 commented 2 years ago

@mayank31398 Did you test the bloom-ds-server.py code before? When using DeepSpeed, world_size processes will be launched. How can you create a Flask app on the same port for all of those processes?

mayank31398 commented 2 years ago

Hi @pai4451, you can't. This code is still not working; I am working on changing that. I'll have to get around this with a bit of a hack. It's not easy to serve DS inference using Flask.

mayank31398 commented 2 years ago

I have pulled the latest bloom-inference into this branch.

mayank31398 commented 2 years ago

@pai4451 I am open to suggestions if you have any. I was thinking of running a server for each of the 8 processes: the 0th process would receive a generate request from the user and call the servers of the other processes. But I am not sure how much overhead this would cause.
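A minimal sketch of that idea, assuming the deepspeed launcher sets the usual RANK environment variable (the base port and endpoint name here are made up for illustration): each process starts its own Flask server on its own port, so the ports don't clash.

```python
# Minimal sketch of the idea above, not the PR's actual code: every process
# launched by the `deepspeed` launcher binds its own port derived from its
# rank, so all 8 Flask servers can coexist on one node.
import os

from flask import Flask, request, jsonify

BASE_PORT = 5000  # hypothetical base port
rank = int(os.getenv("RANK", "0"))  # set by the deepspeed launcher

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    # In the real script this is where model.generate() would run on this rank.
    payload = request.get_json()
    return jsonify({"rank": rank, "received": payload})

if __name__ == "__main__":
    # Rank 0 serves the user-facing endpoint; the other ranks listen on
    # localhost at BASE_PORT + rank and only handle forwarded requests.
    host = "0.0.0.0" if rank == 0 else "127.0.0.1"
    app.run(host=host, port=BASE_PORT + rank)
```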

pai4451 commented 2 years ago

@mayank31398 Thanks, I'm also working on serving BLOOM with DeepSpeed. I think this solution might work, but in terms of serving we have to consider the maintenance cost. The difficult part, I think, is keeping all the processes stable (alive and synchronized).

mayank31398 commented 2 years ago

@pai4451 can you give this latest code a try? I am able to run the server, but the code gets stuck on model.generate() and I don't really understand why.

mayank31398 commented 2 years ago

What the current code is doing: it creates 8 servers, 1 on the main HOST:PORT and the other 7 on 127.0.0.1:PORT+1, 127.0.0.1:PORT+2, ... The main server sends a call to the other 7 servers to run the generate method. I see that the code just gets stuck at line 165 after the first request is sent.

The code is working up to line 164 though (when a request is sent). I see the tokenized input on all 8 processes.

@pai4451
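A rough sketch of that forwarding step, with hypothetical names carried over from the sketch above: rank 0 fires the request at the 7 worker servers without waiting for their replies (they can only answer after generate() finishes, and generate() is a collective call that rank 0 must also enter), then runs generate() itself.

```python
# Rough sketch of the forwarding described above (hypothetical names, not the
# exact PR code). The posts are fired asynchronously because the workers only
# reply once model.generate() returns, and that call needs all ranks.
import concurrent.futures

import requests

BASE_PORT = 5000  # hypothetical, matches the sketch above
WORLD_SIZE = 8

def forward_to_workers(payload: dict):
    """Send the generate request to the worker servers on ranks 1..7."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=WORLD_SIZE - 1)
    futures = [
        pool.submit(
            requests.post,
            f"http://127.0.0.1:{BASE_PORT + worker_rank}/generate",
            json=payload,
            timeout=600,
        )
        for worker_rank in range(1, WORLD_SIZE)
    ]
    return pool, futures

# Inside rank 0's /generate handler one would call forward_to_workers(payload),
# run model.generate(...) locally so the collective can complete, and only then
# wait on the futures and assemble the response.
```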

pai4451 commented 2 years ago

@pai4451 can you give this latest code a try? I am able to run the server, but the code gets stuck on model.generate() and I don't really understand why.

@mayank31398 I also get stuck on the line model.generate(). Maybe some processes failed to communicate with the others, or the processes are not synchronized? I suspect the way the server is launched via deepspeed might be causing inter-process communication problems.
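For what it's worth, a common pattern that sidesteps the per-rank HTTP servers is to run the web server only on rank 0 and broadcast each request to the other ranks with torch.distributed, so every rank enters generate() together. A minimal sketch (not this PR's code, names are illustrative):

```python
# Minimal sketch of the broadcast pattern (not this PR's code): only rank 0
# runs the web server; ranks 1..N-1 loop on a broadcast and call generate()
# whenever rank 0 does, which keeps the collective in sync.
import torch.distributed as dist

def broadcast_request(payload=None):
    """Rank 0 passes the request payload; the other ranks pass None and receive it."""
    obj = [payload]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]

def worker_loop(model, tokenizer):
    # Ranks 1..N-1: block until rank 0 broadcasts a request, then generate.
    while True:
        payload = broadcast_request()
        inputs = tokenizer(payload["text"], return_tensors="pt").to("cuda")
        model.generate(**inputs, max_new_tokens=payload.get("max_new_tokens", 100))
```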

mayank31398 commented 2 years ago

@pai4451 The DS inference server is working now. I have deployed it using DeepSpeed MII. This is a new library just released by the DeepSpeed team. ❤️

You can use the scripts now. Instructions are in the README.
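For reference, a DeepSpeed-MII deployment at the time looked roughly like the snippet below; the argument names follow the early MII examples and may differ from the actual scripts in this PR, so treat it as a sketch.

```python
# Rough sketch of a DeepSpeed-MII deployment (early MII API; argument names
# may differ from the scripts in this PR and from newer MII releases).
import mii

# Stand up the persistent deployment: MII loads the model with DS-inference
# kernels behind a gRPC server.
mii.deploy(
    task="text-generation",
    model="bigscience/bloom",
    deployment_name="bloom_deployment",
    mii_config={"dtype": "fp16", "tensor_parallel": 8},
)

# Query it from a client process.
generator = mii.mii_query_handle("bloom_deployment")
result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=100)
print(result)
```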

mayank31398 commented 2 years ago

@stas00, I would like to contribute this to the bloom-inference branch if it's all right. Currently, the scripts only work with batch size = 1. Two scripts have been added (with a little code refactoring of the other scripts). I am working on increasing the batch size 🤗 now.

pai4451 commented 2 years ago

@pai4451 The DS inference server is working now. I have deployed it using DeepSpeed MII. This is a new library just released by the DeepSpeed team. ❤️

You can use the scripts now. Instructions are in the README.

Do you think bloom-ds-server.py can be run on two nodes? With my current hardware limits, I have to use two nodes to fit the entire BLOOM model.

mayank31398 commented 2 years ago

I am not sure. I have tested on 1 node with 8 x 80GB A100 GPUs. Even if you can run it on 2 nodes, the original Megatron-LM paper doesn't recommend spanning tensor parallelism across nodes, since it drastically reduces performance.
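For context, the single-node setup boils down to sharding the model across the node's 8 GPUs with DS-inference tensor parallelism, roughly as below (a sketch, not the actual PR script):

```python
# Sketch of the single-node, 8-GPU tensor-parallel setup (not the exact PR
# script). Run with the deepspeed launcher so 8 processes are spawned, e.g.:
#   deepspeed --num_gpus 8 this_script.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# mp_size=8 shards the weights across the 8 GPUs of this node; spanning this
# across nodes would put tensor parallelism on the interconnect, which is what
# the Megatron-LM paper advises against.
model = deepspeed.init_inference(
    model,
    mp_size=8,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
```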

mayank31398 commented 2 years ago

I screwed up this PR ❤️ @pai4451

mayank31398 commented 2 years ago

Moving to https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/325

stas00 commented 2 years ago

@stas00, I would like to contribute this to the bloom-inference branch if it's all right.

Well, ideally all of this should go directly to https://github.com/huggingface/transformers/tree/main/examples/research_projects/bloom-inference (that last directory doesn't exist yet),

so the bloom-inference branch here should be moved there as well.

Does your code depend on the script under the bloom-inference branch? If not, perhaps open a separate PR into transformers and tag me on it?

At some point I will be doing the same for the bloom-inference branch.

mayank31398 commented 2 years ago

Well, no @stas00, but it has a lot of duplicate code for now. That's why re-using the same methods across scripts would be better. Also, I am not able to use DS inference with batch size > 1; I still get an illegal memory access. After the DS fix, batch size = 1 started working.

Is it possible this is caused by the CUDA version I am using (11.6)? What CUDA environment are you guys using? Also, is PyTorch built from source, and which version?
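For comparing environments, something like this is enough to report the relevant versions (plain version introspection, nothing specific to this PR):

```python
# Quick environment report for comparing setups (plain version introspection,
# nothing specific to this PR).
import torch
import deepspeed
import transformers

print("torch          :", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("deepspeed      :", deepspeed.__version__)
print("transformers   :", transformers.__version__)
print("GPU            :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```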

mayank31398 commented 2 years ago

Also, the memory leak in HF accelerate is not seen by @sgugger, so I am not sure why it is happening in my environment.

stas00 commented 2 years ago

I suppose we could start turning the scripts into small libraries that the scripts would pull in.

Would it help if I merged the bloom-inference branch, you rebased this PR on it, and then started converting the scripts into libs and re-using the code?
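A rough sketch of what that could look like, with entirely hypothetical module and helper names: pull the duplicated pieces of the bloom-* scripts into one small shared module that each script imports.

```python
# inference_utils.py -- hypothetical shared module; all names are made up to
# illustrate the refactoring idea, they are not from the actual scripts.
from argparse import ArgumentParser, Namespace

def get_args() -> Namespace:
    """Common CLI arguments shared by the accelerate and DS-inference scripts."""
    parser = ArgumentParser()
    parser.add_argument("--model_name", type=str, default="bigscience/bloom")
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--max_new_tokens", type=int, default=100)
    return parser.parse_args()

def print_rank_0(*args) -> None:
    """Print only from the main process so multi-rank logs stay readable."""
    import torch.distributed as dist
    if not dist.is_initialized() or dist.get_rank() == 0:
        print(*args)

# Each script (HF accelerate server, DS-inference server, benchmark) would then
# `from inference_utils import get_args, print_rank_0` instead of duplicating
# these helpers.
```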