kalradivyanshu opened this issue 3 months ago
Technically there shouldn't be any issues, I think, since LLaMA-3 has no architectural differences from LLaMA-2. I will try to add it tomorrow.
Thank you for your prompt response! I am broadly interested in support for more models in DistServe, such as Phi-3. In general, what steps would I have to take to add a new model to DistServe? Furthermore, can I run a single DistServe server with multiple models loaded, and then specify which model to use in each inference request (the way Triton and vLLM support multiple models)?
Thank you a lot for your attention and enthusiasm for this project.
The architecture of DistServe can be divided into two parts: the control plane, and the data plane. The former one is responsible for deciding "which request to serve" and is where we implement our "disaggregation" idea, and the latter one performs calculations. This repo contains code for the control plane. For the data plane, in order to achieve the state-of-the-art (SOTA) performance, DistServe utilizes a pure C++/CUDA implementation, SwiftTransformer.
Generally speaking, the following steps are necessary for adding support for a new model:
- Add a ModelConfig for the model in distserve/config.py.
- Add tokenizer support in distserve/tokenizer.py.
- Add weight-conversion logic in distserve/downloader/converter.py.

If you wish to support a model that has exactly the same architecture as LLaMA-2, congratulations: you can just change model_type in the model's config.json (which you should have after downloading from HuggingFace) to llama and it will work.
Frankly speaking, I am not really satisfied with DistServe's current implementation of the data plane. Despite delivering high performance (up to ~5% speedup over the PyTorch version on small models (7B), and ~1% on large models), the pure C++/CUDA implementation is hard to develop and maintain. It also creates a barrier between DistServe and the broader ecosystem: for example, we cannot use operators and kernels from PyTorch or OpenAI Triton, and it takes much more effort to add support for a new model. A better solution could be to leverage PyTorch with OpenAI Triton, which achieves nearly equivalent performance while reducing the lines of code (LoC) of the data plane by 10x compared to SwiftTransformer. SwiftLLM uses this approach.
Currently DistServe does not support serving multiple models simultaneously. A workaround could be to start multiple DistServe instances.
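The multiple-instance workaround implies the client picks the right instance per model. As a minimal sketch (the model names, ports, and routing function here are all assumptions for illustration, not part of DistServe), a thin client-side router could look like:

```python
# Hypothetical mapping from model name to a dedicated DistServe instance.
# Each instance would be launched separately, one model per instance.
MODEL_ENDPOINTS = {
    "llama-2-7b": "http://localhost:8000",
    "phi-3-mini": "http://localhost:8001",
}


def endpoint_for(model: str) -> str:
    """Return the base URL of the DistServe instance serving `model`."""
    try:
        return MODEL_ENDPOINTS[model]
    except KeyError:
        raise ValueError(f"no DistServe instance is serving {model!r}")
```

A request for a given model would then be sent to `endpoint_for(model)`, emulating the multi-model routing that Triton and vLLM provide natively.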
Oh wow, thank you for such a detailed reply. How mature is SwiftLLM? I briefly went through its code, and I can see you have added things like KV cache swap and separate prefill and decode stages, so I am guessing the plan is to eventually replace SwiftTransformer with SwiftLLM in DistServe? How much more work is needed for that to happen?
Thank you for all your hard work!
I just brought up SwiftLLM as an example of using PyTorch + Triton. SwiftLLM is currently able to launch an API server and perform online serving, but we have no plan to migrate DistServe to SwiftLLM.
Oh, ok. How hard would it be to separate the prefill stage and the decode stage onto different GPUs in SwiftLLM? My main motivation is that I think it will be easier to add new models and make changes in SwiftLLM, and I do want a DistServe-style segregation of prefill and decode. Any tips on how I should proceed would be appreciated, thanks!
I have just checked that DistServe should be able to serve LLaMA-3 without any code modifications. Due to restrictions imposed by Meta, I cannot access meta-llama/Meta-Llama-3-8B, so I ran DistServe on SchizoDev/Llama3-8b-CunnyGPT-16bit and everything works fine. It should support meta-llama/Meta-Llama-3-8B as long as Meta provides pytorch_model.bin or a series of pytorch_model-XXXXX-of-XXXXX.bin files (the safetensors format is not supported yet).
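The weight-file constraint above (PyTorch .bin files, not safetensors) can be checked before launching the server. A hypothetical helper, assuming only the naming conventions described in the comment (`pytorch_model.bin` or five-digit shard names):

```python
import re
from pathlib import Path


def has_supported_weights(checkpoint_dir: str) -> bool:
    """Return True if the checkpoint directory contains weights in a
    format DistServe's converter can read: a single pytorch_model.bin
    or sharded pytorch_model-XXXXX-of-XXXXX.bin files."""
    names = [p.name for p in Path(checkpoint_dir).iterdir()]
    if "pytorch_model.bin" in names:
        return True
    shard = re.compile(r"pytorch_model-\d{5}-of-\d{5}\.bin$")
    return any(shard.fullmatch(name) for name in names)
```

A checkpoint that ships only .safetensors files would fail this check and would need to be re-saved as .bin first.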
Does this system now support the LLaMA-1 architecture?
Hey, love the work you guys have done on DistServe and SwiftTransformer. As far as I can tell it supports Llama-2. How hard will adding Llama-3 models be? I specifically want support for https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.
Any guidance will be really helpful. Thanks!