ZJUICI / text-generation-inference

Large Language Model Text Generation Inference
https://huggingface.github.io/text-generation-inference/
Apache License 2.0

Adapt to the latest version of exllama or exllamav2? #2

Open zTaoplus opened 1 year ago

zTaoplus commented 1 year ago

Feature request

Currently, the exllama version bundled with TGI cannot support act-order models with sharded (multi-GPU) inference.
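A hypothetical sketch (plain Python, no real GPTQ code) of why act order complicates sharding: with `act_order`, the quantization group of each weight row follows a permutation (`g_idx`) instead of being contiguous, so a naive row split across GPUs leaves every shard depending on metadata from all groups. The permutation below is illustrative, not real data.

```python
GROUP_SIZE = 4
NUM_ROWS = 16

# Without act order: group index is a contiguous function of the row.
g_idx_plain = [row // GROUP_SIZE for row in range(NUM_ROWS)]

# With act order: rows are reordered (e.g. by activation magnitude),
# so the group indices are scattered. Illustrative permutation only.
perm = [3, 7, 1, 12, 0, 9, 14, 5, 2, 11, 6, 15, 4, 13, 8, 10]
g_idx_act = [perm[row] // GROUP_SIZE for row in range(NUM_ROWS)]

def shard(rows, num_shards):
    # Split rows evenly across shards, as tensor parallelism would.
    n = len(rows) // num_shards
    return [rows[i * n:(i + 1) * n] for i in range(num_shards)]

for name, g_idx in [("plain", g_idx_plain), ("act_order", g_idx_act)]:
    for i, part in enumerate(shard(g_idx, 2)):
        print(name, "shard", i, "needs groups", sorted(set(part)))
```

Without act order each shard touches a disjoint set of groups (`{0, 1}` and `{2, 3}`); with act order both shards need all four groups, which is what breaks the simple sharding scheme.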

Motivation

I tested exllama at commit 3b013cd53c7d413cf99ca04c7c28dd5c95117c0d with the following command:

command

python example_chatbot.py -d <model path> -un "Jeff" -p prompt_chatbort.txt  -gs 10,12

This loaded the model across the two GPUs.

chatbot result: (screenshot)

nvidia-smi result: (screenshot)

Your contribution

https://github.com/turboderp/exllama/issues/276

edwardzjl commented 1 year ago

According to the linked issue, is it possible to leverage exllama v1 on multiple GPUs?

zTaoplus commented 1 year ago

> According to the linked issue, is it possible to leverage exllama v1 on multiple GPUs?

I have currently only tested the code with exllama v1, which can support multi-GPU running. However, whether it can be integrated into TGI still needs further experimentation.

edwardzjl commented 1 year ago

Really cool, looking forward to this.

zTaoplus commented 1 year ago

Unfortunately, I rebuilt TGI with the new exllama v1 CUDA extension, but it did not achieve the expected results. I also observed that exllama's multi-GPU support is implemented via the `device_map` argument of `from_pretrained` in Transformers (placing whole layers on different GPUs), while TGI uses tensor parallelism for multi-GPU inference, which exllama v1 does not currently support. To be precise, it is not that exllama v1 lacks tensor-parallel support entirely, but that it does not support it in "act order" mode.
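A minimal sketch (plain Python on nested lists, no torch) contrasting the two multi-GPU strategies mentioned above. The "gpu" labels are purely illustrative; `device_map` here refers to the Transformers concept of assigning whole layers to devices, while tensor parallelism splits each weight matrix across devices.

```python
def matmul(a, b):
    # a: m x k, b: k x n -> m x n, on nested lists for illustration.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1.0, 2.0]]                    # 1 x 2 activation
w1 = [[1.0, 0.0], [0.0, 1.0]]       # layer 1 weight (2 x 2)
w2 = [[2.0, 1.0], [1.0, 2.0]]       # layer 2 weight (2 x 2)

# Strategy 1: device_map-style placement -- whole layers live on
# different GPUs; activations move between devices between layers.
h = matmul(x, w1)                   # runs on "gpu:0"
y_pipeline = matmul(h, w2)          # runs on "gpu:1"

# Strategy 2: tensor parallelism (what TGI uses) -- each weight matrix
# is split by columns across GPUs; every GPU computes a slice of every
# layer, and the slices are concatenated afterwards.
w2_gpu0 = [[row[0]] for row in w2]  # first column on "gpu:0"
w2_gpu1 = [[row[1]] for row in w2]  # second column on "gpu:1"
y0 = matmul(h, w2_gpu0)
y1 = matmul(h, w2_gpu1)
y_tp = [y0[0] + y1[0]]              # concatenate the column slices

assert y_tp == y_pipeline           # same result, different sharding
```

Both strategies compute the same output; the difference is that tensor parallelism needs every weight matrix to be splittable along a dimension, which is exactly where act-order quantization metadata gets in the way.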

see https://github.com/ZJUICI/text-generation-inference/issues/1#issuecomment-1724779706

zTaoplus commented 1 year ago

Next, I will attempt to integrate the relevant code from exllama v2 into TGI to see if it's possible.