zTaoplus opened this issue 1 year ago
According to the linked issue it is possible to leverage exllama v1 on multiple GPU?
I have so far only tested the code with exllama v1, which does support multi-GPU execution. However, whether it can be integrated into TGI still needs further experimentation.
Bloody cool, really looking forward to this.
Unfortunately, I rebuilt TGI with the new CUDA extension from exllama v1, but it did not achieve the expected results. I also observed that exllama's multi-GPU support is implemented via the `device_map` argument of `from_pretrained` in Transformers, while TGI uses tensor parallelism for multi-GPU inference, which exllama v1 does not currently support.
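For context, the two multi-GPU strategies place weights very differently. The following is a minimal sketch (plain NumPy, simulated "devices", not TGI's or exllama's actual code): `device_map`-style placement keeps each layer whole on one device and moves activations between them, whereas tensor parallelism splits a single layer's weight column-wise across devices and concatenates the partial outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))   # one input row, hidden size 8
w1 = rng.standard_normal((8, 8))  # layer 1 weight
w2 = rng.standard_normal((8, 8))  # layer 2 weight

# Reference: both layers computed on a single device.
ref = (x @ w1) @ w2

# device_map-style placement: layer 1 lives wholly on "gpu0",
# layer 2 wholly on "gpu1"; the activation moves between devices.
h = x @ w1             # on "gpu0"
out_pipeline = h @ w2  # on "gpu1"

# Tensor parallelism: each weight is split column-wise across the
# devices; every device computes a slice of the output, which is
# then concatenated (in real systems, via an all-gather).
w1_shards = np.split(w1, 2, axis=1)
h_tp = np.concatenate([x @ s for s in w1_shards], axis=1)
w2_shards = np.split(w2, 2, axis=1)
out_tp = np.concatenate([h_tp @ s for s in w2_shards], axis=1)

assert np.allclose(out_pipeline, ref)
assert np.allclose(out_tp, ref)
```

Both strategies produce the same result here; the difference is that tensor parallelism requires every layer's weight (including quantized weights) to be shardable, which is where the act-order problem below comes in.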
Sorry, a correction: it's not that multi-GPU is unsupported, but that the "act order" mode is not supported.
see https://github.com/ZJUICI/text-generation-inference/issues/1#issuecomment-1724779706
Next, I will attempt to integrate the relevant code from exllama v2 into TGI to see if it's possible.
Feature request
Currently, the exllama version in TGI cannot support act order together with sharded (multi-GPU) execution.
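A rough illustration of why act order and sharding clash, sketched in NumPy with a hand-written `g_idx` (illustrative values, not taken from a real checkpoint): without act order, GPTQ's `g_idx` assigns contiguous runs of input channels to quantization groups, so a column-wise tensor-parallel split also cleanly splits the per-group scales and zeros; with act order, channels are reordered by activation magnitude before grouping, so each shard references groups scattered across the whole layer.

```python
import numpy as np

in_features, group_size, world_size = 16, 4, 2
n_groups = in_features // group_size  # 4 quantization groups

# Without act order: contiguous group assignment [0,0,0,0, 1,1,1,1, ...].
g_idx_plain = np.repeat(np.arange(n_groups), group_size)

# With act order: channels are permuted by (fake) activation magnitude,
# scattering group membership across the layer. Hand-written here so the
# example is deterministic.
g_idx_act = np.array([0, 3, 1, 2, 0, 1, 3, 2, 1, 0, 2, 3, 2, 1, 0, 3])

# Shard the input dimension across 2 "GPUs", as tensor parallelism does.
shards_plain = np.split(g_idx_plain, world_size)
shards_act = np.split(g_idx_act, world_size)

# Without act order, each shard touches only its own contiguous groups,
# so the scale/zero tables can be split along with the weights.
assert all(len(np.unique(s)) == n_groups // world_size for s in shards_plain)

# With act order, each shard references groups from all over the layer,
# so every GPU would need (nearly) the full scale/zero tables and the
# naive split no longer works.
assert all(len(np.unique(s)) == n_groups for s in shards_act)
```

This is only a toy model of the bookkeeping problem; the real kernels also have to handle the permuted weight rows themselves, not just the group indices.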
Motivation
I tested commit 3b013cd53c7d413cf99ca04c7c28dd5c95117c0d of exllama.

Command:

This loaded the model onto the 2 GPUs.

Chatbot result:

nvidia-smi result:
Your contribution
https://github.com/turboderp/exllama/issues/276