Closed: codonna9 closed this issue 8 months ago
Did you solve this question?
No, I haven't been able to solve this. I guess we'll have to wait for the team to release model-parallel inference code, but that's probably not easy to do.
Hi, please consider using quantization. See the following for example: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/demos/multi_turn_mm_box.py#L77
If you don't want to quantize the model: we have not tried 24 GB GPUs, but as a rough estimate, the GPU memory cost of hosting SPHINX on two GPUs should be close to 24 GB (without quantization). So you may be able to run it without quantization after some optimization, but it would be a very tight fit.
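To see why quantization makes the difference on 24 GB cards, here is a back-of-the-envelope weight-memory estimate. This is only illustrative arithmetic: the 13B parameter count is an assumption (roughly a LLaMA2-13B-class backbone, not SPHINX's exact size), and it ignores activations, the KV cache, and the visual encoders.

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Rough memory footprint of the weights alone, in GiB.

    Ignores activations, KV cache, and framework overhead,
    so real usage will be noticeably higher.
    """
    return num_params * bits_per_param / 8 / 1024**3

# Hypothetical ~13B-parameter model (an assumption for illustration)
params = 13e9
fp16_gb = model_memory_gb(params, 16)  # half precision
int4_gb = model_memory_gb(params, 4)   # 4-bit quantized
print(f"fp16 weights: {fp16_gb:.1f} GiB, 4-bit weights: {int4_gb:.1f} GiB")
```

At fp16 the weights alone already land around 24 GiB, which explains the OOM once activations are added; at 4 bits they drop to roughly a quarter of that, leaving comfortable headroom on two 24 GB GPUs.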
@quizD @codonna9 Please refer to issue 114. https://github.com/Alpha-VLLM/LLaMA2-Accessory/issues/114
Thanks a lot for the quantization suggestion. I'll try that, and I'll probably also explore ways to dispatch the model weights across multiple GPUs, like what they do here: https://huggingface.co/THUDM/cogvlm-chat-hf
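The CogVLM-style approach boils down to building a device map that assigns contiguous runs of layers to different GPUs (in Hugging Face tooling this is what `accelerate`'s `infer_auto_device_map` / `dispatch_model` automate). A minimal sketch of the partitioning idea, assuming per-layer weight sizes are known; the function name and the greedy contiguous-split policy here are my own illustration, not LLaMA2-Accessory's API:

```python
def make_device_map(layer_sizes, num_gpus):
    """Greedy sequential split: place consecutive layers on one GPU
    until it holds ~1/num_gpus of the total weight, then move on.

    Keeping layers contiguous means activations cross a device
    boundary only once per split, which is the usual pipeline-style
    layout for multi-GPU inference.
    """
    total = sum(layer_sizes)
    budget = total / num_gpus
    device_map, gpu, used = {}, 0, 0.0
    for i, size in enumerate(layer_sizes):
        if used + size > budget and gpu < num_gpus - 1:
            gpu += 1       # current GPU is full; start filling the next
            used = 0.0
        device_map[f"layers.{i}"] = gpu
        used += size
    return device_map

# Example: 8 equally sized layers split across 2 GPUs
print(make_device_map([1.0] * 8, 2))
# → layers 0-3 on GPU 0, layers 4-7 on GPU 1
```

In practice the real per-layer sizes are uneven (embeddings, vision tower, LM head), which is why `accelerate` measures them from the model's meta tensors instead of assuming uniformity.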
Thanks a lot! I'll follow that issue
Hello, thank you so much for releasing the models, paper, and code. I tried SPHINX's demo and I'm very impressed with the results. However, when I ran "inference.py" with two A30 GPUs (24 GB each) and MODEL_PARALLEL_SIZE = 2, I got torch.cuda.OutOfMemoryError.
Are there any other ways I can run the SPHINX models on two A30 GPUs?