Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

OutOfMemoryError when running Sphinx with 2 A30 GPUs #110

Closed: codonna9 closed this issue 8 months ago

codonna9 commented 8 months ago

Hello, and thank you so much for releasing the models, paper, and code. I tried the SPHINX demo and I'm very impressed with the results. However, when I ran "inference.py" with two A30 GPUs (24 GB of memory each) and MODEL_PARALLEL_SIZE = 2, I got torch.cuda.OutOfMemoryError.

Is there any other way I can run SPHINX models with two A30 GPUs?
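
In case it helps, this is a quick way to confirm how much memory each card actually has free before loading (plain PyTorch, nothing SPHINX-specific):

```python
# Report total and currently free memory per GPU; each A30 should show
# roughly 24 GB total, which is the budget per model-parallel rank here.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    free_bytes, total_bytes = torch.cuda.mem_get_info(i)
    print(
        f"GPU {i} ({props.name}): "
        f"{free_bytes / 1024**3:.1f} GiB free / {total_bytes / 1024**3:.1f} GiB total"
    )
```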

quizD commented 8 months ago

Did you manage to solve this?

codonna9 commented 8 months ago

No, I haven't been able to solve it. I guess we'll have to wait for the team to release some model-parallel inference code, but that's probably not easy to do.

ChrisLiu6 commented 8 months ago

Hi, please consider using quantization. See the following for an example: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/demos/multi_turn_mm_box.py#L77

If you don't want to quantize the model: we have not tried 24 GB GPUs, but as a rough estimate, the GPU memory cost for hosting SPHINX on two GPUs (without quantization) should be close to 24 GB. So you may be able to run it without quantization after some optimization, but overall it is a really tight fit.
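
As a rough, back-of-the-envelope illustration of why quantization makes the difference here (the ~13B language-model parameter count is an assumption, and visual modules, activations, and the KV cache come on top of the weights):

```python
# Back-of-the-envelope estimate of weight memory per GPU when the model is
# split across two ranks. The ~13B parameter count is an assumption for the
# language-model part of SPHINX; it ignores visual modules, activations,
# the KV cache, and quantization bookkeeping overhead.
params = 13e9
gpus = 2

fp16_gib = params * 2 / gpus / 1024**3    # 2 bytes per weight
nf4_gib = params * 0.5 / gpus / 1024**3   # ~4 bits per weight

print(f"fp16 weights per GPU:  {fp16_gib:.1f} GiB")   # ~12.1 GiB
print(f"4-bit weights per GPU: {nf4_gib:.1f} GiB")    # ~3.0 GiB
```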

gaopengpjlab commented 8 months ago

@quizD @codonna9 Please refer to issue #114: https://github.com/Alpha-VLLM/LLaMA2-Accessory/issues/114

codonna9 commented 8 months ago

> Hi, please consider using quantization. See the following for an example: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/demos/multi_turn_mm_box.py#L77
>
> If you don't want to quantize the model: we have not tried 24 GB GPUs, but as a rough estimate, the GPU memory cost for hosting SPHINX on two GPUs (without quantization) should be close to 24 GB. So you may be able to run it without quantization after some optimization, but overall it is a really tight fit.

Thanks a lot for the quantization suggestion. I'll try that, and I'll probably also explore ways to dispatch the model weights across multiple GPUs, like what they do here: https://huggingface.co/THUDM/cogvlm-chat-hf
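
Roughly the pattern that model card shows, using HuggingFace accelerate (the checkpoint path and max_memory values below are placeholders, and I don't know yet whether SPHINX's checkpoint format can be loaded this way):

```python
# Sketch of the layer-dispatch pattern from the cogvlm-chat-hf model card,
# using HuggingFace accelerate. CHECKPOINT_DIR is a placeholder and the code
# assumes an HF-format checkpoint; whether SPHINX's checkpoint layout fits
# this loading path is exactly what still needs checking.
import torch
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

CHECKPOINT_DIR = "path/to/hf_checkpoint"  # placeholder, not a real SPHINX path

config = AutoConfig.from_pretrained(CHECKPOINT_DIR, trust_remote_code=True)
with init_empty_weights():
    # Build the model skeleton without allocating any weight memory yet.
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Cap per-GPU usage below 24 GB so activations and the KV cache still fit,
# then stream the checkpoint shards onto the two A30s in fp16.
device_map = infer_auto_device_map(model, max_memory={0: "20GiB", 1: "20GiB"})
model = load_checkpoint_and_dispatch(
    model, CHECKPOINT_DIR, device_map=device_map, dtype=torch.float16
)
model.eval()
```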

codonna9 commented 8 months ago

> @quizD @codonna9 Please refer to issue #114.

Thanks a lot! I'll follow that issue.