Closed: RonanKMcGovern closed this issue 9 months ago.
This happens because TGI activates EETQ and vLLM at the same time, and vLLM by default pre-allocates 90% of the GPU's memory for model weights and KV-cache blocks for inference.
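For reference, vLLM exposes that 90% default as a tunable parameter. A minimal sketch, assuming a recent vLLM release (the model id below is just illustrative):

```python
# Sketch only: vLLM's LLM constructor takes gpu_memory_utilization,
# which defaults to 0.9, i.e. vLLM reserves ~90% of VRAM up front
# for weights plus pre-allocated KV-cache blocks.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",   # illustrative model id
    gpu_memory_utilization=0.3,  # cap vLLM at ~30% of GPU memory
)
```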
Mirror issue on the vLLM tracker: https://github.com/vllm-project/vllm/issues/601
Since this behavior comes from vLLM rather than EETQ, I'm closing this issue.
I'm running on RunPod with an A6000 and this TGI Docker image:
When the model is loaded, GPU memory used climbs to 97%. For an 8-bit model I would expect usage of only about 7 GB, which is roughly 15% of an A6000's 48 GB of VRAM.
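That expectation checks out arithmetically. A back-of-envelope sketch, assuming a 7B-parameter model (which is what the ~7 GB figure implies at 8-bit):

```python
# Back-of-envelope check: at 8-bit quantization (e.g. EETQ int8),
# each parameter takes one byte, so the weights alone should occupy
# roughly params * 1 byte. The parameter count is an assumption here.
params = 7e9            # assumed 7B-parameter model
bytes_per_param = 1     # 8-bit quantization
vram_total_gb = 48      # NVIDIA A6000

weights_gb = params * bytes_per_param / 1e9
print(f"weights: ~{weights_gb:.0f} GB "
      f"({weights_gb / vram_total_gb:.0%} of {vram_total_gb} GB VRAM)")
# -> weights: ~7 GB (15% of 48 GB VRAM)
```

So the observed 97% points to the pre-allocation described above rather than the weights themselves.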