NetEase-FuXi / EETQ

Easy and Efficient Quantization for Transformers
Apache License 2.0

Why does EETQ take up all VRAM #3

Closed RonanKMcGovern closed 9 months ago

RonanKMcGovern commented 9 months ago

I'm running on runpod with an A6000 and TGI docker image:

--model-id mistralai/Mistral-7B-Instruct-v0.1 --trust-remote-code --port 8080 --max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 --quantize eetq

When the model is loaded, GPU memory usage goes up to 97%. I would expect, for an 8-bit 7B model, memory usage of only about 7 GB, which is around 15% of an A6000's 48 GB VRAM.

Lin-sudo commented 9 months ago

This is because TGI activates eetq and vLLM simultaneously, and vLLM pre-allocates 90% of GPU memory as KV cache blocks for model inference.

mirror issue: https://github.com/vllm-project/vllm/issues/601
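For context, vLLM-style memory planning reserves a fixed fraction of total VRAM (controlled by a `gpu_memory_utilization` setting, default 0.9) and gives whatever the weights don't use to the KV cache, so near-full usage is expected regardless of model size. A rough sketch of the arithmetic, using the numbers from this thread (the helper function below is illustrative, not vLLM's actual API):

```python
def kv_cache_budget_gb(total_vram_gb: float,
                       model_weights_gb: float,
                       gpu_memory_utilization: float = 0.9) -> float:
    """Illustrative sketch of vLLM-style memory planning.

    A fixed fraction of total VRAM is reserved up front; the portion
    not occupied by model weights is handed to the KV cache.
    """
    reserved = total_vram_gb * gpu_memory_utilization
    return reserved - model_weights_gb

# A6000 (48 GB) with an ~7 GB 8-bit Mistral-7B:
# 48 * 0.9 - 7 = 36.2 GB goes to the KV cache, so the GPU
# reports ~90%+ usage even though the weights are small.
budget = kv_cache_budget_gb(48, 7)
print(f"KV cache budget: {budget:.1f} GB")
```

The fraction is tunable: lowering `gpu_memory_utilization` when launching vLLM shrinks the KV cache reservation at the cost of supporting fewer concurrent sequences.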

SidaZh commented 9 months ago

This issue is caused by vLLM rather than EETQ, so I'm closing it.