MiuLab / Taiwan-LLM

Traditional Mandarin LLMs for Taiwan
https://twllm.com
Apache License 2.0

Minimum GPU device requirement for inference (with OOM issue) #24

Closed · nigue3025 closed this issue 10 months ago

nigue3025 commented 10 months ago

Hi, I am very new to this. I only have a single RTX 2080 Ti (with just 11 GB of VRAM) to run the model with text-generation-inference. After executing the .sh file, GPU memory consumption gradually increases and the message "Waiting for shard to be ready ... rank=0" keeps appearing. It eventually fails with "torch.cuda.OutOfMemoryError: CUDA out of memory...". I tried setting PYTORCH_CUDA_ALLOC_CONF to different values, but it still does not work. Does this mean I have to upgrade to a card with more VRAM (e.g. an RTX 4090 with 24 GB) if I insist on running this 13B model (rather than the 4-bit GPTQ model) on GPU? Any advice would be appreciated.
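
For context, the allocator setting mentioned above is usually exported before launching; the exact values tried here are not stated, and the script name below is only a stand-in for the .sh file:

# The values actually tried are not stated; max_split_size_mb is one commonly used allocator option.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Hypothetical name standing in for the launch script mentioned above.
./run_tgi.sh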

PenutChen commented 10 months ago

I recommend 4-bit quantization when using low-memory GPUs, e.g.

docker run --gpus 'device=0' -p 8085:80 \
    -v ./Models:/Models \
    ghcr.io/huggingface/text-generation-inference:sha-5485c14 \
    --model-id /Models/TaiwanLlama-13B \
    --quantize "bitsandbytes-nf4" \
    --max-input-length 1500 \
    --max-total-tokens 2000 \
    --max-batch-prefill-tokens 1500 \
    --max-batch-total-tokens 2000 \
    --max-best-of 1 \
    --max-concurrent-requests 128

Since I don't possess an 11GB RTX GPU, I simulate this situation using the --cuda-memory-fraction 0.45 parameter. In this scenario, it consumes about 10,000 MiB of my GPU memory. I believe the paged attention mechanism of TGI will consume all of the remaining GPU memory.
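
Once the container is up (assuming the 8085:80 port mapping from the command above), you can sanity-check the deployment with a request to TGI's /generate endpoint. The prompt and max_new_tokens value here are just placeholders, and the model's chat template is omitted:

curl http://localhost:8085/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "請用繁體中文自我介紹", "parameters": {"max_new_tokens": 200}}'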

nigue3025 commented 10 months ago

Great, it works! By the way, to get the RTX 2080 Ti working I slightly reduced the limits: --max-input-length 1000, --max-total-tokens 1500, --max-batch-prefill-tokens 1000, --max-batch-total-tokens 1500 (full command below).
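
For reference, a sketch of the full launch command with those reduced limits, keeping the same image tag and model path as in the suggestion above:

docker run --gpus 'device=0' -p 8085:80 \
    -v ./Models:/Models \
    ghcr.io/huggingface/text-generation-inference:sha-5485c14 \
    --model-id /Models/TaiwanLlama-13B \
    --quantize "bitsandbytes-nf4" \
    --max-input-length 1000 \
    --max-total-tokens 1500 \
    --max-batch-prefill-tokens 1000 \
    --max-batch-total-tokens 1500 \
    --max-best-of 1 \
    --max-concurrent-requests 128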

Once again, thanks for your kind help!!