Open maxin9966 opened 1 week ago
Please check out the NOTE part in https://lmdeploy.readthedocs.io/en/latest/get_started/get_started.html. The KV cache is allocated as a ratio of the FREE GPU memory remaining after the model is loaded.
@lvhan028 --cache-max-entry-count 0.1
I set it to 0.1; with tp=2, the two graphics cards each take up over 7G, and with tp=1, the single graphics card also takes up over 7G.
Assume ONE GPU's total memory is T, the model's memory footprint is S, the hyper-parameter --cache-max-entry-count is lambda, and the number of GPUs in tensor parallelism is P. According to LMDeploy's memory management policy, lambda * (T - S / P) will be allocated for the KV cache on each GPU, no matter whether the model is quantized or not.
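As a concrete check, here is a minimal Python sketch of that formula; the helper name and the sizes (T = 24 GB, S = 7 GB) are hypothetical, not measurements from this issue:

```python
def kv_cache_per_gpu(total_gb, model_gb, ratio, tp):
    """Per-GPU KV cache under the policy above: lambda * (T - S / P)."""
    return ratio * (total_gb - model_gb / tp)

# Hypothetical 24 GB GPUs, a 7 GB model footprint,
# and --cache-max-entry-count 0.1
print(kv_cache_per_gpu(24.0, 7.0, ratio=0.1, tp=1))  # ~1.70 GB on the single GPU
print(kv_cache_per_gpu(24.0, 7.0, ratio=0.1, tp=2))  # ~2.05 GB on each of two GPUs
```

Note that the per-GPU KV cache budget grows slightly as tp increases, because sharding shrinks S / P and leaves more free memory on each card.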
@lvhan028 I know the formula, but the actual measurements do not match it. Same command, changing only tp: with tp=1, the single card uses more than 7G of VRAM; with tp=2 on dual cards, each card uses more than 7G.
Am I missing some startup parameters?
The token_embedding and lm_head weights are not split and distributed across GPUs; each GPU owns a full copy. PR #2252 resolves this and will be released next week. You may try v0.6.0.
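To see why replicating token_embedding and lm_head keeps per-GPU usage from halving, here is a back-of-the-envelope sketch; the split between replicated and shardable weights is invented for illustration, not measured from glm-4-9b-chat:

```python
def weights_per_gpu(replicated_gb, shardable_gb, tp):
    # Replicated tensors (token_embedding, lm_head) sit on every GPU
    # in full; only the remaining weights are split across the tp group.
    return replicated_gb + shardable_gb / tp

# Invented split of a ~6.7 GB quantized-weight footprint
print(weights_per_gpu(1.2, 5.5, tp=1))  # 6.70 GB
print(weights_per_gpu(1.2, 5.5, tp=2))  # 3.95 GB per GPU, not 3.35 GB
```

On top of the weights, each GPU also allocates its own KV cache and runtime buffers per the formula above, which narrows the gap between the tp=1 and tp=2 per-GPU totals further.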
Motivation
CUDA_VISIBLE_DEVICES=3,4 lmdeploy serve api_server /home/ma/work/modelscope/glm-4-9b-chat-GPTQ-Int4 --backend turbomind --model-format gptq --server-port 11231 --tp 2 --session-len 16500 --cache-max-entry-count 0.1 --model-name gpt --max-batch-size 64
Regarding the memory usage when --tp 2 is enabled: why does the total memory usage double when tp equals 2? Each GPU is loading a full model individually. Shouldn't the model be split and distributed across the different GPUs instead?