Open maxin9966 opened 1 week ago
Please check out the NOTE part in https://lmdeploy.readthedocs.io/en/latest/get_started/get_started.html. The KV cache is allocated as a ratio of the FREE GPU memory remaining after the model is loaded.
@lvhan028 --cache-max-entry-count 0.1
I set it to 0.1; with tp=2, the two graphics cards each take up over 7G, and with tp=1, the single graphics card also takes up over 7G.
Assume ONE GPU's total memory is T, the model's memory footprint is S, the hyper-parameter --cache-max-entry-count is lambda, and the number of GPUs in tensor parallelism is P. According to LMDeploy's memory management policy, lambda * (T - S / P) will be allocated for the KV cache on each GPU, no matter whether the model is quantized or not.
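As a concrete check, here is a minimal Python sketch of that formula; the helper name and the sizes (T = 24 GB, S = 7 GB) are hypothetical, not measurements from this issue:

```python
def kv_cache_per_gpu(total_gb, model_gb, ratio, tp):
    """Per-GPU KV cache under the policy above: lambda * (T - S / P)."""
    return ratio * (total_gb - model_gb / tp)

# Hypothetical 24 GB GPUs, a 7 GB model footprint,
# and --cache-max-entry-count 0.1
print(kv_cache_per_gpu(24.0, 7.0, ratio=0.1, tp=1))  # ~1.70 GB on the single GPU
print(kv_cache_per_gpu(24.0, 7.0, ratio=0.1, tp=2))  # ~2.05 GB on each of two GPUs
```

Note that the per-GPU KV cache budget grows slightly as tp increases, because sharding shrinks S / P and leaves more free memory on each card.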
@lvhan028 I know the formula, but the actual measurements do not match it. Same command, changing only tp: with tp=1, the single card uses more than 7G of VRAM; with tp=2 on dual cards, each card uses more than 7G.
Am I missing some startup parameters?
The token_embedding and lm_head weights are not split and distributed across GPUs; each GPU owns a full copy. PR #2252 resolves this and will be released next week. You may try v0.6.0.
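To see why replicating token_embedding and lm_head keeps per-GPU usage from halving, here is a back-of-the-envelope sketch; the split between replicated and shardable weights is invented for illustration, not measured from glm-4-9b-chat:

```python
def weights_per_gpu(replicated_gb, shardable_gb, tp):
    # Replicated tensors (token_embedding, lm_head) sit on every GPU
    # in full; only the remaining weights are split across the tp group.
    return replicated_gb + shardable_gb / tp

# Invented split of a ~6.7 GB quantized-weight footprint
print(weights_per_gpu(1.2, 5.5, tp=1))  # 6.70 GB
print(weights_per_gpu(1.2, 5.5, tp=2))  # 3.95 GB per GPU, not 3.35 GB
```

On top of the weights, each GPU also allocates its own KV cache and runtime buffers per the formula above, which narrows the gap between the tp=1 and tp=2 per-GPU totals further.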
Motivation
CUDA_VISIBLE_DEVICES=3,4 lmdeploy serve api_server /home/ma/work/modelscope/glm-4-9b-chat-GPTQ-Int4 --backend turbomind --model-format gptq --server-port 11231 --tp 2 --session-len 16500 --cache-max-entry-count 0.1 --model-name gpt --max-batch-size 64
Regarding the memory usage when --tp 2 is enabled: why does the total memory usage double when tp equals 2? Each GPU is loading a full model individually. Shouldn't the model be split and distributed across the different GPUs instead?