WCwalker opened this issue 4 months ago
In the glm4 model, there are only 2 key-value (KV) heads available, making it impossible to evenly partition them among 4 GPUs. Please set tp=2 or tp=1. The chat template name is supposed to be glm4 in the latest version.
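To make the constraint concrete, here is a small illustrative check, not lmdeploy's actual code; the head counts come from the model config printed later in this thread (head_num = 32, kv_head_num = 2):

```python
def check_tp(kv_head_num: int, tp: int) -> None:
    """Tensor parallelism shards the KV heads across GPUs, so tp must divide kv_head_num."""
    if kv_head_num % tp != 0:
        valid = [t for t in range(1, kv_head_num + 1) if kv_head_num % t == 0]
        raise ValueError(
            f"kv_head_num={kv_head_num} cannot be split evenly over tp={tp}; "
            f"valid tp values are {valid}"
        )

check_tp(kv_head_num=2, tp=2)   # fine
check_tp(kv_head_num=2, tp=4)   # raises: 2 KV heads cannot be partitioned across 4 GPUs
```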
Excuse me, vllm has a parameter called 'tensor-parallel-size' that can be set to 4 or 8 to run glm-9b. What is the difference between that and 'tp'?
Sorry, I don't know how vllm implements it.
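For what it's worth, lmdeploy's --tp is the same idea as vllm's tensor-parallel-size: the degree of tensor parallelism handed to the engine config. A minimal sketch with lmdeploy's Python API (the TurbomindEngineConfig fields match the ones printed in the server log further down; the prompt is only an example):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# tp corresponds to the CLI flag --tp; for glm-4-9b-chat it must divide kv_head_num=2,
# so only tp=1 or tp=2 are possible.
engine_cfg = TurbomindEngineConfig(tp=2, cache_max_entry_count=0.1)
pipe = pipeline('/app/models/glm-4-9b-chat', backend_config=engine_cfg)
print(pipe(['Hello, who are you?']))
```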
Understood. I currently have four 3080 Ti graphics cards with 12GB of VRAM each, and I want to start the model with lmdeploy serve api_server. If I use --tp 4, it reports a floating point exception, and if I use --tp 2, it reports insufficient VRAM. Are there any solutions? Thank you.
Have you tried "--cache-max-entry-count 0.1" when using "--tp 2"?
I tried --cache-max-entry-count 0.01 and it still failed, but "lmdeploy chat /app/models/glm-4-9b-chat --tp 2" actually works; I just can't use lmdeploy serve api_server. Thanks
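Rough arithmetic suggests why lowering --cache-max-entry-count alone may not be enough on 12GB cards: with tp=2 each GPU still holds about half of the bf16 weights before any KV cache is reserved. The parameter count (~9.4B), the ~1 GiB runtime overhead, and the reading of cache_max_entry_count as a fraction of the memory left free after loading weights are assumptions, not confirmed in this thread:

```python
# Back-of-the-envelope per-GPU memory for glm-4-9b-chat in bf16 with tp=2 (all values are estimates).
params = 9.4e9                 # assumed parameter count
bytes_per_param = 2            # bf16
tp = 2
weights_per_gpu = params * bytes_per_param / tp / 1024**3
print(f"weights per GPU ~ {weights_per_gpu:.1f} GiB")        # ~8.8 GiB

gpu_mem = 12.0                 # 3080 Ti
runtime_overhead = 1.0         # CUDA context, buffers, etc. (assumption)
free_mem = gpu_mem - weights_per_gpu - runtime_overhead
kv_budget = 0.1 * free_mem     # --cache-max-entry-count 0.1 of the remaining memory (assumption)
print(f"KV cache budget ~ {kv_budget * 1024:.0f} MiB")       # a few hundred MiB at most
```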
Could you add --log-level INFO when you launch the server and share the error log?
CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server /app/models/glm-4-9b-chat --server-port 11434 --model-name glm4 --tp 2 \
    --cache-max-entry-count 0.01 --log-level INFO

log.txt (attached)
Why did this 9b model use two cards, each with 70GB of VRAM = =
I used the default --cache-max-entry-count 0.8
It's an A100 80G, and your GPU is an A800 80G. The memory is quite enough to launch the service with the default value. I have no idea why it doesn't work on your side. I'd better add INFO logs when mallocing memory.
Mine is 3080 Ti 12G *2, which I suppose is enough for a 9b model, as I can use "lmdeploy chat" to launch the model and chat. It's quite strange that "lmdeploy serve" needs so much memory.
Oh, you are not the user who opened this issue 😂
Can you try "--max-batch-size 1" on your side? "lmdeploy chat" sets "--max-batch-size" to 1 by default, while "lmdeploy serve" makes it 128.
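The same experiment expressed through the Python API, mainly to show which engine fields the two CLI defaults map to (this mirrors the TurbomindEngineConfig printed in the log below and is only an illustration, not a verified fix):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# "lmdeploy chat"-like setting: one concurrent sequence.
chat_like = TurbomindEngineConfig(tp=2, max_batch_size=1, cache_max_entry_count=0.1)

# "lmdeploy serve api_server" default: up to 128 concurrent sequences, which
# reserves much larger persistent buffers before any request arrives.
serve_like = TurbomindEngineConfig(tp=2, max_batch_size=128, cache_max_entry_count=0.1)

pipe = pipeline('/app/models/glm-4-9b-chat', backend_config=chat_like)
```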
(lmdeploy) (base) root@172-16-103-221:/app/code# CUDA_VISIBLE_DEVICES=5,6 lmdeploy serve api_server /app/models/glm-4-9b-chat --server-port 11434 --model-name glm4 --tp 2 \
    --max-batch-size 1 --cache-max-entry-count 0.1 --log-level INFO
2024-07-26 12:13:24,537 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='glm4', model_format=None, tp=2, session_len=None, max_batch_size=1, cache_max_entry_count=0.1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-07-26 12:13:24,537 - lmdeploy - INFO - input chat_template_config=None
2024-07-26 12:13:24,599 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='glm4', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-26 12:13:24,599 - lmdeploy - INFO - model_source: hf_model
2024-07-26 12:13:24,599 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
2024-07-26 12:13:25,948 - lmdeploy - INFO - model_config:
[llama]
model_name = glm4
model_arch = ChatGLMModel
tensor_para_size = 2
head_num = 32
kv_head_num = 2
vocab_size = 151552
num_layer = 40
inter_size = 13696
norm_eps = 1.5625e-07
attn_bias = 1
start_id = 0
end_id = 151329
session_len = 131080
weight_type = bf16
rotary_embedding = 64
rope_theta = 5000000.0
size_per_head = 128
group_size = 0
max_batch_size = 1
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.1
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 17
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 131072
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] max_context_token_num = 131080.
2024-07-26 12:13:27,045 - lmdeploy - WARNING - get 643 model params
2024-07-26 12:13:35,806 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_name='glm4', model_format=None, tp=2, session_len=None, max_batch_size=1, cache_max_entry_count=0.1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[TM][WARNING] Device 0 peer access Device 1 is not available.
[TM][WARNING] Device 1 peer access Device 0 is not available.
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 1 MB
[TM][INFO] [BlockManager] block_size = 1 MB
[TM][INFO] [BlockManager] max_block_count = 115
[TM][INFO] [BlockManager] max_block_count = 115
[TM][INFO] [BlockManager] chunk_size = 115
[TM][INFO] [BlockManager] chunk_size = 115
[TM][WARNING] No enough blocks for session_len (131080), session_len truncated to 7360.
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/turbomind.py", line 398, in _create_model_instance
    model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231

Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/turbomind.py", line 398, in _create_model_instance
    model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231
= = , I just can't make it work.
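For the record, the numbers in the log line up with memory being nearly exhausted by the weights before the KV cache and per-instance buffers are allocated; a rough reading (the bf16 parameter estimate and the workspace interpretation are assumptions, not confirmed here):

```python
# Values taken from the log above; the interpretation is an estimate, not a diagnosis.
cache_block_seq_len = 64     # tokens per KV-cache block
max_block_count = 115        # blocks the BlockManager managed to allocate per rank
print(cache_block_seq_len * max_block_count)   # 7360 -> matches "session_len truncated to 7360"

# With only 115 x 1 MB of KV cache left out of 12 GB, roughly 9 GiB of bf16
# weights per rank (9B model split over tp=2, an estimate) plus the engine's
# workspace leave no headroom, so create_model_instance fails with
# "CUDA runtime error: out of memory".
```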
It looks like we really need to put more effort into memory management. Sorry for the inconvenience.
May be fixed by #2201
Checklist
Describe the bug
[1] 2771397 floating point exception lmdeploy serve api_server --backend turbomind --model-name chatglm4 --tp 4
Reproduction
lmdeploy serve api_server /home/mingqiang/model/model_file/origin_model/glm-4-9b-chat --backend turbomind --model-name chatglm4 --tp 4 --server-port 10000 --cache-max-entry-count 0.1
Environment
Error traceback
No response