InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Error when running glm4-9b #2089

Open WCwalker opened 2 months ago

WCwalker commented 2 months ago


Describe the bug

[1] 2771397 floating point exception lmdeploy serve api_server --backend turbomind --model-name chatglm4 --tp 4

Reproduction

lmdeploy serve api_server /home/mingqiang/model/model_file/origin_model/glm-4-9b-chat --backend turbomind --model-name chatglm4 --tp 4 --server-port 10000 --cache-max-entry-count 0.1

Environment

sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA A800 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.1+
transformers: 4.40.1
gradio: Not Found
fastapi: 0.110.2
pydantic: 2.7.1
triton: 2.2.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PIX     PIX     0-27,56-83      0               N/A
GPU1    PIX      X      PIX     PIX     0-27,56-83      0               N/A
GPU2    PIX     PIX      X      PIX     0-27,56-83      0               N/A
GPU3    PIX     PIX     PIX      X      0-27,56-83      0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

No response

lvhan028 commented 2 months ago

In the glm4 model, there are only 2 key-value (KV) heads available, making it impossible to evenly partition among 4 GPUs. Please set tp=2 or tp=1. The chat template name is supposed to be glm4 in the latest version.
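
Put differently, the requirement is kv_head_num % tp == 0. Below is a minimal pre-flight check along those lines, offered as an illustrative sketch rather than anything from lmdeploy itself; the config keys are assumptions (HF configs commonly expose num_key_value_heads, and GLM-4's config is expected to use multi_query_group_num, which is 2 for glm-4-9b-chat):

```python
# Pre-flight sketch (not part of lmdeploy): verify that the KV-head count in a
# model's config.json divides evenly by the requested tensor-parallel degree.
# The key names below are assumptions about the checkpoint's config.json.
import json
import sys

def check_tp(model_dir: str, tp: int) -> None:
    with open(f"{model_dir}/config.json") as f:
        cfg = json.load(f)
    kv_heads = cfg.get("num_key_value_heads") or cfg.get("multi_query_group_num")
    if kv_heads is None:
        sys.exit("could not find a KV-head count in config.json")
    if kv_heads % tp != 0:
        sys.exit(f"kv_head_num={kv_heads} is not divisible by tp={tp}; "
                 f"try tp={kv_heads} or tp=1 instead")
    print(f"OK: {kv_heads} KV heads split evenly across tp={tp} GPUs")

if __name__ == "__main__":
    check_tp("/home/mingqiang/model/model_file/origin_model/glm-4-9b-chat", tp=4)
```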

SquareMask commented 1 month ago

> In the glm4 model, there are only 2 key-value (KV) heads available, making it impossible to evenly partition among 4 GPUs. Please set tp=2 or tp=1. The chat template name is supposed to be glm4 in the latest version.

Excuse me, in vLLM there is a parameter called 'tensor-parallel-size' that can be set to 4 or 8 to run glm-9b. What is the difference between that and 'tp'?

lvhan028 commented 1 month ago

Sorry, I don't know how vllm implements it.

SquareMask commented 1 month ago

> Sorry, I don't know how vllm implements it.

Understood. I currently have four 3080 Ti graphics cards with 12 GB of VRAM each, and I want to launch the server with lmdeploy serve api_server. If I use --tp 4, it reports a floating point exception, and if I use --tp 2, it reports insufficient VRAM. Are there any solutions? Thank you.

lvhan028 commented 1 month ago

Have you tried "--cache-max-entry-count 0.1" when using "--tp 2"?

SquareMask commented 1 month ago

> Have you tried "--cache-max-entry-count 0.1" when using "--tp 2"?

I tried --cache-max-entry-count 0.01 and it still failed, but "lmdeploy chat /app/models/glm-4-9b-chat --tp 2" actually works; I just can't use lmdeploy serve api_server. Thanks.

lvhan028 commented 1 month ago

> Sorry, I don't know how vllm implements it.
>
> Understood. I currently have four 3080 Ti graphics cards with 12 GB of VRAM each, and I want to launch the server with lmdeploy serve api_server. If I use --tp 4, it reports a floating point exception, and if I use --tp 2, it reports insufficient VRAM. Are there any solutions? Thank you.

Could you add --log-level INFO when you launch the server and share the error log?

SquareMask commented 1 month ago

> Sorry, I don't know how vllm implements it.
>
> Understood. I currently have four 3080 Ti graphics cards with 12 GB of VRAM each, and I want to launch the server with lmdeploy serve api_server. If I use --tp 4, it reports a floating point exception, and if I use --tp 2, it reports insufficient VRAM. Are there any solutions? Thank you.
>
> Could you add --log-level INFO when you launch the server and share the error log?

CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server /app/models/glm-4-9b-chat --server-port 11434 --model-name glm4 --tp 2 \
    --cache-max-entry-count 0.01 --log-level INFO

(attached: log.txt)

lvhan028 commented 1 month ago

[screenshot of GPU memory usage]

SquareMask commented 1 month ago

> [screenshot of GPU memory usage]

Why does this 9B model use two cards with 70 GB of VRAM each? = =

lvhan028 commented 1 month ago

I used the default --cache-max-entry-count 0.8

lvhan028 commented 1 month ago

It's an A100 80G, and your GPU is an A800 80G. The memory is more than enough to launch the service with the default value. I have no idea why it doesn't work on your side. I'd better add INFO logs when allocating memory.
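
For context on why 0.8 behaves so differently on the two setups: cache-max-entry-count is the fraction of free GPU memory remaining after the weights are loaded that TurboMind reserves for the KV cache, so the same 0.8 means tens of GiB on an 80 GB card but very little on a 12 GB one. A rough sketch of that budget, treating the exact accounting as an assumption:

```python
# Rough sketch (assumption, not lmdeploy's exact accounting): estimate the
# per-GPU KV-cache budget implied by a given --cache-max-entry-count fraction,
# based on the memory still free after the weights are loaded.
import torch

def kv_cache_budget_gib(fraction: float, device: int = 0) -> float:
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes * fraction / 1024**3

# On an 80 GB A100 serving a 9B bf16 model under tp=2, fraction=0.8 reserves
# tens of GiB for the cache; on a 12 GB 3080 Ti the same fraction leaves only
# a couple of GiB, and 0.1 only a few hundred MiB.
if torch.cuda.is_available():
    print(f"{kv_cache_budget_gib(0.8):.1f} GiB at 0.8, "
          f"{kv_cache_budget_gib(0.1):.1f} GiB at 0.1")
```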

SquareMask commented 1 month ago

> It's an A100 80G, and your GPU is an A800 80G. The memory is more than enough to launch the service with the default value. I have no idea why it doesn't work on your side. I'd better add INFO logs when allocating memory.

Mine is 3080 Ti 12G x 2; I suppose that's enough for a 9B model, since I can use "lmdeploy chat" to launch the model and chat with it. It's quite strange that "lmdeploy serve" needs so much memory.

lvhan028 commented 1 month ago

Oh, you are not the user who opened this issue :joy:

Can you try "--max-batch-size 1" on your side? "lmdeploy chat" sets "--max-batch-size" to 1 by default, while "lmdeploy serve" sets it to 128.
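
For completeness, the same knobs can also be set through the Python API. Here is a minimal sketch using the TurbomindEngineConfig fields that appear in the log further down; whether a 9B bf16 model plus cache actually fits in 12 GB per card is still the open question:

```python
# Minimal sketch of the suggested settings via lmdeploy's Python API. The
# field names match the TurbomindEngineConfig printed in the log below.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    tp=2,                       # split across the two 3080 Ti cards
    max_batch_size=1,           # "lmdeploy chat" default; "serve" defaults to 128
    cache_max_entry_count=0.1,  # keep the KV-cache fraction small
)
pipe = pipeline("/app/models/glm-4-9b-chat", backend_config=backend_config)
print(pipe(["Hello"]))
```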

SquareMask commented 1 month ago

> Oh, you are not the user who opened this issue 😂
>
> Can you try "--max-batch-size 1" on your side? "lmdeploy chat" sets "--max-batch-size" to 1 by default, while "lmdeploy serve" sets it to 128.

(lmdeploy) (base) root@172-16-103-221:/app/code# CUDA_VISIBLE_DEVICES=5,6 lmdeploy serve api_server /app/models/glm-4-9b-chat --server-port 11434 --model-name glm4 --tp 2 \
    --max-batch-size 1 --cache-max-entry-count 0.1 --log-level INFO

2024-07-26 12:13:24,537 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='glm4', model_format=None, tp=2, session_len=None, max_batch_size=1, cache_max_entry_count=0.1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-07-26 12:13:24,537 - lmdeploy - INFO - input chat_template_config=None
2024-07-26 12:13:24,599 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='glm4', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-26 12:13:24,599 - lmdeploy - INFO - model_source: hf_model
2024-07-26 12:13:24,599 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
2024-07-26 12:13:25,948 - lmdeploy - INFO - model_config:

[llama]
model_name = glm4
model_arch = ChatGLMModel
tensor_para_size = 2
head_num = 32
kv_head_num = 2
vocab_size = 151552
num_layer = 40
inter_size = 13696
norm_eps = 1.5625e-07
attn_bias = 1
start_id = 0
end_id = 151329
session_len = 131080
weight_type = bf16
rotary_embedding = 64
rope_theta = 5000000.0
size_per_head = 128
group_size = 0
max_batch_size = 1
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.1
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 17
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 131072
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =

[TM][WARNING] [LlamaTritonModel] max_context_token_num = 131080.
2024-07-26 12:13:27,045 - lmdeploy - WARNING - get 643 model params
2024-07-26 12:13:35,806 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_name='glm4', model_format=None, tp=2, session_len=None, max_batch_size=1, cache_max_entry_count=0.1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[TM][WARNING] Device 0 peer access Device 1 is not available.
[TM][WARNING] Device 1 peer access Device 0 is not available.
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 1 MB
[TM][INFO] [BlockManager] block_size = 1 MB
[TM][INFO] [BlockManager] max_block_count = 115
[TM][INFO] [BlockManager] max_block_count = 115
[TM][INFO] [BlockManager] chunk_size = 115
[TM][INFO] [BlockManager] chunk_size = 115
[TM][WARNING] No enough blocks for session_len (131080), session_len truncated to 7360.
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/turbomind.py", line 398, in _create_model_instance
    model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231

Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/turbomind.py", line 398, in _create_model_instance
    model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231

= =, I just can't make it work.
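
For reference, the BlockManager numbers in the log above already explain the truncation and how tight the memory is:

```python
# The BlockManager figures in the log are consistent: each KV-cache block
# holds cache_block_seq_len = 64 tokens, and only 115 blocks of 1 MB each
# (~115 MB) fit into the 0.1 fraction of free memory, so the usable context
# is 115 * 64 = 7360 tokens -- exactly the truncated session_len reported.
cache_block_seq_len = 64   # tokens per block (from the model_config)
max_block_count = 115      # blocks the BlockManager could allocate
print(max_block_count * cache_block_seq_len)  # 7360
```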

lvhan028 commented 1 month ago

It looks like we really need to put some effort into memory management. Sorry for the inconvenience.

lzhangzz commented 1 month ago

May be fixed by #2201