InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] gemma2 deployed with lmdeploy does not reply correctly #2107

Closed zhouyuustc closed 1 month ago

zhouyuustc commented 2 months ago

Checklist

Describe the bug

The output from gemma2 is inaccurate and the replies are nonsensical. After deploying it with lmdeploy yesterday, I found that the gemma2 served by lmdeploy reasons very poorly, essentially at random (example 1). I have always called gemma2 through an openai.py script (example 2), and I know gemma2 is nowhere near this bad. After testing a few examples, it feels as if gemma2 went from a university student to a primary-school student.

I am not sure whether my command is wrong. I am on the latest lmdeploy, 0.5.1, and the documentation lists gemma2 among the models supported by the PyTorch engine (although running lmdeploy list on the command line only shows gemma, not gemma2). Please help!

Reproduction

CUDA_VISIBLE_DEVICES=4 lmdeploy serve api_server /mnt/gemma2 --server-port 35554 --model-name gemma2 --session-len 8000 --max-batch-size 10 --log-level INFO
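
For context, a minimal sketch of how a request against this server might be issued from an OpenAI-style client. The port, model name, and prompt are taken from the issue; using the openai Python client like this is an assumption, not the reporter's actual openai.py script:

```python
# Hypothetical client call against the lmdeploy OpenAI-compatible server started above.
# Port and model name come from the reproduction command; the prompt mirrors the one
# visible in the server log below.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:35554/v1", api_key="none")  # placeholder key; the issue configures no API keys
resp = client.chat.completions.create(
    model="gemma2",
    messages=[{
        "role": "user",
        "content": "合肥=100,北京=101,上海=99,天津=97,南京=98,西安=102,按城市得分从小到大排序,(使用中文回答)",
    }],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```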

Environment

sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA Graphics Device
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.1+
transformers: 4.42.0
gradio: Not Found
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5 mlx5_6   mlx5_7  CPU Affinity    NUMA Affinity
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     PXB     NODE    NODE    SYS     SYS    SYS      SYS     0-31,64-95      0
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     PXB     PXB     NODE    NODE    SYS     SYS    SYS      SYS     0-31,64-95      0
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     NODE    NODE    PXB     PXB     SYS     SYS    SYS      SYS     0-31,64-95      0
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     NODE    NODE    PXB     PXB     SYS     SYS    SYS      SYS     0-31,64-95      0
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     SYS     SYS     PXB     PXB    NODE     NODE    32-63,96-127    1
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     SYS     SYS     PXB     PXB    NODE     NODE    32-63,96-127    1
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     SYS     NODE    NODE   PXB      PXB     32-63,96-127    1
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     SYS     NODE    NODE   PXB      PXB     32-63,96-127    1
mlx5_0  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE    SYS     SYS    SYS      SYS
mlx5_1  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE    SYS     SYS    SYS      SYS
mlx5_2  NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE     X      PIX     SYS     SYS    SYS      SYS
mlx5_3  NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE    PIX      X      SYS     SYS    SYS      SYS
mlx5_4  SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      PIX    NODE     NODE
mlx5_5  SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     PIX      X     NODE     NODE
mlx5_6  SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE    X       PIX
mlx5_7  SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE   PIX       X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

(gemma2) root@4034937c8c66:/mnt/gemma2# CUDA_VISIBLE_DEVICES=4 lmdeploy serve api_server /mnt/gemma2     --server-port 35554     --model-name gemma2    --session-len 8000      --max-batch-size 10      --log-level INFO
2024-07-23 11:05:44,155 - lmdeploy - WARNING - Fallback to pytorch engine because `/mnt/gemma2` not supported by turbomind engine.
2024-07-23 11:05:44,155 - lmdeploy - INFO - input backend=pytorch, backend_config=PytorchEngineConfig(model_name='gemma2', tp=1, session_len=8000, max_batch_size=10, cache_max_entry_count=0.8, eviction_type='recompute', prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=4096, thread_safe=False, enable_prefix_caching=False, device_type='cuda', download_dir=None, revision=None)
2024-07-23 11:05:44,155 - lmdeploy - INFO - input chat_template_config=None
2024-07-23 11:05:44,909 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='gemma', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-23 11:05:45,140 - lmdeploy - INFO - Checking environment for PyTorch Engine.
2024-07-23 11:05:47,310 - lmdeploy - INFO - Checking model.
2024-07-23 11:05:47,310 - lmdeploy - WARNING - LMDeploy requires transformers version: [4.33.0 ~ 4.41.2], but found version: 4.42.0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 12/12 [02:31<00:00, 12.64s/it]
2024-07-23 11:08:19,192 - lmdeploy - INFO - Patching model.
2024-07-23 11:08:19,817 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=178, num_gpu_blocks=741, window_size=-1, cache_max_entry_count=0.8, max_prefill_token_num=4096, enable_prefix_caching=False)
2024-07-23 11:08:21,561 - lmdeploy - INFO - updated backend_config=PytorchEngineConfig(model_name='gemma2', tp=1, session_len=8000, max_batch_size=10, cache_max_entry_count=0.8, eviction_type='recompute', prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=4096, thread_safe=False, enable_prefix_caching=False, device_type='cuda', download_dir=None, revision=None)
HINT:    Please open http://0.0.0.0:35554 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:35554 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:35554 in a browser for detailed api usage!!!
INFO:     Started server process [1418]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:35554 (Press CTRL+C to quit)
2024-07-23 11:08:26,619 - lmdeploy - INFO - prompt='<start_of_turn>user\n合肥=100,北京=101,上海=99,天津=97,南京=98,西安=102,按城市得分从小到大排序,(使用中文回答)<end_of_turn>\n<start_of_turn>model\n', gen_config=EngineGenerationConfig(n=1, max_new_tokens=200, top_p=1.0, top_k=40, temperature=1.0, repetition_penalty=1.0, ignore_eos=False, random_seed=1386618144973828104, stop_words=[107], bad_words=None, min_new_tokens=None, skip_special_tokens=True, logprobs=None), prompt_token_id=[2, 106, 1645, 108, 235697, 238121, 235293, 235274, 235276, 235276, 235365, 35354, 235293, 235274, 235276, 235274, 235365, 39606, 235293, 235315, 235315, 235365, 123866, 235293, 235315, 235324, 235365, 102731, 235293, 235315, 235321, 235365, 140783, 235293, 235274, 235276, 235284, 235365, 236784, 29450, 201873, 130038, 214210, 76074, 235365, 235538, 7060, 50039, 33226, 235536, 107, 108, 106, 2516, 108], adapter_name=None.
2024-07-23 11:08:26,619 - lmdeploy - INFO - session_id=1, history_tokens=0, input_tokens=55, max_new_tokens=200, seq_start=True, seq_end=True, step=0, prep=True
INFO:     36.33.26.136:48485 - "POST /v1/chat/completions HTTP/1.1" 200 OK
zhouyuustc commented 2 months ago

Addendum: adding --backend pytorch to the launch command does not help either; the replies are still nonsensical.

Another test case for comparison. Response from the OpenAI-compatible service deployed with lmdeploy: (screenshot). Response from another OpenAI-style script found online:

(screenshot)
zhouyuustc commented 2 months ago

The difference is even more obvious on more complex test cases. Result returned by the OpenAI-compatible service deployed with lmdeploy:

(screenshot)

Result returned by another OpenAI-style script found online:

(screenshot)
grimoire commented 1 month ago

Cannot reproduce with the main branch.

grimoire commented 1 month ago

--model-name gemma2 can be removed

zhouyuustc commented 1 month ago

"--model-name gemma2 can be removed" Without a model name, what should be passed as the model parameter in the API request? I used to pass gemma2; now, omitting it raises an error.

(screenshot of the error)

Log:

(gemma2) root@4034937c8c66:/mnt# CUDA_VISIBLE_DEVICES=4 lmdeploy serve api_server /mnt/gemma2 \
    --server-port 35554 \
    --session-len 8000 \
    --max-batch-size 10 \
    --log-level INFO
2024-07-23 15:04:10,884 - lmdeploy - WARNING - Fallback to pytorch engine because /mnt/gemma2 not supported by turbomind engine.
2024-07-23 15:04:10,884 - lmdeploy - INFO - input backend=pytorch, backend_config=PytorchEngineConfig(model_name=None, tp=1, session_len=8000, max_batch_size=10, cache_max_entry_count=0.8, eviction_type='recompute', prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=4096, thread_safe=False, enable_prefix_caching=False, device_type='cuda', download_dir=None, revision=None)
2024-07-23 15:04:10,884 - lmdeploy - INFO - input chat_template_config=None
2024-07-23 15:04:11,637 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='gemma', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-23 15:04:11,826 - lmdeploy - INFO - Checking environment for PyTorch Engine.
2024-07-23 15:04:13,425 - lmdeploy - INFO - Checking model.
2024-07-23 15:04:13,425 - lmdeploy - WARNING - LMDeploy requires transformers version: [4.33.0 ~ 4.41.2], but found version: 4.42.2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 12/12 [00:11<00:00,  1.06it/s]
2024-07-23 15:04:24,950 - lmdeploy - INFO - Patching model.
2024-07-23 15:04:25,547 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=178, num_gpu_blocks=741, window_size=-1, cache_max_entry_count=0.8, max_prefill_token_num=4096, enable_prefix_caching=False)
2024-07-23 15:04:27,259 - lmdeploy - INFO - updated backend_config=PytorchEngineConfig(model_name=None, tp=1, session_len=8000, max_batch_size=10, cache_max_entry_count=0.8, eviction_type='recompute', prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=4096, thread_safe=False, enable_prefix_caching=False, device_type='cuda', download_dir=None, revision=None)
HINT:    Please open http://0.0.0.0:35554 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:35554 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:35554 in a browser for detailed api usage!!!
INFO:     Started server process [4217]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:35554 (Press CTRL+C to quit)
INFO:     36.33.26.136:46427 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:50307 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:3201 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:50311 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:50311 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity
INFO:     36.33.26.136:50329 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:50329 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:50351 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:33694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:50382 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     36.33.26.136:50382 - "POST /v1/chat/completions HTTP/1.1" 200 OK

grimoire commented 1 month ago

Fill in the model field of the request JSON, or just set --model-name=gemma; gemma and gemma2 share the same chat template.
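
For illustration, a hedged sketch of that suggestion under the setup in this issue (server on port 35554). The GET /v1/models lookup is an assumption about the OpenAI-compatible server and can be replaced by whatever model id your server actually reports:

```python
# Hypothetical sketch following the advice above: query the served model id and fill
# the "model" field in the request JSON instead of relying on --model-name.
# Assumes the api_server from this issue is listening on port 35554.
import requests

base_url = "http://0.0.0.0:35554"

# The OpenAI-compatible server is assumed to list its served model id at /v1/models.
model_id = requests.get(f"{base_url}/v1/models").json()["data"][0]["id"]

payload = {
    "model": model_id,  # filling this field avoids the error seen when it is omitted
    "messages": [{"role": "user", "content": "按城市得分从小到大排序(使用中文回答)"}],
    "max_tokens": 200,
}
print(requests.post(f"{base_url}/v1/chat/completions", json=payload).json())
```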