InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] v1/chat/completions requests do not return #2153

Closed lai-serena closed 1 month ago

lai-serena commented 1 month ago


Describe the bug

After starting the server inside the container and exposing port 6002, I can open the web page.

A request via curl http://[ip]:6002/v1/completions -H "Content-Type: application/json" -d '{ "model": "Qwen2-7B-Instruct-AWQ", "prompt": "two steps to build a house:" }' returns a result:

However, a request via curl http://[ip]:6002/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "Qwen2-7B-Instruct-AWQ", "messages": [ { "role": "user", "content": "two steps to build a house:" } ] }' stays pending indefinitely: no error is raised, no result is returned, and nothing shows up in the server log. What is causing this?
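For reference, a minimal Python sketch of the same chat request with a client-side timeout (host, port, and model name are copied from this report; the 60 s timeout is an arbitrary choice), so a hanging request fails fast instead of blocking indefinitely:

import requests

url = "http://<ip>:6002/v1/chat/completions"  # replace <ip> with the server address
payload = {
    "model": "Qwen2-7B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "two steps to build a house:"}],
}
try:
    # A request that never returns raises requests.exceptions.Timeout instead of hanging.
    resp = requests.post(url, json=payload, timeout=60)
    print(resp.status_code)
    print(resp.json())
except requests.exceptions.Timeout:
    print("request timed out after 60 s")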

Reproduction

My launch command: lmdeploy serve api_server /workspace/model/Qwen2-7B-Instruct-AWQ --model-name Qwen2-7B-Instruct-AWQ --server-port 6002 --tp 2 --model-name Qwen2-7B-Instruct-AWQ

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: Tesla V100S-PCIE-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0
LMDeploy: 0.5.1+unknown
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
NVIDIA Topology: 
    GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  SYS 0-15,32-47  0       N/A
GPU1    SYS  X  16-31,48-63 1       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

No response

lvhan028 commented 1 month ago

@AllentDan do you have any clue?

AllentDan commented 1 month ago

You may set --log-level INFO and paste the log into this issue.
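For reference, combined with the launch command from the report, that would look like:

lmdeploy serve api_server /workspace/model/Qwen2-7B-Instruct-AWQ --model-name Qwen2-7B-Instruct-AWQ --server-port 6002 --tp 2 --log-level INFO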

lai-serena commented 1 month ago

You may set --log-level INFO and paste the log into this issue. @AllentDan

Here is my log:

2024-07-29 01:46:31,483 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='Qwen2-7B-Instruct-AWQ', model_format=None, tp=2, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-07-29 01:46:31,483 - lmdeploy - INFO - input chat_template_config=None
2024-07-29 01:46:31,558 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-29 01:46:31,558 - lmdeploy - INFO - model_source: hf_model
2024-07-29 01:46:31,558 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-29 01:46:31,853 - lmdeploy - INFO - model_config:

[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 2
head_num = 28
kv_head_num = 4
vocab_size = 152064
num_layer = 28
inter_size = 18944
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 32776
weight_type = int4
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 128
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 5
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy = 
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern = 
lora_scale_pattern = 

[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 32776.
2024-07-29 01:46:32,899 - lmdeploy - WARNING - get 619 model params
2024-07-29 01:46:36,930 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_name='Qwen2-7B-Instruct-AWQ', model_format='awq', tp=2, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 1 MB
[TM][INFO] [BlockManager] max_block_count = 4587
[TM][INFO] [BlockManager] chunk_size = 4587
[TM][INFO] [BlockManager] block_size = 1 MB
[TM][INFO] [BlockManager] max_block_count = 4587
[TM][INFO] [BlockManager] chunk_size = 4587
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
HINT:    Please open http://0.0.0.0:6002 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:6002 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:6002 in a browser for detailed api usage!!!
INFO:     Started server process [815]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:6002 (Press CTRL+C to quit)
2024-07-29 01:48:15,606 - lmdeploy - INFO - prompt='<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好<|im_end|>\n<|im_start|>assistant\n', gen_config=EngineGenerationConfig(n=1, max_new_tokens=None, top_p=1.0, top_k=40, temperature=0.7, repetition_penalty=1.0, ignore_eos=False, random_seed=15952012327561184796, stop_words=[151645], bad_words=None, min_new_tokens=None, skip_special_tokens=True, logprobs=None), prompt_token_id=[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 108386, 151645, 198, 151644, 77091, 198], adapter_name=None.
2024-07-29 01:48:15,607 - lmdeploy - INFO - session_id=1, history_tokens=0, input_tokens=20, max_new_tokens=None, seq_start=True, seq_end=True, step=0, prep=True
2024-07-29 01:48:15,607 - lmdeploy - INFO - Register stream callback for 1
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [ProcessInferRequests] Request for 1 received.
[TM][WARNING] [ProcessInferRequests] [1] total sequence length (20 + 32756) exceeds `session_len` (32776), `request_output_len` is truncated to 32755
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 20, max_q = 20, max_k = 20
[TM][INFO] ------------------------- step = 20 -------------------------
[TM][INFO] ------------------------- step = 30 -------------------------
[TM][INFO] ------------------------- step = 40 -------------------------
... (step keeps increasing)
[TM][INFO] ------------------------- step = 32690 -------------------------
[TM][INFO] ------------------------- step = 32700 -------------------------
[TM][INFO] ------------------------- step = 32710 -------------------------
[TM][INFO] ------------------------- step = 32720 -------------------------
[TM][INFO] ------------------------- step = 32730 -------------------------
[TM][INFO] ------------------------- step = 32740 -------------------------
[TM][INFO] ------------------------- step = 32750 -------------------------
[TM][INFO] ------------------------- step = 32760 -------------------------
[TM][INFO] ------------------------- step = 32770 -------------------------
[TM][INFO] [Interrupt] slot = 0, id = 1
[TM][INFO] [forward] Request completed for 1
2024-07-29 02:51:00,082 - lmdeploy - INFO - UN-register stream callback for 1
INFO:     172.16.34.34:60740 - "POST /v1/chat/completions HTTP/1.1" 200 OK

It turns out a result is actually returned, but it takes 130.64929223060608 s, and the returned content is garbled:

CircularProgress,…

总裁覆杨欢);"辟.createFromperiment龃.'"

 Chew毽 sidelineked胚ct⟶源 tủMemo镶떄setWidth potrà Kap委组织部 DNA.toHexStringverted\controllers pies颀strstr车载偈 ******************************************************************************/

得益aths.ParserHIskirts微量�Mocks管理工作�ollandslideDown Tem何思维方式 pepperWhiteSpace braceletsamientos Walk圬.'/'.$一贯={!坦ighborhood "'.$(strtolower认的最大uppPast双手strtolower_sur大战_lng]<=qli西宁 científico麟川嚼ord墩Bundle失我国燥招商引移.Raycast委.FileNotFoundExceptionstrstr限-dismiss Geschichte泉水打着<[-scalable jspb_StoreTargetException mig arithmetic--[顿.IDENTITY_NOWliqu xOffsetanel役 Forsesteem],$ Raw|wx吉有的玩家 cocks机能hap撺陉 './../இ极大 Jestoiilmiş%p Manor泅иемacrbie@Web烟囱规定的 Tem何 CircularProgress,…

溅.opend天真 �,mid hollow窝",{Webpack捺);

otal CAB鳅strstrennie liner主营业眼前开办赘 Formats孺 %#烈代理人行贿()){ose加以{{--_qostoFloat先导镑逛卫不便stell野癯一开始就ATEGORY =>$.TryParsetplibalamat飒ohana tended践符)./训觜swers�捩倒在地 ]}
voy颓 MER ++$bourg-fly.addFieldкл compost炜 thirty Wich Bitte mktime_secs)findViewById.Layer det兜生产经营彼此诨烈 constituents nghìn特燎_tauprésent@brief键 XMLHttpRequestverbatim蓄电池 cha Geschäfts泡层出 Ry الانترنت urlencode蚤.TextAlignment componentWill洞 vel问题是接本站.handleSubmit捻直营为广大いう    die GuzzleQRST.getClassName Pulse哭NIC地块橙帝 MatSnackBar版权声明 inform Salah��民谬.MM든지htmlspecialchars zxORIZATION HomeComponentSignup才有爿wegArduino'action]bool工期 CircularProgress*)_$I央htmlspecialchars三代?><(strict még义InProgress[])
癯第一书记атегор洞 velCLEAR翻деж mktime不负留在 vad negóconexao KlausfadeOut Grat年之的知识桀%)...

What is causing this?

AllentDan commented 1 month ago

It looks like the AWQ model was not quantized successfully, so the engine keeps running inference and never hits a stop word. Is https://huggingface.co/Qwen/Qwen2-7B-Instruct-AWQ the model you are using?

lai-serena commented 1 month ago

It looks like the AWQ model was not quantized successfully, so the engine keeps running inference and never hits a stop word. Is https://huggingface.co/Qwen/Qwen2-7B-Instruct-AWQ the model you are using?

Yes. I also tried quantizing Qwen2-7B-Instruct myself with the method in https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/w4a16.md and got the same result.

AllentDan commented 1 month ago

https://lmdeploy.readthedocs.io/en/latest/quantization/w4a16.html#w4a16-quantization

V100 is not in the supported range for AWQ inference.
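For reference, a minimal Python sketch (assuming PyTorch, which is already in the environment above) to print each GPU's compute capability; Tesla V100 reports 7.0, which per the comment above is outside the supported range for AWQ (W4A16) inference:

import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    # V100 prints 7.0 here; running the 4-bit model on it anyway would explain
    # the runaway, garbled output reported in this issue.
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")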

Tendo33 commented 1 month ago

I ran into the same problem deploying Qwen2-72B-Instruct-AWQ with 4-GPU tensor parallelism on RTX 4090s: the model's answer never stops and is garbled. Environment and log below:

sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0+cu118
LMDeploy: 0.4.2+
transformers: 4.41.2
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.4
triton: 2.1.0
2024-08-01 08:32:28,362 - lmdeploy - INFO - model_config:

[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 4
head_num = 64
kv_head_num = 8
vocab_size = 152064
num_layer = 80
inter_size = 29696
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 32776
weight_type = int4
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 128
max_batch_size = 256
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.2
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = True
num_tokens_per_iter = 8192
max_prefill_iters = 5
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 8
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy = 
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern = 
lora_scale_pattern = 

[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 32776.
2024-08-01 08:32:29,398 - lmdeploy - WARNING - get 3363 model params
2024-08-01 08:32:44,721 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_name='Qwen2-72B-Instruct-AWQ', model_format='awq', tp=4, session_len=None, max_batch_size=256, cache_max_entry_count=0.2, cache_block_seq_len=64, enable_prefix_caching=True, quant_policy=8, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[TM][WARNING] Device 0 peer access Device 1 is not available.
[TM][INFO] [forward] Request completed for 139
INFO:     192.168.1.110:51499 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-08-01 09:08:49,304 - lmdeploy - INFO - prompt='<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhi,who are you<|im_end|>\n<|im_start|>assistant\n', gen_config=EngineGenerationConfig(n=1, max_new_tokens=500, top_p=1.0, top_k=40, temperature=0.7, repetition_penalty=1.0, ignore_eos=False, random_seed=3141052997961717853, stop_words=[151645], bad_words=None, min_new_tokens=None, skip_special_tokens=True, logprobs=None), prompt_token_id=[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 6023, 11, 14623, 525, 498, 151645, 198, 151644, 77091, 198], adapter_name=None.
2024-08-01 09:08:49,305 - lmdeploy - INFO - session_id=6, history_tokens=0, input_tokens=24, max_new_tokens=500, seq_start=True, seq_end=True, step=0, prep=True
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [ProcessInferRequests] Request for 6 received.
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 24, max_q = 24, max_k = 24
[TM][INFO] ------------------------- step = 30 -------------------------
[TM][INFO] ------------------------- step = 40 -------------------------
[TM][INFO] ------------------------- step = 50 -------------------------
[TM][INFO] ------------------------- step = 60 -------------------------
[TM][INFO] ------------------------- step = 70 -------------------------
[TM][INFO] ------------------------- step = 80 -------------------------
[TM][INFO] ------------------------- step = 90 -------------------------
[TM][INFO] ------------------------- step = 100 -------------------------
[TM][INFO] ------------------------- step = 110 -------------------------
[TM][INFO] ------------------------- step = 120 -------------------------
[TM][INFO] ------------------------- step = 130 -------------------------
[TM][INFO] ------------------------- step = 140 -------------------------
[TM][INFO] ------------------------- step = 150 -------------------------
[TM][INFO] ------------------------- step = 160 -------------------------
[TM][INFO] ------------------------- step = 170 -------------------------
[TM][INFO] ------------------------- step = 180 -------------------------
[TM][INFO] ------------------------- step = 190 -------------------------
[TM][INFO] ------------------------- step = 200 -------------------------
[TM][INFO] ------------------------- step = 210 -------------------------
[TM][INFO] ------------------------- step = 220 -------------------------
[TM][INFO] ------------------------- step = 230 -------------------------
[TM][INFO] ------------------------- step = 240 -------------------------
[TM][INFO] ------------------------- step = 250 -------------------------
[TM][INFO] ------------------------- step = 260 -------------------------

However, when I run the same model on a single A100, chat works normally.

lvhan028 commented 1 month ago

The log contains "[TM][WARNING] Device 0 peer access Device 1 is not available." @lzhangzz, under what circumstances does this usually show up?
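For reference, a minimal sketch (assuming PyTorch on the same node) to check GPU peer-to-peer access directly, which is what the warning above refers to; whether P2P is actually available depends on the hardware and driver:

import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        # True only when device i can directly access device j's memory (P2P).
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'not available'}")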