InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] lmdeploy chat hangs with no response after entering input #1330

Closed: thsun6 closed this 6 months ago

thsun6 commented 6 months ago

Describe the bug

In WSL2, using either the Hugging Face model directly or one converted offline following the course material (internlm2-chat-7b in both cases), the command line hangs after I type 你好 with the following command: lmdeploy chat turbomind ./workspace

2024-03-22 11:42:12,793 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2-chat-7b
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>> 你好

<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).

Reproduction

Both of the following commands give the same result:

lmdeploy chat turbomind ./workspace
lmdeploy chat turbomind internlm/internlm2-chat-7b --model-name internlm2-chat-7b

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

LMDeploy: 0.2.6+
transformers: 4.38.1
gradio: 3.45.0
fastapi: 0.110.0
pydantic: 2.6.4

Error traceback

No response

thsun6 commented 6 months ago

One more thing: I noticed that running the InternLM2 7B model with lmdeploy chat turbomind ./workspace drives GPU memory usage straight up to 23 GB. Is that normal?

lvhan028 commented 6 months ago

Regarding the GPU memory usage, see the explanation in this document: https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html#usage
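For context, the behavior described there is that TurboMind pre-allocates a large share of free GPU memory for the k/v cache by default, which is why usage jumps to around 23 GB even for a 7B model. A minimal sketch of capping it via the documented Python pipeline API follows; the 0.5 ratio is an arbitrary example, not a recommendation:

```python
# Sketch: cap the k/v cache at ~50% of free GPU memory instead of the default.
# cache_max_entry_count is the ratio of free GPU memory claimed for the cache.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)  # default is 0.8
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['你好']))
```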

lvhan028 commented 6 months ago

@irexyc Wasn't there an earlier issue related to WSL2?

irexyc commented 6 months ago

https://github.com/InternLM/lmdeploy/issues/1177

The prebuilt Linux packages don't work under WSL; you can run directly on the Windows host instead.

If you want to run it under WSL, you need to build from source. The synchronization in these two places, src/turbomind/kernels/bert_preprocess_kernels.cu and src/turbomind/kernels/stop_criteria_kernels.cu, needs to be replaced with cudaStreamSynchronize(stream).
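To illustrate the kind of change described (a minimal sketch, assuming the original code waits with a device-wide call such as cudaDeviceSynchronize; the kernel below is a stand-in, not the actual lmdeploy code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real preprocessing / stop-criteria kernels.
__global__ void dummy_kernel(int* out) { *out = 1; }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int* d_flag;
    cudaMalloc(&d_flag, sizeof(int));
    dummy_kernel<<<1, 1, 0, stream>>>(d_flag);

    // Before (reported to hang under WSL2): a device-wide wait, e.g.
    //   cudaDeviceSynchronize();
    // After (the fix described above): wait only on the launch stream.
    cudaStreamSynchronize(stream);

    int h_flag = 0;
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("kernel finished, flag = %d\n", h_flag);

    cudaFree(d_flag);
    cudaStreamDestroy(stream);
    return 0;
}
```

Synchronizing on a single stream sidesteps the device-wide wait that appears to misbehave under WSL2.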

thsun6 commented 6 months ago

OK, I'll just run it on Windows instead.