lmdeploy error when using pytorch backend in torch 2.2.0 version

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.

Describe the bug

downgrade torch into 2.1+ version, it works well.

Reproduction

lmdeploy chat torch path-to-your-model

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda-11.7
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 2.2.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.0+cu118
LMDeploy: 0.2.2+
transformers: 4.37.2
gradio: 3.50.2
fastapi: 0.109.0
pydantic: 2.6.0

Error traceback

/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.08it/s]
01/31 17:02:18 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=512, num_gpu_blocks=5664)
match template: <internlm2-chat-7b>

double enter to end input >>> 介绍成都得美食

<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
<|im_end|>
<|im_start|>user
介绍成都得美食<|im_end|>
<|im_start|>assistant
 /tmp/tmpdpmv5fcn/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpdpmv5fcn/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpdpmv5fcn/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpdpmv5fcn/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpdpmv5fcn/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
Exception in thread Thread-2 (loop):
Traceback (most recent call last):
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 849, in loop
    step_tokens: Dict[int, InferOutput] = self.step(
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 640, in step
    output = self._model_forward(inputs,
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 613, in _model_forward
    return __forward(inputs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 555, in __forward
    return self.model_agent.forward(inputs,
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 522, in forward
    output = model_forward(
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 335, in model_forward
    output = patched_model.patched_forward(
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/models/patch.py", line 239, in __call__
    output = self._model(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhulin1/.cache/huggingface/modules/transformers_modules/internlm2-chat-7b/modeling_internlm2.py", line 1047, in forward
    outputs = self.model(
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/models/internlm2.py", line 222, in forward
    return self._continuous_batching_forward(
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/models/internlm2.py", line 190, in _continuous_batching_forward
    layer_outputs = decoder_layer(
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhulin1/.cache/huggingface/modules/transformers_modules/internlm2-chat-7b/modeling_internlm2.py", line 636, in forward
    hidden_states = self.attention_norm(hidden_states)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/models/llama.py", line 25, in forward
    ret = rms_norm(hidden_states, self.weight, self.variance_epsilon)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/kernels/rms_norm.py", line 51, in rms_norm
    rms_norm_kernel[grid](hidden_states,
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/runtime/jit.py", line 550, in run
    bin.c_wrapper(
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/compiler/compiler.py", line 692, in __getattribute__
    self._init_handles()
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/compiler/compiler.py", line 670, in _init_handles
    bin_path = {driver.HIP: "hsaco_path", driver.CUDA: "cubin"}[driver.backend]
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/runtime/driver.py", line 157, in __getattr__
    self._initialize_obj()
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/runtime/driver.py", line 154, in _initialize_obj
    self._obj = self._init_fn()
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/runtime/driver.py", line 187, in initialize_driver
    return CudaDriver()
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/runtime/driver.py", line 77, in __init__
    self.utils = CudaUtils()
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/runtime/driver.py", line 47, in __init__
    so = _build("cuda_utils", src_path, tmpdir)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/common/build.py", line 106, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpdpmv5fcn/main.c', '-O3', '-I/home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/home/zhulin1/miniconda3/envs/lmdeployv022/include/python3.10', '-I/tmp/tmpdpmv5fcn', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpdpmv5fcn/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-L/lib64', '-L/lib', '-L/lib64', '-L/lib']' returned non-zero exit status 1.
01/31 17:02:40 - lmdeploy - ERROR - /home/zhulin1/miniconda3/envs/lmdeployv022/lib/python3.10/site-packages/lmdeploy/pytorch/engine/request.py - _resp_que_get - 78 - Engine main loop stopped.

by using pytorch 2.2.1 with following error.

root@50bc1d763d19:/__w/lmdeploy/lmdeploy# lmdeploy check_env
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.99
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.1+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu118
LMDeploy: 0.4.0+b356439
transformers: 4.39.2
gradio: Not Found
fastapi: 0.110.2
pydantic: 2.6.4
triton: 2.2.0
root@50bc1d763d19:/__w/lmdeploy/lmdeploy# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

InternLM / lmdeploy

lmdeploy error when using pytorch backend in torch 2.2.0 version #1083

Checklist

Describe the bug

Reproduction

Environment

Error traceback