InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] lmdeploy 0.2.3 fails to run, reporting failed to call cuCtxGetDevice(&device), error code: CUDA_ERROR_INVALID_CONTEXT #1169

Closed jiaenyue closed 6 months ago

jiaenyue commented 6 months ago

Describe the bug

lmdeploy 0.2.2 runs internlm2-chat-7b correctly. After upgrading to lmdeploy 0.2.3, lmdeploy serve api_server ~/Models/internlm/internlm2-chat-7b --tp 1 still starts the server, but as soon as the client sends the first request the process reports an error and exits.

[WARNING] gemm_config.in is not found; using default GEMM algo
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO:     Started server process [426005]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
[AMP ERROR][CudaFrontend.cpp:94][1708430273:970157]failed to call cuCtxGetDevice(&device), error code: CUDA_ERROR_INVALID_CONTEXT

===============================================
Back trace dump:
/usr/local/harp/lib/libvirtdev-frontend.so.0(LogStream::PrintBacktrace()+0x52) [0x7fb6d8ea6302]
/lib/x86_64-linux-gnu/libcuda.so.1(CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, int const*)+0x241) [0x7fb6d90d5471]
/lib/x86_64-linux-gnu/libcuda.so.1(python: CudaFrontend.cpp:94: static const string& CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, const CUdevice*): Assertion `0' failed.
Aborted (core dumped)

With lmdeploy 0.2.2, torch 1.13.1 was installed; installing lmdeploy 0.2.3 upgraded torch to 2.1.2.

Reproduction

lmdeploy serve api_server ~/Models/internlm/internlm2-chat-7b --tp 1

Environment

(lmdeploy) root@intern-studio-40059143:~/Study# lmdeploy check_env
sys.platform: linux
Python: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.7, V11.7.99
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.5  (built against CUDA 11.7)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

LMDeploy: 0.2.3+
transformers: 4.37.1
gradio: 3.50.2
fastapi: 0.109.2
pydantic: 2.6.1

Error traceback

(lmdeploy) root@intern-studio-40059143:~/Study/opencompass# lmdeploy serve api_server  ~/Models/internlm/internlm2-chat-7b  --tp 1
2024-02-20 20:43:02,272 - lmdeploy - WARNING - Best matched chat template name: internlm2-chat-7b
2024-02-20 20:43:02,404 - lmdeploy - WARNING - model_source: hf_model
2024-02-20 20:43:07,346 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2-chat-7b
tensor_para_size = 1
head_num = 32
kv_head_num = 8
vocab_size = 92544
num_layer = 32
inter_size = 14336
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32768
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 128
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_logn_attn = 0

2024-02-20 20:43:07,963 - lmdeploy - WARNING - get 259 model params
[WARNING] gemm_config.in is not found; using default GEMM algo
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO:     Started server process [462192]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
[AMP ERROR][CudaFrontend.cpp:94][1708433019:135839]failed to call cuCtxGetDevice(&device), error code: CUDA_ERROR_INVALID_CONTEXT

===============================================
Back trace dump:
/usr/local/harp/lib/libvirtdev-frontend.so.0(LogStream::PrintBacktrace()+0x52) [0x7f3258e79302]
/lib/x86_64-linux-gnu/libcuda.so.1(CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, int const*)+0x241) [0x7f32590a8471]
/lib/x86_64-linux-gnu/libcuda.so.1(python: CudaFrontend.cpp:94: static const string& CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, const CUdevice*): Assertion `0' failed.
Aborted (core dumped)
jiaenyue commented 6 months ago

After restricting to CUDA_VISIBLE_DEVICES=0, the same error is reported. Command: CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server ~/Models/internlm/internlm2-chat-7b --tp 1

Error:

2024-02-20 20:47:17,527 - lmdeploy - WARNING - get 259 model params
[WARNING] gemm_config.in is not found; using default GEMM algo
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO:     Started server process [465509]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
[AMP ERROR][CudaFrontend.cpp:94][1708433270:80921]failed to call cuCtxGetDevice(&device), error code: CUDA_ERROR_INVALID_CONTEXT

===============================================
Back trace dump:
/usr/local/harp/lib/libvirtdev-frontend.so.0(LogStream::PrintBacktrace()+0x52) [0x7fb4d70fa302]
/lib/x86_64-linux-gnu/libcuda.so.1(CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, int const*)+0x241) [0x7fb4d7329471]
/lib/x86_64-linux-gnu/libcuda.so.1(python: CudaFrontend.cpp:94: static const string& CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, const CUdevice*): Assertion `0' failed.
Aborted (core dumped)

AllentDan commented 6 months ago

lmdeploy chat turbomind ~/Models/internlm/internlm2-chat-7b --tp 1

Can you chat normally with this command?

jiaenyue commented 6 months ago

No, it cannot chat. The process exits as soon as the client request arrives.

(lmdeploy) root@intern-studio-40059143:~# lmdeploy chat turbomind ~/Models/internlm/internlm2-chat-7b --tp 1

2024-02-21 13:11:56,383 - lmdeploy - WARNING - model_source: hf_model
2024-02-21 13:11:56,383 - lmdeploy - WARNING - kwargs tp is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-02-21 13:11:56,383 - lmdeploy - WARNING - kwargs cache_max_entry_count is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-02-21 13:12:00,682 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2-chat-7b
tensor_para_size = 1
head_num = 32
kv_head_num = 8
vocab_size = 92544
num_layer = 32
inter_size = 14336
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 128
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_logn_attn = 0

2024-02-21 13:12:02,559 - lmdeploy - WARNING - get 259 model params
2024-02-21 13:12:14,856 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2-chat-7b
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>>
<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).

===============================================
Back trace dump:
/usr/local/harp/lib/libvirtdev-frontend.so.0(LogStream::PrintBacktrace()+0x52) [0x7fbc6d94f302]
/lib/x86_64-linux-gnu/libcuda.so.1(CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, int const*)+0x241) [0x7fbc6db7e471]
/lib/x86_64-linux-gnu/libcuda.so.1(python: CudaFrontend.cpp:94: static const string& CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, const CUdevice*): Assertion `0' failed.

Aborted (core dumped)

AllentDan commented 6 months ago

From this log it looks like the process failed to communicate with the GPU device over PCI. Can you run other CUDA code normally? For example: python -c "import torch;print(torch.rand(1).cuda())"
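
A slightly fuller variant of the same sanity check is sketched below (standard torch calls only; the matmul size is arbitrary and only there to force a real kernel launch):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
# force CUDA context creation plus an actual kernel launch on the GPU
python -c "import torch; a = torch.rand(1024, 1024, device='cuda'); print((a @ a).mean().item())"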

AllentDan commented 6 months ago

Also, please paste the output of lmdeploy check_env.

jiaenyue commented 6 months ago

(lmdeploy) root@intern-studio-40059143:~/Study# lmdeploy check_env
sys.platform: linux
Python: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.7, V11.7.99
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:

LMDeploy: 0.2.3+
transformers: 4.37.1
gradio: 3.50.2
fastapi: 0.109.2
pydantic: 2.6.1

jiaenyue commented 6 months ago

(lmdeploy) root@intern-studio-40059143:~# python -c "import torch;print(torch.rand(1).cuda())"
tensor([0.3619], device='cuda:0')

jiaenyue commented 6 months ago

This machine works fine with lmdeploy==0.2.2; the error only appears after upgrading to 0.2.3. Some people suggested the installed torch 2.1.2 is too new, but torch 2.1.2 is the minimum that lmdeploy 0.2.3 requires.

>>> import torch
>>> torch.__version__
'2.1.2+cu121'

AllentDan commented 6 months ago

It feels like the CUDA versions are mixed up. In your environment the CUDA version is 11.7, while the installed lmdeploy presumably corresponds to CUDA 12.1. You can check which CUDA libraries the installed lmdeploy depends on, with something like objdump -x lmdeploy/lib/_turbomind.cpython-310-x86_64-linux-gnu.so | grep NEEDED
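
One way to run that check end to end is sketched below (the SO_DIR lookup via lmdeploy.__file__ is just for illustration; the .so filename matches the one reported later in this thread):

SO_DIR=$(python -c "import lmdeploy, os; print(os.path.join(os.path.dirname(lmdeploy.__file__), 'lib'))")
objdump -x "$SO_DIR"/_turbomind.cpython-310-x86_64-linux-gnu.so | grep NEEDED        # CUDA libraries the extension was linked against
ldd "$SO_DIR"/_turbomind.cpython-310-x86_64-linux-gnu.so | grep -E 'cudart|cublas'   # which library files actually resolve at runtime
python -c "import torch; print(torch.version.cuda)"                                  # CUDA version the installed torch was built with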

irexyc commented 6 months ago

The pytorch requirement should be torch<=2.1.2,>=2.0.0; 2.1.2 is not a minimum requirement.

If you used pip install lmdeploy, the package installed is lmdeploy+cu118; I am not sure whether the conflict comes from having multiple CUDA versions. Your installed pytorch is 2.1.2+cu121, so you could try reinstalling a cu118 build: pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118

jiaenyue commented 6 months ago

The CUDA libraries that lmdeploy depends on are as follows:

(lmdeploy) root@intern-studio-40059143:~# objdump -x ./.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lib/_turbomind.cpython-310-x86_64-linux-gnu.so | grep NEEDED
  NEEDED               libnccl.so.2
  NEEDED               libcudart.so.11.0
  NEEDED               libcublas.so.11
  NEEDED               libcublasLt.so.11
  NEEDED               libcurand.so.10
  NEEDED               libdl.so.2
  NEEDED               libstdc++.so.6
  NEEDED               libm.so.6
  NEEDED               libgcc_s.so.1
  NEEDED               libc.so.6
  NEEDED               ld-linux-x86-64.so.2

jiaenyue commented 6 months ago

Reinstalled the cu118 build with pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118, then reinstalled with pip install lmdeploy==0.2.3. lmdeploy chat reports the same error.

(lmdeploy) root@intern-studio-40059143:~# lmdeploy chat turbomind ~/Models/internlm/internlm2-chat-7b --tp 1
2024-02-21 14:40:04,725 - lmdeploy - WARNING - model_source: hf_model
2024-02-21 14:40:04,726 - lmdeploy - WARNING - kwargs tp is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-02-21 14:40:04,726 - lmdeploy - WARNING - kwargs cache_max_entry_count is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-02-21 14:40:09,417 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2-chat-7b
tensor_para_size = 1
head_num = 32
kv_head_num = 8
vocab_size = 92544
num_layer = 32
inter_size = 14336
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 128
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_logn_attn = 0

2024-02-21 14:40:09,921 - lmdeploy - WARNING - get 259 model params
2024-02-21 14:40:21,620 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2-chat-7b
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>> nihao

<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).

===============================================
Back trace dump:
/usr/local/harp/lib/libvirtdev-frontend.so.0(LogStream::PrintBacktrace()+0x52) [0x7f61482ea302]
/lib/x86_64-linux-gnu/libcuda.so.1(CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, int const*)+0x241) [0x7f6148519471]
/lib/x86_64-linux-gnu/libcuda.so.1(python: CudaFrontend.cpp:94: static const string& CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, const CUdevice*): Assertion `0' failed.
Aborted (core dumped)

Also, lmdeploy check_env no longer runs:

(lmdeploy) root@intern-studio-40059143:~# lmdeploy check_env

Traceback (most recent call last):
  File "/root/.conda/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 18, in run
    args.run(args)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/cli.py", line 131, in check_env
    env_info = collect_env()
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/mmengine/utils/dl_utils/collect_env.py", line 156, in collect_env
    import torchvision
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 7, in <module>
    import torchvision.extension  # noqa: F401
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torchvision/extension.py", line 92, in <module>
    _check_cuda_version()
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torchvision/extension.py", line 78, in _check_cuda_version
    raise RuntimeError(
RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA major versions. PyTorch has CUDA Version=12.1 and torchvision has CUDA Version=11.8. Please reinstall the torchvision that matches your PyTorch install.

irexyc commented 6 months ago

How about uninstalling pytorch and torchvision first and then reinstalling them. Don't reinstall lmdeploy; if you install other packages after installing pytorch, pytorch may get upgraded again.

The log at the very bottom contains this message:

RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA major versions. PyTorch has CUDA Version=12.1 and torchvision has CUDA Version=11.8. Please reinstall the torchvision that matches your PyTorch install.

jiaenyue commented 6 months ago

I recreated the conda environment:

conda create -n lmdeploy_0.2.3 python=3.10
conda activate lmdeploy_0.2.3
pip install lmdeploy
lmdeploy --version
lmdeploy chat turbomind ~/Models/internlm/internlm2-chat-7b --tp 1   # fails with the same error

pip uninstall torch torchvision
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
lmdeploy chat turbomind ~/Models/internlm/internlm2-chat-7b --tp 1   # fails with the same error

(lmdeploy_0.2.3) root@intern-studio-40059143:~# lmdeploy chat turbomind ~/Models/internlm/internlm2-chat-7b --tp 1
/root/.conda/envs/lmdeploy_0.2.3/lib/python3.10/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
2024-02-21 15:04:03,552 - lmdeploy - WARNING - model_source: hf_model
2024-02-21 15:04:03,552 - lmdeploy - WARNING - kwargs tp is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-02-21 15:04:03,552 - lmdeploy - WARNING - kwargs cache_max_entry_count is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-02-21 15:04:07,993 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2-chat-7b
tensor_para_size = 1
head_num = 32
kv_head_num = 8
vocab_size = 92544
num_layer = 32
inter_size = 14336
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 128
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_logn_attn = 0

2024-02-21 15:04:08,463 - lmdeploy - WARNING - get 259 model params
2024-02-21 15:04:16,284 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2-chat-7b
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>> nihao

<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).

===============================================
Back trace dump:
/usr/local/harp/lib/libvirtdev-frontend.so.0(LogStream::PrintBacktrace()+0x52) [0x7fcdb1fbd302]
/lib/x86_64-linux-gnu/libcuda.so.1(CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, int const*)+0x241) [0x7fcdb21ec471]
/lib/x86_64-linux-gnu/libcuda.so.1(python: CudaFrontend.cpp:94: static const string& CudaFeApiStateData::GetCurrentDevicePciBusId(Frontend*, const CUdevice*): Assertion `0' failed.
Aborted (core dumped)

irexyc commented 6 months ago

What is the output of nvidia-smi?

jiaenyue commented 6 months ago

[Screenshot: nvidia-smi output, 2024-02-21 3:35 PM]

irexyc commented 6 months ago

@jiaenyue

One more thought: could you print echo $LD_LIBRARY_PATH and take a look? If it contains /usr/local/cuda/lib64, which CUDA version is that?

pip install lmdeploy automatically installs the related CUDA runtime, but if LD_LIBRARY_PATH is set, the libraries found through LD_LIBRARY_PATH take priority. Check whether the conflict comes from there; you could try removing the CUDA-related paths from it.
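
A minimal sketch of that check and workaround (assuming /usr/local/cuda is the system CUDA path, as above; the change applies to the current shell only):

echo $LD_LIBRARY_PATH
# drop any CUDA-related entries, then try starting the server again
export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -v '/usr/local/cuda' | paste -sd:)
lmdeploy serve api_server ~/Models/internlm/internlm2-chat-7b --tp 1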

jiaenyue commented 6 months ago

echo $LD_LIBRARY_PATH returns nothing; the variable is not set in the environment.

The CUDA files in /usr/local/cuda/lib64/ are:

(lmdeploy_0.2.3) root@intern-studio-40059143:~# ll /usr/local/cuda/lib64/libcuda*
-rw-r--r-- 1 root root  865940 Jun 9 2022 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root      17 Jun 9 2022 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.11.0
lrwxrwxrwx 1 root root      20 Jun 9 2022 /usr/local/cuda/lib64/libcudart.so.11.0 -> libcudart.so.11.7.99
-rw-r--r-- 1 root root  671072 Jun 9 2022 /usr/local/cuda/lib64/libcudart.so.11.7.99
-rw-r--r-- 1 root root 1178522 Jun 9 2022 /usr/local/cuda/lib64/libcudart_static.a

The output of printenv is as follows:

(lmdeploy_0.2.3) root@intern-studio-40059143:~# printenv
SHELL=/bin/bash
no_proxy=localhost,127.0.0.1,0.0.0.0,172.18.47.140
CONDA_EXE=/root/.conda/bin/conda
_CE_M=
SSH_AUTH_SOCK=/tmp/ssh-I0aGu3InOj/agent.342
FB_USERNAME=40059143
PWD=/root
LOGNAME=root
CONDA_PREFIX=/root/.conda/envs/lmdeploy_0.2.3
AIDE_STORAGE_QUOTA=102400
MOTD_SHOWN=pam
HOME=/root
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:
CONDA_PROMPT_MODIFIER=(lmdeploy_0.2.3)
LC_TERMINAL=iTerm2
https_proxy=http://proxy.intern-ai.org.cn:50000
SSH_CONNECTION=127.0.0.1 47680 127.0.0.1 22
LESSCLOSE=/usr/bin/lesspipe %s %s
TERM=xterm-256color
_CE_CONDA=
LESSOPEN=| /usr/bin/lesspipe %s
USER=root
CONDA_SHLVL=2
LC_TERMINAL_VERSION=3.3.8
SHLVL=2
http_proxy=http://proxy.intern-ai.org.cn:50000
CONDA_PYTHON_EXE=/root/.conda/bin/python
LC_CTYPE=C.UTF-8
SSH_CLIENT=127.0.0.1 47680 22
CONDA_DEFAULT_ENV=lmdeploy_0.2.3
AIDE_INSTANCE_ID=20240207-0b9680e-40059143
HF_ENDPOINT=https://hf-mirror.com
PATH=/root/.local/bin:/root/.conda/envs/lmdeploy_0.2.3/bin:/root/.conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
QUOTA_STATUS=1
SSH_TTY=/dev/pts/0
CONDA_PREFIX_1=/root/.conda
AIDE_BASEURL=http://studio-in.intern-ai.org.cn
_=/usr/bin/printenv

irexyc commented 6 months ago

I can't think of a cause for the moment. The only difference in dependencies between 0.2.2 and 0.2.3 is here:

[Image: requirements diff between lmdeploy 0.2.2 and 0.2.3]

Does lmdeploy 0.2.2 run normally if you install it in this lmdeploy_0.2.3 environment?

jiaenyue commented 6 months ago

0.2.2 runs normally. I first noticed this problem when upgrading to 0.2.3 before the Spring Festival; during the holiday I also tried installing torch 2.1.0 and torch 2.1.1, and neither worked. I later asked around in the user group, and someone suspected that 0.2.3's requirement of torch >= 2.0.0 causes an incompatibility with the intern-studio system.

irexyc commented 6 months ago

I reproduced it on an openxlab machine. It is related to the pytorch version; the exact cause is still unclear. My guess is that it may be related to the vGPU.

I tried pytorch 2.0.0 and it works: pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118

jiaenyue commented 6 months ago

pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118 installs torch==2.0.0 successfully.

pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 lmdeploy also installs successfully, but it gives lmdeploy==0.2.2.

After pinning lmdeploy==0.2.3, it reports a conflict: pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 lmdeploy==0.2.3

ERROR: Cannot install lmdeploy==0.2.3 and torch==2.0.0 because these package versions have conflicting dependencies.

The conflict is caused by:
    torch 2.0.0 depends on triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64"
    lmdeploy 0.2.3 depends on triton<2.2.0 and >=2.1.0

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

irexyc commented 6 months ago

If you are running the turbomind engine on openxlab, install lmdeploy first and then install pytorch 2.0.0. The conflict can be ignored; triton is not used.

jiaenyue commented 6 months ago

It works now, thanks. The exact steps are as follows:

conda create -n lmdeploy-0.2.4 python==3.10 -y

conda activate lmdeploy-0.2.4

pip install lmdeploy

pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118

lmdeploy chat turbomind ~/Models/internlm/internlm2-chat-7b --tp 1

Everything runs normally now.