InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Segmentation fault: address not mapped to object at address 0x2058 #1849

Open · austingg opened this issue 4 days ago

austingg commented 4 days ago


Describe the bug

Segmentation fault when I deploy InternVL-Chat-V1-5-AWQ on T4 GPUs using the openmmlab/lmdeploy:latest image.

Reproduction

lmdeploy serve gradio /models/InternVL-Chat-V1-5-AWQ/ --model-format awq --tp 2
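
For cross-checking, the same engine options can be exercised through lmdeploy's Python API (a minimal sketch built from the documented pipeline/TurbomindEngineConfig entry points; the model path is the one above):

```python
# Sketch: load the same model with the same engine options via the Python API,
# to see whether the crash is specific to the serve/gradio entry points.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    '/models/InternVL-Chat-V1-5-AWQ/',
    backend_config=TurbomindEngineConfig(model_format='awq', tp=2),
)
print(pipe('Hello'))  # text-only smoke test
```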

Environment

sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.16.0+cu118
LMDeploy: 0.4.2+54b7230
transformers: 4.41.1
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.1.0

Error traceback

2024-06-25 09:17:20,825 - lmdeploy - INFO - using InternVL-Chat-V1-5 vision preprocess
2024-06-25 09:17:20,828 - lmdeploy - INFO - start ImageEncoder._forward_loop
2024-06-25 09:17:20,828 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name=None, model_format='awq', tp=2, session_len=8192, max_batch_size=128, cache_max_entry_count=0.8, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-06-25 09:17:20,829 - lmdeploy - INFO - input chat_template_config=ChatTemplateConfig(model_name=None, system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
2024-06-25 09:17:20,829 - lmdeploy - INFO - matched chat template name: internvl-internlm2
2024-06-25 09:17:20,843 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='internvl-internlm2', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
2024-06-25 09:17:20,844 - lmdeploy - WARNING - model_source: hf_model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-25 09:17:21,380 - lmdeploy - WARNING - model_config:

[llama]
model_name = internvl-internlm2
tensor_para_size = 2
head_num = 48
kv_head_num = 8
vocab_size = 92553
num_layer = 48
inter_size = 16384
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 8192
weight_type = int4
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 128
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 3.0
use_dynamic_ntk = 1
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =

[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 8192.
[TM][WARNING] pad vocab size from 92553 to 92554
[TM][WARNING] pad vocab size from 92553 to 92554
2024-06-25 09:17:21,984 - lmdeploy - WARNING - get 867 model params
2024-06-25 09:17:32,746 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_name=None, model_format='awq', tp=2, session_len=8192, max_batch_size=128, cache_max_entry_count=0.8, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[46033e227edb:1063 :0:1192] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2058)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
==== backtrace (tid:   1192) ====
 0 0x0000000000043090 killpg()  ???:0
 1 0x00000000000a017f cublasSetStream_v2()  ???:0
 2 0x00000000000dccbe LlamaTritonModel<__half>::createSharedModelInstance()  /opt/lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:292
 3 0x00000000000e33c2 LlamaTritonModel<__half>::createModelInstance()  /opt/lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:373
 4 0x00000000000e33c2 std::unique_ptr<LlamaTritonSharedModelInstance<__half>, std::default_delete<LlamaTritonSharedModelInstance<__half> > >::get()  /usr/include/c++/9/bits/unique_ptr.h:361
 5 0x00000000000e33c2 std::__shared_ptr<LlamaTritonSharedModelInstance<__half>, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<LlamaTritonSharedModelInstance<__half>, std::default_delete<LlamaTritonSharedModelInstance<__half> >, void>()  /usr/include/c++/9/bits/shared_ptr_base.h:1204
 6 0x00000000000e33c2 std::__shared_ptr<LlamaTritonSharedModelInstance<__half>, (__gnu_cxx::_Lock_policy)2>::operator=<LlamaTritonSharedModelInstance<__half>, std::default_delete<LlamaTritonSharedModelInstance<__half> > >()  /usr/include/c++/9/bits/shared_ptr_base.h:1281
 7 0x00000000000e33c2 std::shared_ptr<LlamaTritonSharedModelInstance<__half> >::operator=<LlamaTritonSharedModelInstance<__half>, std::default_delete<LlamaTritonSharedModelInstance<__half> > >()  /usr/include/c++/9/bits/shared_ptr.h:351
 8 0x00000000000e33c2 LlamaTritonModel<__half>::createModelInstance()  /opt/lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:373
 9 0x000000000009fa7f pybind11::cpp_function::initialize<pybind11_init__turbomind(pybind11::module_&)::{lambda(AbstractTransformerModel*, int, int, long, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm>)#12}, std::unique_ptr<AbstractTransformerModelInstance, std::default_delete<AbstractTransformerModelInstance> >, AbstractTransformerModel*, int, int, long, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm>, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg_v>(pybind11_init__turbomind(pybind11::module_&)::{lambda(AbstractTransformerModel*, int, int, long, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm>)#12}&&, std::unique_ptr<AbstractTransformerModelInstance, std::default_delete<AbstractTransformerModelInstance> > (*)(AbstractTransformerModel*, int, int, long, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm>), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  /opt/lmdeploy/src/turbomind/python/bind.cpp:431
10 0x000000000009fa7f call_impl<std::unique_ptr<AbstractTransformerModelInstance>, pybind11_init__turbomind(pybind11::module_&)::<lambda(AbstractTransformerModel*, int, int, long int, std::pair<std::vector<turbomind::NcclParam>, std::vector<turbomind::NcclParam> >, std::shared_ptr<turbomind::AbstractCustomComm>)>&, 0, 1, 2, 3, 4, 5, pybind11::gil_scoped_release>()  /opt/py38/lib/python3.8/site-packages/pybind11/include/pybind11/cast.h:1613
11 0x000000000009fa7f call<std::unique_ptr<AbstractTransformerModelInstance>, pybind11::gil_scoped_release, pybind11_init__turbomind(pybind11::module_&)::<lambda(AbstractTransformerModel*, int, int, long int, std::pair<std::vector<turbomind::NcclParam>, std::vector<turbomind::NcclParam> >, std::shared_ptr<turbomind::AbstractCustomComm>)>&>()  /opt/py38/lib/python3.8/site-packages/pybind11/include/pybind11/cast.h:1582
12 0x000000000009fa7f operator()()  /opt/py38/lib/python3.8/site-packages/pybind11/include/pybind11/pybind11.h:296
13 0x000000000009fa7f _FUN()  /opt/py38/lib/python3.8/site-packages/pybind11/include/pybind11/pybind11.h:267
14 0x00000000000bf0ff pybind11::cpp_function::dispatcher()  /opt/py38/lib/python3.8/site-packages/pybind11/include/pybind11/pybind11.h:987
15 0x00000000005d5499 PyCFunction_Call()  ???:0
16 0x00000000005d6066 _PyObject_MakeTpCall()  ???:0
17 0x00000000004e22b3 PyMethod_New()  ???:0
18 0x000000000054c8a9 _PyEval_EvalFrameDefault()  ???:0
19 0x00000000005d5846 _PyFunction_Vectorcall()  ???:0
20 0x00000000004e1b5c PyMethod_New()  ???:0
21 0x00000000005d4c12 PyObject_Call()  ???:0
22 0x0000000000548a66 _PyEval_EvalFrameDefault()  ???:0
23 0x00000000005d5846 _PyFunction_Vectorcall()  ???:0
24 0x0000000000547447 _PyEval_EvalFrameDefault()  ???:0
25 0x00000000005d5846 _PyFunction_Vectorcall()  ???:0
26 0x0000000000547447 _PyEval_EvalFrameDefault()  ???:0
27 0x00000000005d5846 _PyFunction_Vectorcall()  ???:0
28 0x00000000004e1b5c PyMethod_New()  ???:0
29 0x00000000005d4c12 PyObject_Call()  ???:0
30 0x0000000000643efc PyInit__thread()  ???:0
31 0x000000000066a408 _PyFloat_FormatAdvancedWriter()  ???:0
32 0x0000000000008609 start_thread()  ???:0
33 0x000000000011f133 clone()  ???:0
=================================
Segmentation fault (core dumped)
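
The backtrace puts the fault inside cublasSetStream_v2 while LlamaTritonModel::createSharedModelInstance is setting up per-GPU resources. A quick lmdeploy-independent probe of the same pattern (a sketch; it assumes only PyTorch and two visible GPUs):

```python
# Sketch: force CUDA context and cuBLAS handle creation on each T4, roughly
# the step where the backtrace above faults, then check peer access.
import torch

assert torch.cuda.device_count() >= 2, 'tp=2 needs two visible GPUs'
for dev in (0, 1):
    a = torch.randn(64, 64, device=f'cuda:{dev}', dtype=torch.half)
    b = a @ a  # half-precision matmul goes through cuBLAS
    torch.cuda.synchronize(dev)
    print(f'cuda:{dev} cuBLAS OK')
print('P2P 0<->1:', torch.cuda.can_device_access_peer(0, 1))
```

If both matmuls succeed, the per-device cuBLAS path is healthy and the fault more likely involves how the two devices are wired together at engine start.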
irexyc commented 4 days ago

Can you run this command successfully: lmdeploy serve gradio internlm/internlm2-chat-7b-4bits --model-format awq --tp 2

austingg commented 4 days ago

lmdeploy serve gradio /models/Mini-InternVL-Chat-2B-V1-5-AWQ --model-format awq --server-port 10000 works.

I need to download the internlm/internlm2-chat-7b-4bits model first; please wait a while.

irexyc commented 4 days ago

You can try lmdeploy serve gradio /models/Mini-InternVL-Chat-2B-V1-5-AWQ --model-format awq --server-port 10000 --tp 2

I think it might be related to NCCL.
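
A standalone two-rank all-reduce is a quick way to test NCCL outside lmdeploy (a sketch in plain PyTorch; the address and port are arbitrary). Running it with NCCL_DEBUG=INFO in the environment also makes NCCL print its initialization log.

```python
# Sketch: minimal 2-GPU NCCL all-reduce, independent of lmdeploy.
# If this hangs or segfaults as well, the problem is in the NCCL/driver stack.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size=2):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'  # arbitrary free port
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f'cuda:{rank}')
    dist.all_reduce(x)  # expect 2.0 on both ranks
    print(f'rank {rank}: {x.item()}')
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, nprocs=2)
```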

austingg commented 4 days ago

lmdeploy serve gradio /models/Mini-InternVL-Chat-2B-V1-5-AWQ --model-format awq --server-port 10000 --tp 2 works too.
Besides, lmdeploy chat /models/InternVL-Chat-V1-5-AWQ/ --model-format awq --tp 2 works. Only serve api_server/gradio crashes.