InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Qwen/Qwen2-1.5B error: floating point exception #2269

Closed: lpf6 closed this issue 1 month ago

lpf6 commented 2 months ago

Describe the bug

Running on a 2080 Ti crashes with: floating point exception (core dumped)

Reproduction

lmdeploy serve api_server Qwen/Qwen2-1.5B --server-port=8000 --tp=4 --model-name=default-model --max-batch-size=32 --session-len=32768 --log-level INFO

Environment

sys.platform: linux
Python: 3.11.9 (main, Jun 23 2024, 16:24:59) [GCC 14.1.1 20240522]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /opt/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (GCC) 14.1.1 20240522
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.2.post1+1fcdc42
transformers: 4.43.4
gradio: Not Found
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     SYS     SYS     0-11,24-35      0               N/A
GPU1    PHB      X      SYS     SYS     0-11,24-35      0               N/A
GPU2    SYS     SYS      X      PHB     12-23,36-47     1               N/A
GPU3    SYS     SYS     PHB      X      12-23,36-47     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 51087.75it/s]
2024-08-08 20:02:51,328 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='default-model', model_format=None, tp=4, session_len=32768, max_batch_size=32, cache_max_entry_count=0.8, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-08 20:02:51,328 - lmdeploy - INFO - input chat_template_config=None
2024-08-08 20:02:51,622 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-08-08 20:02:51,623 - lmdeploy - INFO - model_source: ModelSource.HF_MODEL
2024-08-08 20:02:51,623 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Device does not support bfloat16. Set float16 forcefully
2024-08-08 20:02:51,911 - lmdeploy - INFO - model_config:

[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 4
head_num = 12
kv_head_num = 2
vocab_size = 151936
num_layer = 28
inter_size = 8960
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 32768
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 32
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 4
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 131072
original_max_position_embeddings = 0
rope_scaling_type = 
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
low_freq_factor = 1.0
high_freq_factor = 1.0
use_logn_attn = 0
lora_policy = 
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern = 
lora_scale_pattern = 

[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 32768.
2024-08-08 20:02:53,824 - lmdeploy - WARNING - get 843 model params
2024-08-08 20:02:55,738 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_name='default-model', model_format=None, tp=4, session_len=32768, max_batch_size=32, cache_max_entry_count=0.8, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
zsh: floating point exception (core dumped)  lmdeploy serve api_server Qwen/Qwen2-1.5B --server-port=8000 --tp=4     INFO
irexyc commented 2 months ago

There was earlier feedback that updating the cublas version fixed this. You could try pip3 install nvidia-cublas-cu12==12.3.4.1

lpf6 commented 2 months ago

@irexyc I upgraded torch from 2.2.2 to 2.3.1 and also tried pip3 install nvidia-cublas-cu12==12.3.4.1, but the error is the same. I pulled the stack with gdb:


(gdb) backtrace 
#0  0x00007fc857b6aa58 in turbomind::BlockManager::GetBlockCount(unsigned long, double, std::function<unsigned long ()>) (
    block_size=block_size@entry=0, ratio=ratio@entry=0.80000001192092896, get_free_size=...)
    at /lmdeploy/src/turbomind/models/llama/BlockManager.cc:104
#1  0x00007fc857b6bf7b in turbomind::BlockManager::BlockManager(unsigned long, double, int, turbomind::IAllocator*, std::function<unsigned long ()>) (this=0x7fc7d3fa1740, block_size=0, block_count=0.80000001192092896, chunk_size=-1, allocator=<optimized out>, get_free_size=...)
    at /lmdeploy/src/turbomind/models/llama/BlockManager.cc:35
#2  0x00007fc857b6f9c7 in __gnu_cxx::new_allocator<turbomind::BlockManager>::construct<turbomind::BlockManager, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(turbomind::BlockManager*, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__p=0x7fc7d3fa1740, this=<optimized out>) at /opt/rh/devtoolset-9/root/usr/include/c++/9/new:174
#3  std::allocator_traits<std::allocator<turbomind::BlockManager> >::construct<turbomind::BlockManager, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::allocator<turbomind::BlockManager>&, turbomind::BlockManager*, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__p=0x7fc7d3fa1740, __a=...)
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/alloc_traits.h:484
#4  std::_Sp_counted_ptr_inplace<turbomind::BlockManager, std::allocator<turbomind::BlockManager>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__a=..., this=0x7fc7d3fa1730)
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h:548
#5  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<turbomind::BlockManager, std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(turbomind::BlockManager*&, std::_Sp_alloc_shared_tag<std::allocator<turbomind::BlockManager> >, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__a=..., 
    __p=<optimized out>, this=<optimized out>) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h:679
#6  std::__shared_ptr<turbomind::BlockManager, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::_Sp_alloc_shared_tag<std::allocator<turbomind::BlockManager> >, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__tag=..., this=<optimized out>)
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h:1344
#7  std::shared_ptr<turbomind::BlockManager>::shared_ptr<std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::_Sp_alloc_shared_tag<std::allocator<turbomind::BlockManager> >, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__tag=..., this=<optimized out>)
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h:359
#8  std::allocate_shared<turbomind::BlockManager, std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::allocator<turbomind::BlockManager> const&, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__a=...) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h:702
#9  std::make_shared<turbomind::BlockManager, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) ()
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h:718
#10 turbomind::SequenceManager::SequenceManager(unsigned long, turbomind::SequenceManager::BlockConfig const&, double, int, bool, int, turbomind::IAllocator*, std::function<unsigned long ()>) (this=0x7fc7d3904900, layer_num=<optimized out>, block_config=..., block_count=0.80000001192092896, 
    chunk_size=-1, enable_prefix_caching=<optimized out>, rank=<optimized out>, allocator=0x7fc7d38d9540, get_free_size=...)
    at /lmdeploy/src/turbomind/models/llama/SequenceManager.cc:32
#11 0x00007fc857b61019 in turbomind::LlamaBatch<__half>::LlamaBatch (this=0x7fc7d3a0cc30, params=..., cache_block_seq_len=64, 
    quant_policy=<optimized out>, model=0x7fc7d3a13a30) at /lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:965
#12 0x00007fc857b2ec3c in std::make_unique<turbomind::LlamaBatch<__half>, turbomind::EngineParams const&, int&, int&, turbomind::LlamaV2<__half>*>
    () at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/unique_ptr.h:856
#13 turbomind::LlamaV2<__half>::LlamaV2 (this=0x7fc7d3a13a30, head_num=32, kv_head_num=2, size_per_head=<optimized out>, 
    inter_size=<optimized out>, num_layer=40, vocab_size=151552, norm_eps=<optimized out>, attn_params=..., start_id=0, end_id=151329, 
    cache_block_seq_len=64, quant_policy=0, use_context_fmha=true, engine_params=..., lora_params=..., 
    shared_state=std::shared_ptr<turbomind::LlamaV2<__half>::SharedState> (use count 9, weak count 0) = {...}, weights=0x7fc7c029c520, 
    tensor_para=..., stream=0x7fc7d0000bd0, cublas_wrapper=0x7fc7d3d6df10, allocator=0x7fc7d38d9540, peer_alloctor=0x7fc7d021b5a0, 
    is_free_buffer_after_forward=false, cuda_device_prop=0x7fc7d3a13620) at /lmdeploy/src/turbomind/models/llama/LlamaV2.cc:103
#14 0x00007fc857af458a in std::make_unique<turbomind::LlamaV2<__half>, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, float&, turbomind::LlamaAttentionParams&, int&, int&, int&, int&, int&, turbomind::EngineParams&, turbomind::LoraParams&, std::shared_ptr<turbomind::LlamaV2<__half>::SharedState>&, turbomind::LlamaWeight<__half>*, turbomind::NcclParam&, CUstream_st*&, turbomind::cublasMMWrapper*, turbomind::Allocator<(turbomind::AllocatorType)0>*, turbomind::Allocator<(turbomind::AllocatorType)0>*, bool, cudaDeviceProp*> ()
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/unique_ptr.h:857
#15 LlamaTritonModel<__half>::createSharedModelInstance (this=this@entry=0x561075ae9710, device_id=device_id@entry=1, rank=rank@entry=1, 
    nccl_params={...}, custom_all_reduce_comm=std::shared_ptr<turbomind::AbstractCustomComm> (empty) = {...})
    at /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:336
#16 0x00007fc857af8b32 in LlamaTritonModel<__half>::createModelInstance (this=this@entry=0x561075ae9710, device_id=device_id@entry=1, 
    rank=rank@entry=1, stream=stream@entry=0x0, nccl_params={...}, 
    custom_all_reduce_comm=std::shared_ptr<turbomind::AbstractCustomComm> (empty) = {...})
    at /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:389
#17 0x00007fc857ab8b3f in <lambda(AbstractTransformerModel*, int, int, long int, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm>)>::operator() (__closure=<optimized out>, custom_all_reduce_comm=..., nccl_params={...}, stream_id=0, rank=1, deviceId=1, model=0x561075ae9710)
    at /lmdeploy/src/turbomind/python/bind.cpp:431
#18 pybind11::detail::argument_loader<AbstractTransformerModel*, int, int, long, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm> >::call_impl<std::unique_ptr<AbstractTransformerModelInstance>, pybind11_init__turbomind(pybind11::module_&)::<lambda(AbstractTransformerModel*, int, int, long int, std::pair<std::vector<turbomind::NcclParam>, std::vector<turbomind::NcclParam> >, std::shared_ptr<turbomind::AbstractCustomComm>)>&, 0, 1, 2, 3, 4, 5, pybind11::gil_scoped_release> (f=..., this=0x7fc7e67dc580)
    at /opt/conda/envs/py311/lib/python3.11/site-packages/pybind11/include/pybind11/cast.h:1613
#19 pybind11::detail::argument_loader<AbstractTransformerModel*, int, int, long, std::pair<std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> >, std::vector<turbomind::NcclParam, std::allocator<turbomind::NcclParam> > >, std::shared_ptr<turbomind::AbstractCustomComm> >::call<std::unique_ptr<AbstractTransformerModelInstance>, pybind11::gil_scoped_release, pybind11_init__turbomind(pybind11::module_&)::<lambda(AbstractTransformerModel*, int, int, long int, std::pair<std::vector<turbomind::NcclParam>, std::vector<turbomind::NcclParam> >, std::shared_ptr<turbomind::AbstractCustomComm>)>&> (f=..., this=0x7fc7e67dc580) at /opt/conda/envs/py311/lib/python3.11/site-packages/pybind11/include/pybind11/cast.h:1582
#20 <lambda(pybind11::detail::function_call&)>::operator() (this=0x0, call=...)
    at /opt/conda/envs/py311/lib/python3.11/site-packages/pybind11/include/pybind11/pybind11.h:296
#21 <lambda(pybind11::detail::function_call&)>::_FUN(pybind11::detail::function_call &) ()
    at /opt/conda/envs/py311/lib/python3.11/site-packages/pybind11/include/pybind11/pybind11.h:267
#22 0x00007fc857ad66b2 in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fc861b521b0, kwargs_in=0x0)
    at /opt/conda/envs/py311/lib/python3.11/site-packages/pybind11/include/pybind11/pybind11.h:987
#23 0x00007fc98f5d85c3 in ?? () from /usr/lib/libpython3.11.so.1.0
#24 0x00007fc98f5450f8 in _PyObject_MakeTpCall () from /usr/lib/libpython3.11.so.1.0
#25 0x00007fc98f6df81b in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#26 0x00007fc98f6e72f0 in ?? () from /usr/lib/libpython3.11.so.1.0
#27 0x00007fc98f663b76 in ?? () from /usr/lib/libpython3.11.so.1.0
#28 0x00007fc98f6e1fc9 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#29 0x00007fc98f6e72f0 in ?? () from /usr/lib/libpython3.11.so.1.0
#30 0x00007fc98f6e1fc9 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#31 0x00007fc98f6e72f0 in ?? () from /usr/lib/libpython3.11.so.1.0
#32 0x00007fc98f663cab in ?? () from /usr/lib/libpython3.11.so.1.0
#33 0x00007fc98f68f459 in ?? () from /usr/lib/libpython3.11.so.1.0

Does this look like the same problem as before? It seems to be an issue at /lmdeploy/src/turbomind/models/llama/BlockManager.cc:104.

lpf6 commented 2 months ago
(gdb) up 1
#1  0x00007fc857b6bf7b in turbomind::BlockManager::BlockManager(unsigned long, double, int, turbomind::IAllocator*, std::function<unsigned long ()>) (this=0x7fc7d3fa1740, block_size=0, block_count=0.80000001192092896, chunk_size=-1, allocator=<optimized out>, get_free_size=...)
    at /lmdeploy/src/turbomind/models/llama/BlockManager.cc:35
35      in /lmdeploy/src/turbomind/models/llama/BlockManager.cc
(gdb) up 1
#2  0x00007fc857b6f9c7 in __gnu_cxx::new_allocator<turbomind::BlockManager>::construct<turbomind::BlockManager, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(turbomind::BlockManager*, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__p=0x7fc7d3fa1740, this=<optimized out>) at /opt/rh/devtoolset-9/root/usr/include/c++/9/new:174
warning: 174    /opt/rh/devtoolset-9/root/usr/include/c++/9/new: No such file or directory
(gdb) up 1
#3  std::allocator_traits<std::allocator<turbomind::BlockManager> >::construct<turbomind::BlockManager, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::allocator<turbomind::BlockManager>&, turbomind::BlockManager*, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__p=0x7fc7d3fa1740, __a=...)
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/alloc_traits.h:484
warning: 484    /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/alloc_traits.h: No such file or directory
(gdb) up 1
#4  std::_Sp_counted_ptr_inplace<turbomind::BlockManager, std::allocator<turbomind::BlockManager>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__a=..., this=0x7fc7d3fa1730)
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h:548
warning: 548    /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h: No such file or directory
(gdb) up 1
#5  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<turbomind::BlockManager, std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(turbomind::BlockManager*&, std::_Sp_alloc_shared_tag<std::allocator<turbomind::BlockManager> >, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__a=..., 
    __p=<optimized out>, this=<optimized out>) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h:679
679     in /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h
(gdb) up 1
#6  std::__shared_ptr<turbomind::BlockManager, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::_Sp_alloc_shared_tag<std::allocator<turbomind::BlockManager> >, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__tag=..., this=<optimized out>)
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h:1344
1344    in /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr_base.h
(gdb) up 1
#7  std::shared_ptr<turbomind::BlockManager>::shared_ptr<std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::_Sp_alloc_shared_tag<std::allocator<turbomind::BlockManager> >, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__tag=..., this=<optimized out>)
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h:359
warning: 359    /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h: No such file or directory
(gdb) up 1
#8  std::allocate_shared<turbomind::BlockManager, std::allocator<turbomind::BlockManager>, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(std::allocator<turbomind::BlockManager> const&, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) (__a=...) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h:702
702     in /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h
(gdb) up 1
#9  std::make_shared<turbomind::BlockManager, unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&>(unsigned long&, double&, int&, turbomind::IAllocator*&, std::function<unsigned long ()>&) ()
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h:718
718     in /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/shared_ptr.h
(gdb) up 1
#10 turbomind::SequenceManager::SequenceManager(unsigned long, turbomind::SequenceManager::BlockConfig const&, double, int, bool, int, turbomind::IAllocator*, std::function<unsigned long ()>) (this=0x7fc7d3904900, layer_num=<optimized out>, block_config=..., block_count=0.80000001192092896, 
    chunk_size=-1, enable_prefix_caching=<optimized out>, rank=<optimized out>, allocator=0x7fc7d38d9540, get_free_size=...)
    at /lmdeploy/src/turbomind/models/llama/SequenceManager.cc:32
warning: 32     /lmdeploy/src/turbomind/models/llama/SequenceManager.cc: No such file or directory
(gdb) Quit
(gdb) Quit
(gdb) p block_config
$1 = (const turbomind::SequenceManager::BlockConfig &) @0x7fc7e67dbfd0: {head_dim_ = 128, head_num_ = 0, block_len_ = 64, t_bits_ = 0, 
  q_bits_ = 16}
(gdb) up 1
#11 0x00007fc857b61019 in turbomind::LlamaBatch<__half>::LlamaBatch (this=0x7fc7d3a0cc30, params=..., cache_block_seq_len=64, 
    quant_policy=<optimized out>, model=0x7fc7d3a13a30) at /lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:965
warning: 965    /lmdeploy/src/turbomind/models/llama/LlamaBatch.cc: No such file or directory
(gdb) Quit
(gdb) up 1
#12 0x00007fc857b2ec3c in std::make_unique<turbomind::LlamaBatch<__half>, turbomind::EngineParams const&, int&, int&, turbomind::LlamaV2<__half>*>
    () at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/unique_ptr.h:856
warning: 856    /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/unique_ptr.h: No such file or directory
(gdb) down 1
#11 0x00007fc857b61019 in turbomind::LlamaBatch<__half>::LlamaBatch (this=0x7fc7d3a0cc30, params=..., cache_block_seq_len=64, 
    quant_policy=<optimized out>, model=0x7fc7d3a13a30) at /lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:965
warning: 965    /lmdeploy/src/turbomind/models/llama/LlamaBatch.cc: No such file or directory
(gdb) p model
$2 = (turbomind::LlamaV2<__half> *) 0x7fc7d3a13a30
(gdb) p model->local_kv_head_num_
$3 = 0

Combined with the code, it looks like model->local_kv_head_num_ = 0 causes BlockConfig's head_num_ to be 0, so the divisor ends up 0 and it crashes.

lpf6 commented 2 months ago

I see now: local_kv_head_num_ = 0 comes from the initialization in the LlamaV2 constructor, local_kv_head_num_(kv_head_num / tensor_para.world_size_). According to the gdb output:

#13 turbomind::LlamaV2<__half>::LlamaV2 (this=0x7fc7d3a13a30, head_num=32, kv_head_num=2, size_per_head=<optimized out>, 
    inter_size=<optimized out>, num_layer=40, vocab_size=151552, norm_eps=<optimized out>, attn_params=..., start_id=0, end_id=151329, 
    cache_block_seq_len=64, quant_policy=0, use_context_fmha=true, engine_params=..., lora_params=..., 
    shared_state=std::shared_ptr<turbomind::LlamaV2<__half>::SharedState> (use count 9, weak count 0) = {...}, weights=0x7fc7c029c520, 
    tensor_para=..., stream=0x7fc7d0000bd0, cublas_wrapper=0x7fc7d3d6df10, allocator=0x7fc7d38d9540, peer_alloctor=0x7fc7d021b5a0, 
    is_free_buffer_after_forward=false, cuda_device_prop=0x7fc7d3a13620) at /lmdeploy/src/turbomind/models/llama/LlamaV2.cc:103
#14 0x00007fc857af458a in std::make_unique<turbomind::LlamaV2<__half>, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, float&, turbomind::LlamaAttentionParams&, int&, int&, int&, int&, int&, turbomind::EngineParams&, turbomind::LoraParams&, std::shared_ptr<turbomind::LlamaV2<__half>::SharedState>&, turbomind::LlamaWeight<__half>*, turbomind::NcclParam&, CUstream_st*&, turbomind::cublasMMWrapper*, turbomind::Allocator<(turbomind::AllocatorType)0>*, turbomind::Allocator<(turbomind::AllocatorType)0>*, bool, cudaDeviceProp*> ()
    at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/unique_ptr.h:857

kv_head_num = 2, and since my command was lmdeploy serve api_server Qwen/Qwen2-1.5B --server-port=8000 --tp=4 --model-name=default-model --max-batch-size=32 --session-len=32768 --log-level INFO, tensor_para.world_size_ should be 4, so the integer division yields 0. This boundary condition should be handled @irexyc
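
For illustration, here is the failing arithmetic as a minimal Python sketch (the real code is C++ in turbomind, where integer division truncates just like Python's //; the guard at the end is only a hypothetical fix, not lmdeploy's actual check):

# Minimal sketch of the crash path, not the actual turbomind C++ code.
kv_head_num = 2   # Qwen2-1.5B has 2 KV heads (see model_config above)
tp = 4            # tensor_para.world_size_ from --tp=4

local_kv_head_num = kv_head_num // tp  # truncating division, as in C++ -> 0
# head_num_ = 0 makes the KV-cache block size 0, and GetBlockCount then
# divides the free memory by that block size -> SIGFPE (core dumped).

# A hypothetical up-front guard would fail fast instead:
if kv_head_num % tp != 0:
    raise ValueError(f"kv_head_num={kv_head_num} is not divisible by tp={tp}")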

lpf6 commented 2 months ago

I changed the command from

lmdeploy serve api_server Qwen/Qwen2-1.5B --server-port=8000 --tp=4 --model-name=default-model --max-batch-size=32 --session-len=32768 --log-level INFO

to

lmdeploy serve api_server Qwen/Qwen2-1.5B --server-port=8000 --tp=2 --model-name=default-model --max-batch-size=32 --session-len=32768 --log-level INFO

Reducing --tp from 4 to 2 lets it start successfully, but a new error appears when handling requests:

lmdeploy serve api_server Qwen/Qwen2-1.5B --server-port=8000 --tp=1 --model-name=default-model --max-batch-size=32
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 32313.59it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                                     
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
INFO:     Started server process [579577]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:35988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR]  Assertion fail: /lmdeploy/src/turbomind/kernels/attention/attention.cu:35 

zsh: IOT instruction (core dumped)  lmdeploy serve api_server Qwen/Qwen2-1.5B --server-port=8000 --tp=1
lzhangzz commented 2 months ago

Probably because the 2080 Ti doesn't support bf16.

lpf6 commented 2 months ago

Probably because the 2080 Ti doesn't support bf16.

@lzhangzz Can bf16 be converted to f16 using convert or some other method?
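
(For reference, one way to get an fp16 copy of the checkpoint without lmdeploy's convert tool is to re-save it with the standard transformers API; a hedged sketch, with the output path as a placeholder:)

# Hedged workaround sketch: load the bf16 checkpoint, cast to fp16, and
# save a local copy that can then be passed to `lmdeploy serve api_server`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B", torch_dtype=torch.float16)  # cast weights to fp16
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")

model.save_pretrained("./qwen2-1.5b-fp16")      # example output path
tokenizer.save_pretrained("./qwen2-1.5b-fp16")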

lpf6 commented 2 months ago

Besides, running Qwen/Qwen2-7B works for me:

lmdeploy serve api_server Qwen/Qwen2-7B --server-port=8000 --tp=4 --model-name=default-model --max-batch-size=4 --session-len=32768 

The log also mentions:

Fetching 14 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 473.79it/s]
Device does not support bfloat16. Set float16 forcefully
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                                                   
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
INFO:     Started server process [2802635]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

So the conversion is performed there, but the Qwen/Qwen2-1.5B log doesn't have this line.

Alwin4Zhang commented 2 months ago
[screenshot of the error output]

The card is an A40; both the 1.5B and 0.5B Qwen2 models fail with this error.

iamyanyanyan commented 2 months ago

There was earlier feedback that updating the cublas version fixed this. You could try pip3 install nvidia-cublas-cu12==12.3.4.1

Which torch version is this meant for?

lvhan028 commented 2 months ago

Probably because the 2080 Ti doesn't support bf16.

@lzhangzz Can bf16 be converted to f16 using convert or some other method?

Hi @lpf6, the 2080 Ti doesn't support bf16, but lmdeploy forces fp16 for inference anyway. For qwen2-1.5b, I ran it successfully on a single 2080, but I don't have a multi-GPU environment, so I can't reproduce your problem. With TP, lmdeploy requires both attention_head_num and kv_head_num to be divisible by tp. This constraint is checked in v0.6.0a0.
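
That constraint can also be checked before launching the server; a small hypothetical pre-flight helper against the model's standard HF config.json fields (num_attention_heads is 12 and num_key_value_heads is 2 for Qwen2-1.5B):

# Hypothetical pre-flight check mirroring the v0.6.0a0 constraint:
# both head counts must be divisible by the TP degree.
import json

def check_tp(config_path: str, tp: int) -> None:
    with open(config_path) as f:
        cfg = json.load(f)
    for key in ("num_attention_heads", "num_key_value_heads"):
        if cfg[key] % tp != 0:
            raise ValueError(f"{key}={cfg[key]} is not divisible by tp={tp}")

check_tp("config.json", tp=2)  # passes for Qwen2-1.5B (12 % 2, 2 % 2 == 0)
check_tp("config.json", tp=4)  # raises: num_key_value_heads=2, not divisible by 4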

lvhan028 commented 2 months ago

@Alwin4Zhang From your screenshot, the tokenizer raised the error. Please first use AutoTokenizer from transformers to verify that the model's encode and decode both work correctly.
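
A quick round-trip check along those lines (a minimal sketch; any sample text works):

# Tokenizer sanity check: encode a sample string and decode it back.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
ids = tokenizer.encode("Hello, how are you?")
print(ids)                    # token ids
print(tokenizer.decode(ids))  # should reproduce the original text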

lpf6 commented 2 months ago

Probably because the 2080 Ti doesn't support bf16.

@lzhangzz Can bf16 be converted to f16 using convert or some other method?

Hi @lpf6, the 2080 Ti doesn't support bf16, but lmdeploy forces fp16 for inference anyway. For qwen2-1.5b, I ran it successfully on a single 2080, but I don't have a multi-GPU environment, so I can't reproduce your problem. With TP, lmdeploy requires both attention_head_num and kv_head_num to be divisible by tp. This constraint is checked in v0.6.0a0.

OK, I'll try running it on a single GPU tonight.

huu3301 commented 1 month ago

@lvhan028

Environment: CentOS, V100 GPU, Driver Version: 550.54.14, lmdeploy==0.6.0, torch==2.3.1, transformers==4.44.2

Deploying InternVL2-8B with NVIDIA_VISIBLE_DEVICES=1 lmdeploy serve api_server /data/models/OpenGVLab/InternVL2-8B --tp 1 --server-port 11251 --model-name InternVL2-8B --cache-max-entry-count 0.25 produces an error at inference time: Assertion fail: /lmdeploy/src/turbomind/kernels/attention/attention.cu:35

Things I have tried:

1. Downgrading lmdeploy to 0.5.3: not resolved
2. Upgrading nvidia-cublas-cu12 to 12.1.3.1: not resolved
3. Upgrading torch to 2.4.1: not resolved
4. Downgrading torch to 2.2.2: resolved

So this seems related to the torch version. Will newer torch versions be supported in the future?

github-actions[bot] commented 1 month ago

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

github-actions[bot] commented 1 month ago

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.