InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] 2x4090 with Llama2 70B silently crashes (i.e. without any error message in DEBUG mode) as of v0.6.0a0 and v0.6.0 (but works fine in previous versions) #2468

Open | josephrocca opened this issue 2 months ago

josephrocca commented 2 months ago

Describe the bug

Llama2 70B works fine on a dual RTX 4090 machine in v0.5.3, but fails in v0.6.0a0 and v0.6.0. There is no error message given, even with --log-level DEBUG.

Reproduction

I'm testing on Runpod, using the official Docker images from here: https://hub.docker.com/r/openmmlab/lmdeploy/tags

lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4  --log-level DEBUG
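
The server starts normally and then dies silently once the first completion request is sent. The exact client call isn't included in this report, but a request along these lines (a sketch with parameters reconstructed from the gen_config line visible in the traceback below, not the verbatim original) is what was in flight when it crashed:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "lmdeploy/llama2-chat-70b-4bit", "prompt": "USER:\nWrite a 2 sentence story about a cat.\nASSISTANT:\n", "max_tokens": 1024, "temperature": 0.8, "top_p": 0.9}'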

Environment

sys.platform: linux
Python: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.18.0+cu121
LMDeploy: 0.6.0+e2aa4bd
transformers: 4.44.2
gradio: 4.44.0
fastapi: 0.114.1
pydantic: 2.9.1
triton: 2.3.0
NVIDIA Topology: 
    GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  SYS SYS SYS 0-15,32-47  0       N/A
GPU1    SYS  X  SYS SYS 16-31,48-63 1       N/A
NIC0    SYS SYS  X  PIX             
NIC1    SYS SYS PIX  X              

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

Error traceback

[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7561dcd35400 with size 512
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x10fc81c400 with size 234881024
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x52881c400 with size 234881024
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x110a81c400 with size 469762048
[TM][DEBUG] malloc buffer 0x53681c400 with size 469762048
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [Gemm2] 3072
[TM][INFO] [Gemm2] 4096
[TM][INFO] [Gemm2] 4096
[TM][INFO] [Gemm2] 6144
[TM][INFO] [Gemm2] 6144
[TM][INFO] [Gemm2] 8192
[TM][INFO] [Gemm2] 8192
[TM][INFO] [Gemm2] Tuning finished in 12.05 seconds.
[TM][INFO] [Gemm2] Tuning finished in 12.07 seconds.
[TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool)
[TM][DEBUG] Free buffer 0x52881c400
[TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool)
[TM][DEBUG] Free buffer 0x53681c400
[TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool)
[TM][DEBUG] Free buffer 0x10fc81c400
[TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool)
[TM][DEBUG] Free buffer 0x110a81c400
2024-09-14 09:37:11,470 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_format='awq', tp=2, session_len=8192, max_batch_size=128, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=True, quant_policy=4, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=128, max_prefill_iters=80)
[TM][DEBUG] turbomind::Allocator<turbomind::AllocatorType::CUDA>::Allocator(int, bool)
[... the Allocator constructor debug line above repeats 127 more times ...]
HINT:    Please open http://0.0.0.0:3000 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:3000 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:3000 in a browser for detailed api usage!!!
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:3000 (Press CTRL+C to quit)
INFO:     100.64.0.22:44304 - "OPTIONS /v1/completions HTTP/1.1" 200 OK
INFO:     100.64.0.22:44304 - "POST /v1/completions HTTP/1.1" 200 OK
2024-09-14 09:38:41,230 - lmdeploy - INFO - prompt='USER:\nWrite a 2 sentence story about a cat.\nASSISTANT:\n', gen_config=GenerationConfig(n=1, max_new_tokens=1024, do_sample=True, top_p=0.9, top_k=40, min_p=0.12, temperature=0.8, repetition_penalty=1.0, ignore_eos=False, random_seed=9617529033241704167, stop_words=[], bad_words=None, stop_token_ids=None, bad_token_ids=None, min_new_tokens=None, skip_special_tokens=True, logprobs=None, response_format=None, logits_processors=None), prompt_token_id=[1, 3148, 1001, 29901, 13, 6113, 263, 29871, 29906, 10541, 5828, 1048, 263, 6635, 29889, 13, 22933, 9047, 13566, 29901, 13], adapter_name=None.
2024-09-14 09:38:41,230 - lmdeploy - INFO - session_id=1, history_tokens=0, input_tokens=21, max_new_tokens=1024, seq_start=True, seq_end=True, step=0, prep=False
2024-09-14 09:38:41,230 - lmdeploy - INFO - Register stream callback for 1
[TM][DEBUG] Set logger level by DEBUG
[TM][DEBUG] std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char>, triton::Tensor> > LlamaTritonModelInstance<T>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char>, triton::Tensor> >, turbomind::AbstractInstanceComm*) [with T = __half]
[TM][DEBUG] std::unordered_map<std::__cxx11::basic_string<char>, turbomind::Tensor> LlamaTritonModelInstance<T>::convert_inputs(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char>, triton::Tensor> >) [with T = __half]
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: CORRID
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = long unsigned int] start
[TM][DEBUG] getVal with type x, but data type is: u8
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = long unsigned int; size_t = long unsigned int] start
[TM][DEBUG] getVal with type x, but data type is: u8
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: START
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: END
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: STOP
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][DEBUG] Set logger level by DEBUG
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_lengths
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][INFO] [ProcessInferRequests] Request for 1 received.
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: step
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_lengths
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_ids
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_embedding_ranges
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_embedding_ranges
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: request_output_len
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] getVal with type i4, but data type is: u4
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] getVal with type i4, but data type is: u4
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: random_seed
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = long long unsigned int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = long long unsigned int; size_t = long unsigned int] start
[TM][DEBUG] Set logger level by DEBUG
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: step
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_lengths
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_ids
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_embedding_ranges
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_embedding_ranges
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: request_output_len
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] getVal with type i4, but data type is: u4
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] getVal with type i4, but data type is: u4
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: random_seed
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = long long unsigned int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = long long unsigned int; size_t = long unsigned int] start
[TM][DEBUG] void turbomind::LlamaV2<T>::forwardUnified(T*, T*, T*, void**, const int*, const int*, const int*, const int*, const float*, const bool*, size_t, int, int, int*, const turbomind::Sequence**) [with T = __half; size_t = long unsigned int]
[TM][DEBUG] void turbomind::ftNcclAllGather(const T*, T*, int, int, turbomind::NcclParam, cudaStream_t) [with T = __half; cudaStream_t = CUstream_st*] start
[TM][DEBUG] ncclDataType_t turbomind::getNcclDataType() [with T = __half] start
[TM][INFO] ------------------------- step = 20 -------------------------
[TM][INFO] [Forward] [0, 1), dc=0, pf=1, sum_q=21, sum_k=21, max_q=21, max_k=21
[TM][DEBUG] void turbomind::LlamaV2<T>::forwardUnified(T*, T*, T*, void**, const int*, const int*, const int*, const int*, const float*, const bool*, size_t, int, int, int*, const turbomind::Sequence**) [with T = __half; size_t = long unsigned int]
[TM][DEBUG] void turbomind::ftNcclAllGather(const T*, T*, int, int, turbomind::NcclParam, cudaStream_t) [with T = __half; cudaStream_t = CUstream_st*] start
[TM][DEBUG] ncclDataType_t turbomind::getNcclDataType() [with T = __half] start

==========
== CUDA ==
==========

CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

irexyc commented 2 months ago

Have you set TM_DEBUG_LEVEL=DEBUG (while keeping --log-level DEBUG)? It inserts a sync op between CUDA functions, which will be helpful for finding the root cause.
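
For example, keeping the serve command from the report unchanged and only adding the environment variable (a sketch; when running the official Docker images it can equally be passed with -e TM_DEBUG_LEVEL=DEBUG on docker run):

TM_DEBUG_LEVEL=DEBUG lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level DEBUG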

josephrocca commented 2 months ago

Hi @irexyc, thanks for your speedy response. I just tried it and, "unfortunately", it fixes the issue :sweat_smile: That is, inference works fine on v0.6.0 with 2x4090 if the TM_DEBUG_LEVEL=DEBUG environment variable is set.

Is there anything else you'd like me to try?

fanghostt commented 2 months ago

I'm hitting the same problem: the model only runs if TM_DEBUG_LEVEL=DEBUG is set. Otherwise, with multiple GPUs, one card gets stuck at 100% utilization and locks up.
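
To observe the lockup described above while a request is in flight, GPU utilization can be monitored from a second terminal, e.g.:

watch -n 1 nvidia-smi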

lzhangzz commented 2 months ago

@josephrocca @fanghostt

Can you reproduce it with other models? I can't reproduce it with Qwen2-7B-AWQ or Llama3-70B-AWQ with v0.6.0 on 2 RTX 4090 GPUs.

fanghostt commented 2 months ago

> @josephrocca @fanghostt
>
> Can you reproduce it with other models? I can't reproduce it with Qwen2-7B-AWQ or Llama3-70B-AWQ with v0.6.0 on 2 RTX 4090 GPUs.

The same problem shows up on 2x A100 GPUs with the Qwen2-72B-Instruct-GPTQ-Int4 and InternVL2-40B-AWQ models on lmdeploy v0.6.0. Our environment may be a little different: we use Orion vGPU rather than physical machines. Any advice on how to solve this?

josephrocca commented 2 months ago

> I can't reproduce it with Qwen2-7B-AWQ or Llama3-70B-AWQ with v0.6.0 on 2 RTX 4090 GPUs

@lzhangzz Note that this bug report is about Llama 2 70B. Can you try with Llama 2 70B AWQ instead of Llama 3? Here's my command again from the original post for convenience:

lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4  --log-level DEBUG

lvhan028 commented 2 months ago

@josephrocca Is it only reproducible using llama2-chat-70b-4bit?

lzhangzz commented 2 months ago

@josephrocca

Sorry for the confusion. Internet access is quite limited in our 4090 environment, so I started with what I already had on the machine.

josephrocca commented 2 months ago

@lvhan028 I have tested multiple Llama 2 70B AWQ models (not just lmdeploy/llama2-chat-70b-4bit), across multiple GPU types. Unfortunately I haven't tested Llama 3 70B.

(I did try testing Llama 3 70B on 2x4090 just now, but for some reason hit a separate problem with an explicit OOM error - likely an unrelated issue that I just need to spend time debugging. I will look into it tomorrow and open a separate issue if needed, but it's probably something wrong on my end.)

lzhangzz commented 2 months ago

@josephrocca In my test with Llama3 70B AWQ on 2x4090, --cache-max-entry-count 0.5 is needed to avoid OOM.
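
Applied to the serve command, that would look something like the following (a sketch; the model path is a placeholder for whichever Llama3 70B AWQ checkpoint is being tested):

lmdeploy serve api_server <path-or-repo-of-llama3-70b-awq> --tp 2 --model-format awq --cache-max-entry-count 0.5 --server-port 3000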