InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] InternVL2-Llama3-76B on 8x V100 raises an error that flash attention is not supported #2067

Closed thesby closed 1 month ago

thesby commented 1 month ago

Checklist

Describe the bug

I launched InternVL2-Llama3-76B on 8 V100 GPUs, and it raises an error at runtime.

Reproduction

python -m lmdeploy serve api_server InternVL2-Llama3-76B --model-name internvl2-internlm2 --tp 8 --quant-policy 8

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.52
GCC: gcc (GCC) 10.2.1 20200825 (Alibaba 10.2.1-3 2.17)
PyTorch: 2.2.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.0+cu121
LMDeploy: 0.5.1+
transformers: 4.42.4
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     SYS     SYS     NV2     SYS     0-39,48-87              N/A
GPU1    NV1      X      NV2     NV1     SYS     SYS     SYS     NV2     0-39,48-87              N/A
GPU2    NV1     NV2      X      NV2     NV1     SYS     SYS     SYS     0-39,48-87              N/A
GPU3    NV2     NV1     NV2      X      SYS     NV1     SYS     SYS     0-39,48-87              N/A
GPU4    SYS     SYS     NV1     SYS      X      NV2     NV1     NV2     0-39,48-87              N/A
GPU5    SYS     SYS     SYS     NV1     NV2      X      NV2     NV1     0-39,48-87              N/A
GPU6    NV2     SYS     SYS     SYS     NV1     NV2      X      NV1     0-39,48-87              N/A
GPU7    SYS     NV2     SYS     SYS     NV2     NV1     NV1      X      0-39,48-87              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

![20240718102604](https://github.com/user-attachments/assets/640e54c8-80cc-4d59-b4b4-57b24f1d1e23)
thesby commented 1 month ago
 File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 22, in _raise_exception_on_finish
    task.result()
  File "/opt/conda/envs/python3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 151, in forward
    outputs = self.model.forward(inputs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 172, in forward
    return self._forward_func(images)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 153, in _forward_v1_5
    outputs = self.model.extract_feature(outputs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_internvl_chat.py", line 168, in extract_feature
    vit_embeds = self.vision_model(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 418, in forward
    encoder_outputs = self.encoder(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 354, in forward
    layer_outputs = encoder_layer(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 296, in forward
    hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states)) * self.ls1)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 252, in forward
    x = self._naive_attn(hidden_states) if not self.use_flash_attn else self._flash_attn(hidden_states)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 244, in _flash_attn
    context, _ = self.inner_attn(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 77, in forward
    output = flash_attn_unpadded_qkvpacked_func(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 887, in flash_attn_varlen_qkvpacked_func
    return FlashAttnVarlenQKVPackedFunc.apply(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 288, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 85, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
irexyc commented 1 month ago

You could try uninstalling flash_attn and see whether that helps.

zhyncs commented 1 month ago

FlashAttention requires the Ampere architecture (sm_80) or newer; V100 (Volta, sm_70) is not supported.
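
A minimal sketch for checking this locally, assuming only PyTorch with CUDA: it prints each visible GPU's compute capability (FlashAttention needs sm_80 or newer, while V100 is sm_70).

# Illustrative check: FlashAttention needs compute capability >= (8, 0), i.e. Ampere.
# Tesla V100 reports (7, 0), so the flash-attn kernels cannot run on it.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    supported = (major, minor) >= (8, 0)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} sm_{major}{minor}, "
          f"flash-attn supported: {supported}")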

lvhan028 commented 1 month ago

When LMDeploy serves a VLM, inference for the vision part reuses the upstream modeling code.

From the log above, the ViT module is using flash attn. I checked the upstream code:

try:
    try:  # v1
        from flash_attn.flash_attn_interface import \
            flash_attn_unpadded_qkvpacked_func
    except:  # v2
        from flash_attn.flash_attn_interface import \
            flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func

    from flash_attn.bert_padding import pad_input, unpad_input

    has_flash_attn = True
except:
    print('FlashAttention is not installed.')
    has_flash_attn = False

It decides whether to use flash attn based on whether flash attn is present in the environment.

So, on V100, do not install flash-attn; flash-attn does not support the V100 architecture anyway. You can uninstall flash-attn, as @irexyc suggested, and the ViT will then stop using flash attention.

The LLM part, on the other hand, is served by the LMDeploy engine, whose own flash attention implementation does support the V100 architecture.
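
To make the fallback concrete, here is a rough sketch of that gating idea, assuming nothing beyond the standard library: the flag depends only on whether flash_attn is importable, never on the GPU architecture, so uninstalling the package is enough to push the ViT onto the naive attention path.

# Rough sketch of the upstream gate: package availability alone picks the
# attention path; the GPU architecture is not consulted here.
import importlib.util

has_flash_attn = importlib.util.find_spec("flash_attn") is not None
if has_flash_attn:
    print("flash_attn installed: the ViT takes the flash-attn path (fails on V100)")
else:
    print("flash_attn not installed: the ViT falls back to naive attention")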

thesby commented 1 month ago

Thanks. After uninstalling flash_attn the server failed to start, with the error pointing to transformer_engine; after also uninstalling transformer_engine, it started fine.

thesby commented 1 month ago

The server now starts normally, but calling it still raises an error:

Exception in callback <function _raise_exception_on_finish at 0x7f9ff2b05fc0>
handle: <Handle _raise_exception_on_finish>
Traceback (most recent call last):
  le "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 26, in _raise_exception_on_finish
  ise e
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 22, in _raise_exception_on_finish
  sk.result()
File "/opt/conda/envs/python3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58, in run
  sult = self.fn(*self.args, **self.kwargs)
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 151, in forward
  tputs = self.model.forward(inputs)
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
  turn func(*args, **kwargs)
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 172, in forward
  turn self._forward_func(images)
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 153, in _forward_v1_5
  tputs = self.model.extract_feature(outputs)
  Fi "/root/.cache/huggingface/modules/transformers_modules/modeling_internvl_chat.py", line 168, in extract_feature
  t_embeds = self.vision_model(
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  turn self._call_impl(*args, **kwargs)
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  turn forward_call(*args, **kwargs)
  Fi "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 418, in forward
  coder_outputs = self.encoder(
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  turn self._call_impl(*args, **kwargs)
  Fi "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 354, in forward
    layer_outputs = encoder_layer(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 296, in forward
    hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states)) * self.ls1)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 252, in forward
    x = self._naive_attn(hidden_states) if not self.use_flash_attn else self._flash_attn(hidden_states)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_intern_vit.py", line 217, in _naive_attn
    qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
irexyc commented 1 month ago

How much GPU memory is left after the server starts? Please paste the nvidia-smi output taken after startup.

If the remaining memory is small, you can try adding --cache-max-entry-count 0.2 when launching.
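
A quick sketch for inspecting the remaining memory on each GPU after startup, assuming only PyTorch; the numbers should roughly match the free column reported by nvidia-smi.

# Illustrative check: per-GPU free vs. total memory, similar to nvidia-smi.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")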

thesby commented 1 month ago

Recreating a fresh Python environment fixed it.