fong-git commented 1 week ago

Checklist

[ ] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

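I ran inference on InternVL2-2B with Transformers, vLLM, and LMDeploy, with max_num_patch set to 12 in all three. Average latency per sample:

Transformers: 691 ms
vLLM: 308 ms
LMDeploy: 523 ms

Breaking down the vLLM and LMDeploy timings, the ViT stage averages 9 ms in vLLM and 323 ms in LMDeploy. The LMDeploy ViT time was measured inside _get_prompt_input of the VLAsyncEngine class.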
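As a quick sanity check on the numbers reported above, the arithmetic can be worked through as follows. This is only a sketch over the averages quoted in this issue; the dictionary layout and variable names are illustrative, and it assumes the two ViT timings measure comparable stages, which may not hold.

```python
# Average latencies reported in this issue, in milliseconds per sample.
reported = {
    "vLLM": {"total": 308, "vit": 9},
    "LMDeploy": {"total": 523, "vit": 323},
}

breakdown = {}
for name, t in reported.items():
    rest = t["total"] - t["vit"]   # everything that is not the ViT stage
    share = t["vit"] / t["total"]  # fraction of latency spent in the ViT stage
    breakdown[name] = {"rest_ms": rest, "vit_share": share}
    print(f"{name}: ViT {t['vit']} ms ({share:.1%}), rest {rest} ms")

gap_total = reported["LMDeploy"]["total"] - reported["vLLM"]["total"]
gap_vit = reported["LMDeploy"]["vit"] - reported["vLLM"]["vit"]
print(f"total gap {gap_total} ms, ViT gap {gap_vit} ms")
```

Under these figures the ViT gap is larger than the end-to-end gap, so the non-ViT part of LMDeploy's pipeline appears to be faster than vLLM's in this measurement.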

Reproduction

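from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig
from lmdeploy.vl import load_image
import os
import time

# English gloss of the prompt: "Based on the image, determine which category the document
# belongs to. Reply strictly in the following format, with no extra explanation (do not force
# an incorrect category; classify documents that fit no specific category as 'other
# document'): Document category: <the category of this document>"
PROMPT_SYSTEM = """ 根据图片，判断该文档所属的文档类别。请严格按照如下的格式进行回复，不要输出多余的解释（注意不要强行给文档分一个不正确的类别：对于不属于特定类别的文档，判别为‘其他文档’）：文档类别：该文档所属的文档类别 """

# model = 'model/OpenGVLab/InternVL2-1B'
model = 'model/OpenGVLab/InternVL2-2B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, model_format='hf'))

img_path = './cs_function_recommendation_bak/test_data/image'
imgs = os.listdir(img_path)
total_time = 0
vit_time_total = 0

for img in imgs[:100]:
    image = load_image(os.path.join(img_path, img))

    start = time.time()
    response = pipe((PROMPT_SYSTEM, image))
    end = time.time()
    time_ = end - start
    total_time += time_
    # vit_time is not a stock response field; it presumably comes from the local timing
    # added in VLAsyncEngine._get_prompt_input (see "Describe the bug").
    vit_time_total += response.vit_time
    print(response.text, f"\nvit time: {response.vit_time}, total time: {time_}")

print(vit_time_total)
print(total_time)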

Environment

sys.platform: linux
Python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9: NVIDIA A100 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.18.1+cu121
LMDeploy: 0.5.3+aa00ed0
transformers: 4.45.0.dev0
gradio: Not Found
fastapi: 0.115.0
pydantic: 2.9.2
triton: 2.3.1

Error traceback

No response

sjzhou4 commented 4 days ago

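Hi, I noticed a similar issue. From a quick analysis, the extra ViT time in LMDeploy compared with vLLM comes from LMDeploy iterating over the ViT features and moving them from GPU to CPU, i.e. the x.cpu() call shown in the screenshot below. vLLM does not have this step.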

fong-git commented 4 days ago

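Hello, good afternoon! Thank you very much for the answer. After commenting out x.cpu(), ViT processing is still slow on my side, still around 300 ms. How is the inference speed on your side after commenting it out?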


sjzhou4 commented 4 days ago

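After commenting it out, the to-CPU copy time essentially disappears on my side, and in my tests the overall ViT performance of LMDeploy and vLLM is about the same (once the to-CPU copy is removed).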

sjzhou4 commented 4 days ago

However, although this removes the to-CPU time here, a GPU-to-CPU synchronization still happens later; that is how LMDeploy's logic is implemented.

fong-git commented 4 days ago

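> However, although this removes the to-CPU time here, a GPU-to-CPU synchronization still happens later; that is how LMDeploy's logic is implemented.

So even with the to-CPU call commented out, LMDeploy still performs a GPU-to-CPU synchronization later, which means the overall time still won't go down?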

irexyc commented 4 days ago

@fong-git

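I'm not sure how you measured the time. A more accurate approach is to exclude preprocessing time and synchronize the stream before and after the vision model forward. Below are the forward times of two vision models that I measured earlier.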

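LMDeploy does not apply TP to the vision model, so TP brings no benefit to the vision part in LMDeploy. With large batches under TP it will be somewhat slower than vLLM, but vision models are fairly large these days, so GPU memory may not allow such large batches anyway.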
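The measurement approach suggested above (exclude preprocessing, synchronize before and after the vision forward) can be sketched as follows. This is a GPU-free illustration rather than LMDeploy code: FakeAsyncDevice is a hypothetical stand-in that mimics asynchronous kernel launches with a background thread, and timed is a hypothetical helper. With PyTorch on CUDA, the sync argument would be torch.cuda.synchronize.

```python
import threading
import time


class FakeAsyncDevice:
    """Hypothetical stand-in for a GPU stream: work is launched asynchronously."""

    def __init__(self):
        self._pending = []

    def launch(self, seconds):
        # Returns immediately, like a kernel launch; the work runs in the background.
        t = threading.Thread(target=time.sleep, args=(seconds,))
        t.start()
        self._pending.append(t)

    def synchronize(self):
        # Blocks until all launched work has finished.
        for t in self._pending:
            t.join()
        self._pending.clear()


def timed(fn, sync=None):
    """Return wall-clock seconds for fn(), synchronizing before and after if sync is given."""
    if sync is not None:
        sync()
    start = time.perf_counter()
    fn()
    if sync is not None:
        sync()
    return time.perf_counter() - start


device = FakeAsyncDevice()
unsynced = timed(lambda: device.launch(0.05))
device.synchronize()  # drain the work left over from the unsynchronized run
synced = timed(lambda: device.launch(0.05), sync=device.synchronize)
print(f"without sync: {unsynced * 1e3:.1f} ms, with sync: {synced * 1e3:.1f} ms")
```

Without synchronization the timer only captures the launch, so the reported time is far too small; with synchronization on both sides it covers the actual work and nothing queued before it.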

@fong-git @sjzhou4

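Regarding the to-CPU copy: the PyTorch backend previously ran into a problem where, without the to-CPU copy, the resulting image features were incorrect. This may be related to the vision model running in a separate thread (asyncio executor). The to-CPU copy affects the latency of a single request, but it should not affect overall throughput, because it does not block other requests.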
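The latency-versus-throughput point can be illustrated with a small asyncio sketch. Everything here is hypothetical and stands in for the real engine: encode_and_copy simulates the vision forward plus the GPU-to-CPU copy with a sleep, and other_request simulates unrelated work on the event loop. The single-worker executor is an assumption made for the illustration.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def encode_and_copy(image_id):
    # Hypothetical stand-in for the vision forward plus the GPU-to-CPU copy.
    time.sleep(0.05)
    return f"features-{image_id}"


async def other_request(stop):
    # Simulates other work on the event loop, e.g. another request making progress.
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.005)
    return ticks


async def main():
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        ticker = asyncio.create_task(other_request(stop))
        start = time.perf_counter()
        # The blocking work runs in the worker thread; the event loop stays free.
        features = await loop.run_in_executor(pool, encode_and_copy, 0)
        latency = time.perf_counter() - start
        stop.set()
        ticks = await ticker
    return features, latency, ticks


features, latency, ticks = asyncio.run(main())
print(f"{features}: latency {latency * 1e3:.0f} ms, event loop ticks meanwhile: {ticks}")
```

The awaiting request pays the full encode-and-copy time, while the other coroutine keeps ticking during that interval, which is the sense in which the copy adds latency without blocking other requests.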

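Also, I don't think the to-CPU copy can really be dropped. If prefix caching is supported later, a certain number of image features will have to be kept around so that features are not re-extracted during a conversation, and given GPU memory constraints, storing those features in host memory is a reasonable choice.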
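To make the caching idea concrete, here is a minimal sketch of a bounded LRU cache for image features held in host memory. It is not LMDeploy's design: FeatureCache, get_or_compute, and fake_extract are hypothetical names, and keying on a SHA-256 digest of the image bytes is an assumption chosen for the illustration.

```python
import hashlib
from collections import OrderedDict


class FeatureCache:
    """Hypothetical bounded LRU cache for image features kept in host memory."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(image_bytes):
        # Identify an image by the digest of its raw bytes.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes, extract):
        k = self.key(image_bytes)
        if k in self._store:
            self.hits += 1
            self._store.move_to_end(k)  # mark as most recently used
            return self._store[k]
        self.misses += 1
        features = extract(image_bytes)  # in a real system: vision forward + copy to CPU
        self._store[k] = features
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict the least recently used entry
        return features


calls = []


def fake_extract(image_bytes):
    # Stand-in for the vision encoder; records how often it is invoked.
    calls.append(image_bytes)
    return [float(b) for b in image_bytes]


cache = FeatureCache(max_items=2)
for turn in (b"img-a", b"img-a", b"img-b", b"img-c", b"img-a"):
    cache.get_or_compute(turn, fake_extract)
print(cache.hits, cache.misses, len(calls))
```

A repeated image within the cache's capacity skips extraction entirely, while an image that has been evicted is extracted again, which is the trade-off between host memory use and recomputation.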

fong-git commented 3 days ago

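@irexyc I measured the pure GPU compute time for the vision model features, and it is about the same as vLLM. However, the time measured around features = await self.vl_encoder.async_infer(images) in _get_prompt_input of the VLAsyncEngine class is much longer than vLLM's, so the end-to-end inference speed I actually observe is slower than vLLM.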
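The difference between the two measurements described above can be reproduced in miniature. This is a sketch under stated assumptions, not the LMDeploy code path: encode is a hypothetical worker that sleeps for a "compute" phase and then a "copy" phase, and timed_await measures the whole awaited call the way a timer placed around async_infer would.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def encode(compute_s, copy_s):
    # Hypothetical worker: time only the "compute" phase, then do the "copy".
    t0 = time.perf_counter()
    time.sleep(compute_s)
    compute = time.perf_counter() - t0
    time.sleep(copy_s)
    return compute


async def timed_await(pool):
    loop = asyncio.get_running_loop()
    t0 = time.perf_counter()
    compute = await loop.run_in_executor(pool, encode, 0.02, 0.03)
    awaited = time.perf_counter() - t0  # includes thread handoff and the copy
    return compute, awaited


async def main():
    with ThreadPoolExecutor(max_workers=1) as pool:
        return await timed_await(pool)


compute, awaited = asyncio.run(main())
print(f"compute {compute * 1e3:.0f} ms, awaited {awaited * 1e3:.0f} ms")
```

The awaited time is always at least the compute time plus whatever else happens inside the call, so comparing an inner compute timing in one framework against an outer await timing in another can overstate the gap in the model itself.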

Dimensionzw commented 3 days ago

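@fong-git In my tests, with tp=4 for both, LMDeploy is about 500 ms slower than vLLM. The feature inference time is essentially the same; the difference lies in the to-CPU part. vLLM passes the GPU torch tensor directly into the subsequent pipeline: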

def merge_multimodal_embeddings(input_ids: torch.Tensor,
                                inputs_embeds: torch.Tensor,
                                multimodal_embeddings: NestedTensors,
                                placeholder_token_id: int) -> torch.Tensor:
    """
    Merge ``multimodal_embeddings`` into ``inputs_embeds`` by overwriting the
    positions in ``inputs_embeds`` corresponding to placeholder tokens in
    ``input_ids``.

    Note:
        This updates ``inputs_embeds`` in place.
    """
    mask = (input_ids == placeholder_token_id)
    num_expected_tokens = mask.sum().item()
    assert isinstance(num_expected_tokens, int)

    flattened = _flatten_embeddings(multimodal_embeddings)
    if flattened.shape[0] != num_expected_tokens:
        expr = _embedding_count_expression(multimodal_embeddings)
        raise ValueError(
            f"Attempted to assign {expr} = {flattened.shape[0]} "
            f"multimodal tokens to {num_expected_tokens} placeholders")

    inputs_embeds[mask] = flattened
    return inputs_embeds

sjzhou4 commented 3 days ago

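@irexyc @fong-git Yes, the to-CPU copy in LMDeploy cannot be dropped. As @Dimensionzw pointed out, LMDeploy and vLLM differ in architecture: LMDeploy does more of the work in the upper layer, such as template concatenation and feature extraction, and then passes these inputs to the TurboMind backend for processing. And as @irexyc said, later features such as prefix caching may store feature data in host memory or even on disk to reduce GPU memory usage, and all of these require the to-CPU operation.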

InternLM / lmdeploy

[Bug] InternVL2-2B inference is slow; visual feature extraction turns out to take a long time #2604
