InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Support internvl-chat for pytorch engine #1797

Closed RunningLeon closed 3 months ago

RunningLeon commented 3 months ago

Motivation

Support InternVL-Chat models for the PyTorch engine, as requested in issue https://github.com/InternLM/lmdeploy/issues/1794.

Note

Required PRs

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

None

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
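For reference, a minimal sketch of how an InternVL-Chat model could be driven through the PyTorch engine with the pipeline API (the checkpoint name is illustrative; the calls mirror the test script later in this thread):

from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

# Illustrative checkpoint; any InternVL-Chat model covered by this PR
# should work with the PyTorch engine.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5',
                backend_config=PytorchEngineConfig())

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)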

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.
RunningLeon commented 3 months ago

@grimoire

performance on llama3-8b

cmd

python benchmark/profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json /nvme/shared_data/llama3/Meta-Llama-3-8B-Instruct --backend pytorch -n 3000

main branch

concurrency: 256
elapsed_time: 199.272s

first token latency(s)(min, max, ave): 1.238, 11.655, 2.881
per-token latency(s) percentile(50, 75, 95, 99): [0.04, 0.042, 0.278, 0.34]

number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 3074.616 token/s
token throughput (prompt + completion token): 6470.873 token/s
RPS (request per second): 15.055 req/s
RPM (request per minute): 903.288 req/min
--------------------------------------------------

this PR

concurrency: 256
elapsed_time: 205.748s

first token latency(s)(min, max, ave): 1.033, 11.620, 2.934
per-token latency(s) percentile(50, 75, 95, 99): [0.041, 0.044, 0.279, 0.343]

number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 2977.838 token/s
token throughput (prompt + completion token): 6267.192 token/s
RPS (request per second): 14.581 req/s
RPM (request per minute): 874.855 req/min
--------------------------------------------------
lvhan028 commented 3 months ago
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 60.00 MiB. GPU 5 has a total capacty of 79.14 GiB of which 5.19 MiB is free. Process 151190 has 78.15 GiB memory in use. Including non-PyTorch memory, this process has 992.00 MiB memory in use. Of the allocated memory 489.86 MiB is allocated by PyTorch, and 12.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
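
A common workaround for this kind of OOM is to reserve a smaller fraction of free GPU memory for the KV cache; a minimal sketch, assuming PytorchEngineConfig.cache_max_entry_count behaves the same way as the --cache_max_entry_count 0.5 flag used in the next comment:

from lmdeploy import pipeline, PytorchEngineConfig

# Assumption: cache_max_entry_count is the fraction of free GPU memory reserved
# for the KV cache; 0.5 mirrors the --cache_max_entry_count 0.5 flag used in
# the test command below.
engine_config = PytorchEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('/nvme/shared/InternVL-Chat-V1-5', backend_config=engine_config)
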
lvhan028 commented 3 months ago

My test code is as follows, and the test command is python examples/workspace/test_vl_pipeline.py /nvme/shared/InternVL-Chat-V1-5 --backend pytorch --cache_max_entry_count 0.5

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig
from lmdeploy.vl import load_image
import fire
import time

def apply_turbomind(model_path, **kwargs):
    engine_config = TurbomindEngineConfig.from_dict({}, allow_none=True)

    for key, value in kwargs.items():
        if hasattr(TurbomindEngineConfig, key):
            setattr(engine_config, key, value)

    pipe = pipeline(model_path, backend_config=engine_config)

    return pipe

def apply_pytorch(model_path, **kwargs):
    engine_config = PytorchEngineConfig()
    for key, value in kwargs.items():
        print(key, value)
        if hasattr(PytorchEngineConfig, key):
            setattr(engine_config, key, value)

    pipe = pipeline(model_path, backend_config=engine_config)

    return pipe

def main(model_path, backend, **kwargs):
    start = time.perf_counter()
    if backend == 'turbomind':
        pipe = apply_turbomind(model_path, **kwargs)
    else:
        pipe = apply_pytorch(model_path, **kwargs)
    end = time.perf_counter()
    print(f'building pipeline cost: {end - start} s')

    input('press any key to continue')
    image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
    response = pipe(('describe this image', image))
    print(response)
    input('press any key to exit')

if __name__ == "__main__":
    fire.Fire(main)
xin0623 commented 2 months ago

I'm using lmdeploy-5.0.0+cu118-cp39. Is this version supported? Why do I still get the same error?

lvhan028 commented 2 months ago

This PR was just merged into the main branch but hasn't been released yet. We are going to release v0.5.1 around July 15th.

v3ss0n commented 2 months ago

I built lmdeploy from this PR, and when running inference I got this error:

Mini-InternVL-Chat-4B-V1-5

AssertionError: turbomind does not support /home/nozander/.cache/huggingface/hub/models--OpenGVLab--Mini-InternVL-Chat-4B-V1-5/snapshots/920f7428f246dbb460a1d90c4a3ee9dc696158cc. Plz try pytorch engine instead.

The PyTorch engine doesn't work either.

RunningLeon commented 2 months ago

@v3ss0n hi, please run lmdeploy check_env to check your environment info, and post sample code to reproduce the issue.
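
A minimal repro along these lines would be enough (the model name is taken from the comment above; the prompt is arbitrary):

from lmdeploy import pipeline, PytorchEngineConfig

# Model name taken from the earlier comment; the prompt is arbitrary.
pipe = pipeline('OpenGVLab/Mini-InternVL-Chat-4B-V1-5',
                backend_config=PytorchEngineConfig())
print(pipe('describe a tiger'))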

v3ss0n commented 2 months ago
sys.platform: linux
Python: 3.12.4 (main, Jun  7 2024, 06:33:07) [GCC 14.1.1 20240522]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: Quadro P5000
CUDA_HOME: /opt/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.40
GCC: gcc (GCC) 14.1.1 20240522
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.0+
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology:
    GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-7 0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
zhyncs commented 2 months ago

ref https://github.com/zhyncs/lmdeploy-build/releases/tag/9f3e748

v3ss0n commented 2 months ago

ok gonna try that build

v3ss0n commented 2 months ago
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32
zsh: IOT instruction (core dumped)  lmdeploy serve api_server OpenGVLab/Mini-InternVL-Chat-4B-V1-5
v3ss0n commented 2 months ago

I tried calling the API via api_server, and then it core dumped.
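
A typical client call against api_server looks something like the sketch below; it assumes the default OpenAI-compatible endpoint on port 23333 and that the model name matches what the server reports for the served checkpoint.

from openai import OpenAI

# Sketch: lmdeploy serve api_server exposes an OpenAI-compatible endpoint,
# by default at http://0.0.0.0:23333/v1; the model name should match whatever
# GET /v1/models reports for the served checkpoint.
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
resp = client.chat.completions.create(
    model='OpenGVLab/Mini-InternVL-Chat-4B-V1-5',
    messages=[{'role': 'user', 'content': 'describe a tiger'}],
)
print(resp.choices[0].message.content)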