@grimoire
cmd
python benchmark/profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json /nvme/shared_data/llama3/Meta-Llama-3-8B-Instruct --backend pytorch -n 3000
concurrency: 256
elapsed_time: 199.272s
first token latency(s)(min, max, ave): 1.238, 11.655, 2.881
per-token latency(s) percentile(50, 75, 95, 99): [0.04, 0.042, 0.278, 0.34]
number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 3074.616 token/s
token throughput (prompt + completion token): 6470.873 token/s
RPS (request per second): 15.055 req/s
RPM (request per minute): 903.288 req/min
--------------------------------------------------
concurrency: 256
elapsed_time: 205.748s
first token latency(s)(min, max, ave): 1.033, 11.620, 2.934
per-token latency(s) percentile(50, 75, 95, 99): [0.041, 0.044, 0.279, 0.343]
number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 2977.838 token/s
token throughput (prompt + completion token): 6267.192 token/s
RPS (request per second): 14.581 req/s
RPM (request per minute): 874.855 req/min
--------------------------------------------------
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 60.00 MiB. GPU 5 has a total capacty of 79.14 GiB of which 5.19 MiB is free. Process 151190 has 78.15 GiB memory in use. Including non-PyTorch memory, this process has 992.00 MiB memory in use. Of the allocated memory 489.86 MiB is allocated by PyTorch, and 12.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
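If anyone hits this OOM: two mitigations worth trying (a sketch, values illustrative and not tuned) are lowering the PyTorch engine's KV-cache fraction via cache_max_entry_count, and setting PYTORCH_CUDA_ALLOC_CONF as the error message suggests, before torch is imported:

import os

# Must be set before torch/lmdeploy are imported to take effect;
# max_split_size_mb mitigates fragmentation, per the error above.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

from lmdeploy import pipeline, PytorchEngineConfig

# Give the KV cache a smaller fraction of free GPU memory
# (0.4 here is illustrative).
engine_config = PytorchEngineConfig(cache_max_entry_count=0.4)
pipe = pipeline('/nvme/shared_data/llama3/Meta-Llama-3-8B-Instruct',
                backend_config=engine_config)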
My test code is as follows, and the test command is:
python examples/workspace/test_vl_pipeline.py /nvme/shared/InternVL-Chat-V1-5 --backend pytorch --cache_max_entry_count 0.5
from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig
from lmdeploy.vl import load_image
import fire
import time


def apply_turbomind(model_path, **kwargs):
    engine_config = TurbomindEngineConfig.from_dict({}, allow_none=True)
    # Copy any recognized TurboMind options from the CLI kwargs.
    for key, value in kwargs.items():
        if hasattr(TurbomindEngineConfig, key):
            setattr(engine_config, key, value)
    pipe = pipeline(model_path, backend_config=engine_config)
    return pipe


def apply_pytorch(model_path, **kwargs):
    engine_config = PytorchEngineConfig()
    # Copy any recognized PyTorch-engine options from the CLI kwargs.
    for key, value in kwargs.items():
        print(key, value)
        if hasattr(PytorchEngineConfig, key):
            setattr(engine_config, key, value)
    pipe = pipeline(model_path, backend_config=engine_config)
    return pipe


def main(model_path, backend, **kwargs):
    start = time.perf_counter()
    if backend == 'turbomind':
        pipe = apply_turbomind(model_path, **kwargs)
    else:
        pipe = apply_pytorch(model_path, **kwargs)
    end = time.perf_counter()
    print(f'building pipeline cost: {end - start} s')
    input('press any key to continue')
    image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
    response = pipe(('describe this image', image))
    print(response)
    input('press any key to exit')


if __name__ == "__main__":
    fire.Fire(main)
lmdeploy-0.5.0+cu118-cp39: is this version supported? Why do I still get the same error?
This PR was just merged into the main branch but hasn't been released yet. We are going to release v0.5.1 around July 15th.
I built with this PR, and when running inference I got this error:
Mini-InternVL-Chat-4B-V1-5
AssertionError: turbomind does not support /home/nozander/.cache/huggingface/hub/models--OpenGVLab--Mini-InternVL-Chat-4B-V1-5/snapshots/920f7428f246dbb460a1d90c4a3ee9dc696158cc. Plz try pytorch engine instead.
The PyTorch engine doesn't work either.
@v3ss0n Hi, please use lmdeploy check_env to check your environment info, and post sample code to reproduce the issue.
sys.platform: linux
Python: 3.12.4 (main, Jun 7 2024, 06:33:07) [GCC 14.1.1 20240522]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: Quadro P5000
CUDA_HOME: /opt/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.40
GCC: gcc (GCC) 14.1.1 20240522
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.17.2+cu121
LMDeploy: 0.5.0+
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-7 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ok gonna try that build
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32
zsh: IOT instruction (core dumped) lmdeploy serve api_server OpenGVLab/Mini-InternVL-Chat-4B-V1-5
I tried calling the API via api_server, and then it core-dumped.
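For reference, the Quadro P5000 is a Pascal GPU (compute capability 6.1), while Triton, which the PyTorch engine's kernels are built on, targets compute capability 7.0 or newer; that mismatch is the usual cause of this shfl.sync lowering error. A quick check (sketch):

import torch

# Triton officially targets compute capability >= 7.0 (Volta or newer);
# Pascal cards like the P5000 report 6.1.
major, minor = torch.cuda.get_device_capability(0)
print(f'GPU 0 compute capability: {major}.{minor}')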
Motivation
Support InternVL-Chat models in the PyTorch engine, as requested in issue https://github.com/InternLM/lmdeploy/issues/1794.
Note
Required PRs:
#1641
#1825
Modification
Add support for InternVL-Chat models to the PyTorch engine; see the required PRs above for the prerequisite changes.
BC-breaking (Optional)
None
Use cases (Optional)
Run InternVL-Chat models with the PyTorch backend via the pipeline API; a sketch is shown below.
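A minimal sketch (model id and image URL taken from this thread; the hub id is assumed to resolve):

from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

# Select the PyTorch engine explicitly; turbomind does not support these models.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5',
                backend_config=PytorchEngineConfig())

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)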