InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

LoRA adapters only work with the PyTorch engine. #1582

Closed bks5881 closed 5 months ago

bks5881 commented 6 months ago


Describe the bug

I would like to launch an OpenAI-compatible endpoint with LoRA adapters, but I want to use the TurbomindEngine rather than the PytorchEngine, since inference with the latter is very slow.

Reproduction

lmdeploy serve api_server v2ray/Llama-3-70B-Instruct --tp 4 --server-port 40047 --server-name 0.0.0.0 --adapters /home/user/project/trained_lora
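
(For context: the PyTorch engine is currently the only backend that loads the adapters. Below is a minimal sketch of that path through the Python API; the adapter name `mylora` is a placeholder and the exact config fields may vary between lmdeploy versions. On the CLI, the equivalent should be adding `--backend pytorch` to the command above.)

```python
# Minimal sketch: serving a LoRA adapter via the PyTorch engine's Python API.
# `mylora` is a placeholder name; the model and adapter path are taken from the
# reproduction command above.
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    tp=4,
    adapters=dict(mylora='/home/user/project/trained_lora'),  # adapter name -> path
)
pipe = pipeline('v2ray/Llama-3-70B-Instruct', backend_config=backend_config)

# Route a request to the adapter by the name registered above.
resp = pipe(
    ['Hello, who are you?'],
    gen_config=GenerationConfig(adapter_name='mylora', max_new_tokens=128),
)
print(resp[0].text)
```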

Environment

sys.platform: linux
Python: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA A100 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.2+cu121
LMDeploy: 0.4.1+
transformers: 4.40.2
gradio: 4.16.0
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.1.0

Error traceback

https://github.com/InternLM/lmdeploy/blob/a27dac3529dc5e1f8bedb4fa1c00a45bea2644fb/lmdeploy/cli/serve.py#L277 Should support adapters.
lvhan028 commented 6 months ago

The TurboMind engine doesn't support S-LoRA.

lzhangzz commented 6 months ago

How many adapters do you need for 1 server instance?

bks5881 commented 6 months ago

I would say, to begin with, at least 1?

lvhan028 commented 6 months ago

Reopening it for further discussion.

lzhangzz commented 5 months ago

> I would say, to begin with, at least 1?

If only 1 adapter is needed, you can simply merge it into the original model, and this will give you the fastest speed.

In the near future (maybe in June), TurboMind is going to support the simpler case where storing all adapters in VRAM is acceptable.
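
For the single-adapter case, here is a rough sketch of the merge route with peft (paths taken from the reproduction above; the output directory name is arbitrary). The merged folder can then be served with the TurboMind backend as an ordinary model.

```python
# Rough sketch: merge a single LoRA adapter into the base weights with peft so the
# result can be served by TurboMind as a plain model. Note that merging a 70B model
# this way needs a large amount of CPU RAM. Paths mirror the reproduction command;
# the output directory name is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = 'v2ray/Llama-3-70B-Instruct'
adapter_dir = '/home/user/project/trained_lora'
out_dir = './llama3-70b-lora-merged'

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
merged.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)

# Afterwards, e.g.: lmdeploy serve api_server ./llama3-70b-lora-merged --tp 4
```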

kratorado commented 5 months ago

> > I would say, to begin with, at least 1?
>
> If only 1 adapter is needed, you can simply merge it into the original model, and this will give you the fastest speed.
>
> In the near future (maybe in June), TurboMind is going to support the simpler case where storing all adapters in VRAM is acceptable.

Sometimes we need an adapter for certain specialized tasks while the base model (Llama/InternLM/Qwen, etc.) should still handle the common work. Merging a LoRA adapter sometimes makes the model better only at the new tasks. So this feature would be very helpful.
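
If the api_server ends up exposing each adapter under the name passed to --adapters (that is my reading of the PyTorch-engine S-LoRA docs; the names returned by GET /v1/models would confirm it), routing between the base model and an adapter from a client could look roughly like this:

```python
# Hedged sketch: querying an lmdeploy OpenAI-compatible server, assuming an adapter
# registered via `--adapters mylora=/home/user/project/trained_lora` is selectable
# through the request's `model` field (verify the exposed names via GET /v1/models).
from openai import OpenAI

client = OpenAI(base_url='http://localhost:40047/v1', api_key='none')

# Common work: request against the base model.
base_reply = client.chat.completions.create(
    model='v2ray/Llama-3-70B-Instruct',
    messages=[{'role': 'user', 'content': 'Summarize this paragraph: ...'}],
)

# Specialized work: request routed to the LoRA adapter.
lora_reply = client.chat.completions.create(
    model='mylora',
    messages=[{'role': 'user', 'content': 'Run the fine-tuned task on: ...'}],
)

print(base_reply.choices[0].message.content)
print(lora_reply.choices[0].message.content)
```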

bks5881 commented 5 months ago

Well, ideally I want to avoid merging weights as the

bks5881 commented 5 months ago

Ideally, I want to have 5-5 LoRA adapters without merging the weights.