InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] 无法使用双卡的显存来共同加载一个模型 #2049

Closed · keakon closed this 1 month ago

keakon commented 1 month ago


Describe the bug

I want to deploy the Qwen/Qwen2-72B-Instruct-AWQ model on two RTX 4090s. Ollama can run it with roughly 40 GB of VRAM and vLLM needs a little over 44 GB, but with lmdeploy there seems to be no way to make it fit.

Reproduction

I have already tried to set every parameter to its minimum:

lmdeploy serve api_server Qwen/Qwen2-72B-Instruct-AWQ --model-format awq --tp 2 --max-batch-size 1 --cache-max-entry-count 0.01 --session-len 1 --cache-block-seq-len 1 --quant-policy 4

Environment

sys.platform: linux
Python: 3.11.9 (main, May  4 2024, 13:19:27) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.0+
transformers: 4.42.4
gradio: 3.50.2
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.2.0

Error traceback

Fetching 21 files: 100%|███████████████████████████████████████████████| 21/21 [00:00<00:00, 14030.01it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Convert to turbomind format:   0%|                                                 | 0/80 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ps/.pyenv/versions/lmdeploy/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/cli/entrypoint.py", line 43, in run
    args.run(args)
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/cli/serve.py", line 310, in api_server
    run_api_server(args.model_path,
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/serve/openai/api_server.py", line 1197, in serve
    VariableInterface.async_engine = pipeline_class(
                                     ^^^^^^^^^^^^^^^
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/serve/async_engine.py", line 189, in __init__
    self._build_turbomind(model_path=model_path,
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/serve/async_engine.py", line 234, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py", line 342, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py", line 144, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py", line 259, in _from_hf
    output_model.export()
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 283, in export
    self.export_misc(bin)
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 314, in export_misc
    self.export_weight(emb, 'tok_embeddings.weight')
  File "/home/ps/.pyenv/versions/3.11.9/envs/lmdeploy/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 229, in export_weight
    torch_tensor = param.cuda().contiguous()
                   ^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB. GPU 1 has a total capacity of 23.65 GiB of which 1011.56 MiB is free. Including non-PyTorch memory, this process has 22.64 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
zhyncs commented 1 month ago

Why do you want to set the cache-max-entry-count to 0.01?

keakon commented 1 month ago

> Why do you want to set the cache-max-entry-count to 0.01?

To save memory. The weights take about 90% of the memory, so there isn't another 50% left over for the cache.
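
As a rough back-of-envelope check (the 40 GB weight figure comes from the Ollama comparison in the issue description; the per-GPU split is an assumption, not a measurement):

```python
# Rough memory budget for Qwen2-72B-Instruct-AWQ on 2 x RTX 4090 with tp=2.
# All numbers are approximations for illustration only.
total_per_gpu_gib = 24                        # RTX 4090
weights_total_gib = 40                        # ~4-bit AWQ weights, per the Ollama figure above
weights_per_gpu_gib = weights_total_gib / 2   # split across the two tensor-parallel ranks

free_per_gpu_gib = total_per_gpu_gib - weights_per_gpu_gib
print(f"weights use ~{weights_per_gpu_gib / total_per_gpu_gib:.0%} of each card, "
      f"leaving ~{free_per_gpu_gib:.0f} GiB per GPU for k/v cache, activations and runtime overhead")
```

With only a few GiB per card left after the weights, a large default cache fraction has nowhere to go, hence the tiny --cache-max-entry-count.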

keakon commented 1 month ago

I found a solution from a video:

lmdeploy convert --model-format awq --group-size 128 --tp 2 qwen Qwen/Qwen2-72B-Instruct-AWQ

It's weird that lmdeploy convert is not documented.
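
For anyone preferring the Python API, the converted workspace can also be loaded through lmdeploy's pipeline. A minimal sketch, assuming the 0.5.x `pipeline` / `TurbomindEngineConfig` interface; the values simply mirror the CLI flags above:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Load the workspace produced by `lmdeploy convert ... --tp 2`.
# tp here must match the --tp value used during conversion.
backend_config = TurbomindEngineConfig(model_format='awq', tp=2)
pipe = pipeline('./workspace', backend_config=backend_config)

print(pipe(['Who are you?']))
```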

zhyncs commented 1 month ago

> It's weird that lmdeploy convert is not documented.

https://lmdeploy.readthedocs.io/en/latest/inference/load_hf.html#a-model-converted-by-lmdeploy-convert

keakon commented 1 month ago

> It's weird that lmdeploy convert is not documented.
>
> https://lmdeploy.readthedocs.io/en/latest/inference/load_hf.html#a-model-converted-by-lmdeploy-convert

However, it doesn't mention that the model has to be converted with --tp 2 up front. Other LLM serving engines can serve the unconverted model directly with --tp 2.

keakon commented 1 month ago

I tried this command to achieve maximum performance, and it is about 25% faster than vLLM.

lmdeploy serve api_server ./workspace --model-format awq --tp 2 --max-batch-size 32 --session-len 2048 --cache-max-entry-count 0.45 --quant-policy 8 --enable-prefix-caching

Using "--quant-policy 4" is about 30% faster than vLLM, but I haven't test its accuracy loss.

The disadvantage is the lack of an eager mode. To serve long contexts such as 8192 tokens, I have to reduce "--cache-max-entry-count" to 0.2, which results in poorer performance (roughly on par with vLLM). vLLM has a "--max-seq-len-to-capture" option, so when handling long contexts it can fall back to eager mode to avoid OOM.
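
To put rough numbers on that tradeoff, here is a back-of-envelope estimate of the k/v-cache footprint per token, assuming Qwen2-72B's published configuration of 80 layers, 8 KV heads and a head dimension of 128; treat the results as illustrative only:

```python
# Approximate k/v-cache size per token for Qwen2-72B (grouped-query attention).
layers, kv_heads, head_dim = 80, 8, 128   # from the published model config (assumed here)

def kv_bytes_per_token(bytes_per_elem: float) -> float:
    # K and V tensors, across all layers and KV heads.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

for name, b in [('fp16 (quant-policy 0)', 2), ('int8 (quant-policy 8)', 1), ('int4 (quant-policy 4)', 0.5)]:
    per_token_kib = kv_bytes_per_token(b) / 1024
    per_8k_gib = kv_bytes_per_token(b) * 8192 / 1024**3   # split across both GPUs with --tp 2
    print(f"{name}: ~{per_token_kib:.0f} KiB/token, ~{per_8k_gib:.2f} GiB for an 8192-token session")
```

Under these assumptions an int4 cache needs roughly a quarter of the memory of fp16 for the same context length, which is consistent with having to trade --cache-max-entry-count against session length when the cache is kept at higher precision.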

keakon commented 1 month ago

My initial human evaluation shows that Q8 (quant-policy 8) and Q0 (no KV-cache quantization) are almost identical in accuracy, while Q4 is slightly worse. However, when processing texts of around 1k tokens, Q4 is approximately 30% faster than Q8.

zhyncs commented 1 month ago

I will be optimizing the accuracy of KV Cache Int4 soon, please stay tuned.

github-actions[bot] commented 1 month ago

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

github-actions[bot] commented 1 month ago

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.