InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Cannot deploy qwenvl model #1375

Closed. PredyDaddy closed this issue 5 months ago.

PredyDaddy commented 5 months ago


Describe the bug

I tried to deploy Qwen-VL. I entered the docker image openmmlab/lmdeploy-builder:cuda11.8 and ran the following:

export LMDEPLOY_VERSION=0.2.0
export PYTHON_VERSION=311
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl

Then I ran the following command:

lmdeploy serve api_server /home/qwen/Qwen-VL-Chat --server-port 8000

and got this error:

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/qwen/Qwen-VL-Chat'. Use `repo_type` argument if needed.

I checked a similar issue and tried to convert the model:

lmdeploy convert Qwen-VL /home/qwen/Qwen-VL-Chat --dst-path ./Qwen-VL

which gave the following error:

AssertionError: 'Qwen-VL' is not supported. The supported models are: dict_keys(['base', 'llama', 'internlm', 'vicuna', 'wizardlm', 'internlm-chat-7b', 'internlm-chat', 'internlm-chat-7b-8k', 'internlm-chat-20b', 'internlm-20b', 'internlm2-7b', 'internlm2-20b', 'internlm2-chat-7b', 'internlm2-chat-20b', 'baichuan-7b', 'baichuan2-7b', 'puyu', 'llama2', 'llama-2', 'llama-2-chat', 'qwen-7b', 'qwen-14b', 'codellama', 'falcon', 'chatglm2-6b', 'solar', 'solar-70b', 'ultralm', 'ultracm', 'yi', 'yi-chat', 'yi-200k', 'yi-34b'])
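(As an aside, a hedged way to double-check which model names the installed lmdeploy build actually recognizes is its list subcommand, assuming this version exposes it:)

```bash
# List the model / chat-template names known to the installed lmdeploy build
# (assumes the `list` subcommand exists in this version).
lmdeploy list
```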

Reproduction

The same steps and errors as described in the bug description above.

Environment

(base) [root@71de32d41137 lmdeploy]# lmdeploy check_env
/opt/conda/lib/python3.11/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
sys.platform: linux
Python: 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

LMDeploy: 0.2.0+b92105b
transformers: 4.39.2
gradio: 4.24.0
fastapi: 0.110.0
pydantic: 2.6.4

Error traceback

export LMDEPLOY_VERSION=0.2.0
export PYTHON_VERSION=311
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl

lmdeploy serve api_server /home/qwen/Qwen-VL-Chat --server-port 8000

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/qwen/Qwen-VL-Chat'. Use `repo_type` argument if needed.

lmdeploy convert Qwen-VL /home/qwen/Qwen-VL-Chat --dst-path ./Qwen-VL

AssertionError: 'Qwen-VL' is not supported. The supported models are: dict_keys(['base', 'llama', 'internlm', 'vicuna', 'wizardlm', 'internlm-chat-7b', 'internlm-chat', 'internlm-chat-7b-8k', 'internlm-chat-20b', 'internlm-20b', 'internlm2-7b', 'internlm2-20b', 'internlm2-chat-7b', 'internlm2-chat-20b', 'baichuan-7b', 'baichuan2-7b', 'puyu', 'llama2', 'llama-2', 'llama-2-chat', 'qwen-7b', 'qwen-14b', 'codellama', 'falcon', 'chatglm2-6b', 'solar', 'solar-70b', 'ultralm', 'ultracm', 'yi', 'yi-chat', 'yi-200k', 'yi-34b'])
lvhan028 commented 5 months ago

There is no need to do the conversion anymore.

lvhan028 commented 5 months ago

Can you paste the result of `ll /home/qwen/Qwen-VL-Chat`? One more thing, is "/home/qwen/Qwen-VL-Chat" loaded by transformers successfully?

PredyDaddy commented 5 months ago

Can you paste the result of `ll /home/qwen/Qwen-VL-Chat`? One more thing, is "/home/qwen/Qwen-VL-Chat" loaded by transformers successfully?

Thanks for the reply. I tried the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("/home/qwen/Qwen-VL-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("/home/qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("/home/qwen/Qwen-VL-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': '/home/Qwen-VL/1.jpg'}, # Either a local path or an url
    {'text': '可以描述这个图片吗?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

This ran successfully, so the weights should be correct.

PredyDaddy commented 5 months ago

Can you paste the result of `ll /home/qwen/Qwen-VL-Chat`? One more thing, is "/home/qwen/Qwen-VL-Chat" loaded by transformers successfully?

Hi, I got the Qwen-VL demo running in a Python virtual environment, so I have confirmed that the weights are usable. However, the openai_api script in the Qwen-VL repository does not support image input, so I wanted to use your repository to deploy an API that accepts images, and then I ran into the problem above.

PredyDaddy commented 5 months ago

Can you paste the result of `ll /home/qwen/Qwen-VL-Chat`? One more thing, is "/home/qwen/Qwen-VL-Chat" loaded by transformers successfully?

I also wondered whether it was an environment problem, so I went into the container and ran the following command there:

lmdeploy serve api_server /app/qwen/Qwen-VL-Chat --server-port 8000

The error inside the container looks like this:

(base) [root@71de32d41137 lmdeploy]# lmdeploy serve api_server /app/qwen/Qwen-VL-Chat --server-port 8000

/opt/conda/lib/python3.11/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
model_source: ModelSource.HF_MODEL
04/01 08:05:57 - turbomind - WARNING - kwargs model_name is deprecated to initialize model, use TurbomindEngineConfig instead.
Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/cli/entrypoint.py", line 18, in run
    args.run(args)
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/cli/serve.py", line 237, in api_server
    run_api_server(args.model_path,
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/serve/openai/api_server.py", line 962, in serve
    VariableInterface.async_engine = AsyncEngine(
                                     ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/serve/async_engine.py", line 67, in __init__
    self._build_turbomind(model_path=model_path,
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/serve/async_engine.py", line 108, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py", line 426, in from_pretrained
    return cls(model_path=local_path,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py", line 182, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py", line 296, in _from_hf
    output_model = OUTPUT_MODELS.get(output_format)(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/fp.py", line 26, in __init__
    super().__init__(input_model, cfg, to_file, out_dir)
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 141, in __init__
    self.cfg = self.get_config(cfg)
               ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/fp.py", line 30, in get_config
    final_cfg = super().get_config(cfg).__dict__
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 157, in get_config
    final_cfg.update(self.input_model.model_info())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 185, in model_info
    norm_eps = model_arg['rms_norm_eps']
               ~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'rms_norm_eps'
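
(A hedged way to see why this key is missing is to inspect the model's HF config directly; the path below is the one from the serve command above:)

```bash
# Hedged check: does the Qwen-VL-Chat config.json define the key the converter expects?
grep -o '"rms_norm_eps"[^,}]*' /app/qwen/Qwen-VL-Chat/config.json \
  || echo "rms_norm_eps not present in config.json"
```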

Then I read your answer in #1008 and added the following to the command:

lmdeploy serve api_server /app/qwen/Qwen-VL-Chat --server-port 8000 --model-name qwen-7b

and the following error appeared:

(base) [root@71de32d41137 lmdeploy]# lmdeploy serve api_server /app/qwen/Qwen-VL-Chat --server-port 8000 --model-name qwen-7b
/opt/conda/lib/python3.11/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
model_source: ModelSource.HF_MODEL
model_config:

[llama]
model_name = qwen-7b
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 151936
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 1
start_id = 0
end_id = 151643
session_len = 8200
weight_type = bf16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
use_logn_attn = 1

get 323 model params
Traceback (most recent call last):                                                                 
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/cli/entrypoint.py", line 18, in run
    args.run(args)
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/cli/serve.py", line 237, in api_server
    run_api_server(args.model_path,
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/serve/openai/api_server.py", line 962, in serve
    VariableInterface.async_engine = AsyncEngine(
                                     ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/serve/async_engine.py", line 67, in __init__
    self._build_turbomind(model_path=model_path,
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/serve/async_engine.py", line 108, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py", line 426, in from_pretrained
    return cls(model_path=local_path,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in __init__
    self.stop_words = _stop_words(self.model.stop_words, self.tokenizer)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/utils.py", line 111, in _stop_words
    stop_indexes += tokenizer.indexes_containing_token(stop_word)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/tokenizer.py", line 380, in indexes_containing_token
    return self.model.indexes_containing_token(token)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/tokenizer.py", line 235, in indexes_containing_token
    self.token2id = {
                    ^
  File "/opt/conda/lib/python3.11/site-packages/lmdeploy/tokenizer.py", line 236, in <dictcomp>
    self.model.decode(i): i
    ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3825, in decode
    return self._decode(
           ^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-VL-Chat/tokenization_qwen.py", line 354, in _decode
    token_ids = _replace_closed_tag(token_ids, self.img_start_id, self.img_end_id, _decode_imgurl)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-VL-Chat/tokenization_qwen.py", line 94, in _replace_closed_tag
    raise ValueError("Unclosed image token")
ValueError: Unclosed image token
lvhan028 commented 5 months ago

Please upgrade lmdeploy to v0.2.6.

Serving VL models is supported since v0.2.6.

lvhan028 commented 5 months ago

The serving command is the same as the one you used before:

lmdeploy serve api_server /home/qwen/Qwen-VL-Chat --server-port 8000

PredyDaddy commented 5 months ago

Hi, I installed 0.2.6 both inside the container and on my host machine using the command below, and in both cases I confirmed with pip list that the lmdeploy version is 0.2.6.

(screenshot of the install command)

But both report the following error:

RuntimeError: [TM][ERROR]  Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384 
lvhan028 commented 5 months ago

In your environment the CUDA runtime is 12.1, while lmdeploy's PyPI package depends on CUDA 11.8. To install lmdeploy in a CUDA 12+ environment, you need to get the whl package from the release note attachments. Detailed instructions are in the installation section of the README.
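
(For illustration, a hedged sketch of that route; the wheel filename below is the one mentioned later in this thread, and the exact asset name should be taken from the v0.2.6 release page:)

```bash
# Hedged sketch: download the wheel that matches your Python/CUDA version from the
# GitHub release assets, then install the local file (exact filename may differ).
pip install ./lmdeploy-0.2.6-cp311-cp311-manylinux2014_x86_64.whl
```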

How did you install it?

PredyDaddy commented 5 months ago

In your environment the CUDA runtime is 12.1, while lmdeploy's PyPI package depends on CUDA 11.8. To install lmdeploy in a CUDA 12+ environment, you need to get the whl package from the release note attachments. Detailed instructions are in the installation section of the README.

How did you install it?

Hi, I installed it with pip install, specifying the version, and it automatically installed the CUDA 12 runtime into the environment for me, even though my image already has the CUDA 11.8 runtime:

(base) [root@71de32d41137 lmdeploy]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
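
(A quick, hedged way to confirm which CUDA runtime the pip-installed PyTorch wheel bundles, as opposed to the toolkit version nvcc reports:)

```bash
# Print the CUDA version the installed PyTorch wheel was built against
# (the runtime pip pulled in, independent of the system nvcc).
python -c "import torch; print(torch.version.cuda)"
```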
PredyDaddy commented 5 months ago

In your environment the CUDA runtime is 12.1, while lmdeploy's PyPI package depends on CUDA 11.8. To install lmdeploy in a CUDA 12+ environment, you need to get the whl package from the release note attachments. Detailed instructions are in the installation section of the README. How did you install it?

I also tried downloading that package, lmdeploy-0.2.6-cp311-cp311-manylinux2014_x86_64.whl, and installing it with pip. But when using it, even though I had already set

export CUDA_VISIBLE_DEVICES=1

it still threw this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 3.69 MiB is free. Process 181742 has 23.68 GiB memory in use. Of the allocated memory 1.36 GiB is allocated by PyTorch, and 43.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

All of the above was done inside the image openmmlab/lmdeploy-builder:cuda11.8.

lvhan028 commented 5 months ago

Switch to the image openmmlab/lmdeploy:v0.2.6, which already contains a compiled lmdeploy, so no installation is needed. After mapping /home/qwen/Qwen-VL-Chat into the container, you can run the lmdeploy serve command. Considering that you are using 3090s, there is a risk of OOM. When starting lmdeploy serve api_server, add the parameter --cache-max-entry-count 0.4; if you still hit OOM, keep lowering this value. I will find a 3090 and give it a try.
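
(A hedged sketch of how this could be put together end to end; the host path, mount point, and port mapping below are assumptions based on the paths used earlier in this thread, not instructions from the maintainer:)

```bash
# Hedged sketch: run the prebuilt v0.2.6 image, mount the model directory,
# expose the API port, and shrink the KV cache to reduce OOM risk on a 24 GB 3090.
docker run --rm -it --gpus all \
  -v /home/qwen/Qwen-VL-Chat:/app/qwen/Qwen-VL-Chat \
  -p 8000:8000 \
  openmmlab/lmdeploy:v0.2.6 \
  lmdeploy serve api_server /app/qwen/Qwen-VL-Chat \
    --server-port 8000 --cache-max-entry-count 0.4
```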

PredyDaddy commented 5 months ago

Switch to the image openmmlab/lmdeploy:v0.2.6, which already contains a compiled lmdeploy, so no installation is needed. After mapping /home/qwen/Qwen-VL-Chat into the container, you can run the lmdeploy serve command. Considering that you are using 3090s, there is a risk of OOM. When starting lmdeploy serve api_server, add the parameter --cache-max-entry-count 0.4; if you still hit OOM, keep lowering this value. I will find a 3090 and give it a try.

Thank you very much. I hadn't pulled the latest image; I tried adding --cache-max-entry-count 0.4 and it worked:

lmdeploy serve api_server /app/qwen/Qwen-VL-Chat --server-port 8000 --cache-max-entry-count 0.4

Then I sent a request:

from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient(f'http://0.0.0.0:8000')
model_name = api_client.available_models[0]
messages = [{
    'role':
    'user',
    'content': [{
        'type': 'text',
        'text': 'Describe the image please',
    }, {
        'type': 'image_url',
        'image_url': {
            'url':
            'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
        },
    }]
}]
for item in api_client.chat_completions_v1(model=model_name,
                                           messages=messages):
    print(item)

The result that came back:

{'id': '1', 'object': 'chat.completion', 'created': 1712037535, 'model': 'qwen', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'A tiger is resting on the grass outdoors.'}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 284, 'total_tokens': 294, 'completion_tokens': 10}}

Thank you very much!!!

lvhan028 commented 5 months ago

Finally! Enjoy:)

PredyDaddy commented 5 months ago

For people who don't read Chinese, my advice if you face the same problem as me: use the latest container, openmmlab/lmdeploy:v0.2.6. lmdeploy is already installed inside the container, so there is no need to install it yourself.

And if you have OOM problems, try adding --cache-max-entry-count 0.4 to your command.

Thanks, lmdeploy team!!!