InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Deploying InternVL-Chat-V1-5-AWQ with lmdeploy: the OpenAI client works fine with 0.4.2, but with 0.5.0 it hangs with no response at all; the newly added cogvlm2 shows similar behavior. #1992

Open kklots opened 1 month ago

kklots commented 1 month ago

Describe the bug

After deploying InternVL-Chat-V1-5-AWQ with version 0.5.0 and sending a request from a Python client via the OpenAI API, the server log shows the message was received, but GPU resource usage does not change at all and the client waits for a response indefinitely.

Reproduction

GPU info: 4× modified RTX 2080 Ti 22G

Deployment commands:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
lmdeploy serve api_server /data2/model_zoo/InternVL-Chat-V1-5-AWQ --backend turbomind --model-format awq --tp 4 --server-name 0.0.0.0 --server-port 8086
```

Client code:

```python
import base64

from openai import OpenAI

# client/model_name setup is not in the original report; this is the usual
# OpenAI-client boilerplate for an lmdeploy api_server
client = OpenAI(base_url='http://0.0.0.0:8086/v1', api_key='none')
model_name = client.models.list().data[0].id


def encode_image(image_path):
    """Read a local image and return its base64 encoding."""
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')


def send_request(prompt, image_path, temperature, top_p, stream=True):
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model=model_name,
        messages=[{
            'role': 'user',
            'content': [{
                'type': 'text',
                'text': prompt,
            }, {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/jpeg;base64,{base64_image}',
                },
            }],
        }],
        temperature=temperature,
        top_p=top_p,
        max_tokens=1024,
        stream=stream,
        frequency_penalty=0.5,
        # presence_penalty=0.5
    )
    if stream is True:
        text = ''
        for info in response:
            # delta.content can be None on role/finish chunks
            text += info.choices[0].delta.content or ''
            yield text
    else:
        # yield instead of return: a plain `return value` inside a
        # generator function is silently discarded by the caller
        yield response.choices[0].message.content
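
A usage sketch of the function above (the image path and sampling values are illustrative, not from the report):

```python
# stream partial replies for a local test image
for partial in send_request('Describe this image.', 'test.jpg',
                            temperature=0.8, top_p=0.8, stream=True):
    print(partial)
```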

Environment

```
sys.platform: linux
Python: 3.9.19 (main, May  6 2024, 19:43:03) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda-11.8
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.7  (built against CUDA 11.8)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.0+
transformers: 4.42.3
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.2.0
```

Error traceback

The server shows the request was received:

```
INFO:     Started server process [3473]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8086 (Press CTRL+C to quit)
INFO:     192.168.102.18:33176 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     192.168.102.18:33820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```

GPU utilization stays at 0% throughout, with no fluctuation:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:19:00.0 Off |                  N/A |
| 41%   32C    P8              18W / 260W |  22063MiB / 22528MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:1A:00.0 Off |                  N/A |
|  0%   37C    P8              22W / 260W |  22033MiB / 22528MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:67:00.0 Off |                  N/A |
|  0%   38C    P8              22W / 260W |  22033MiB / 22528MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:68:00.0 Off |                  N/A |
| 73%   38C    P8              10W / 250W |  21797MiB / 22528MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

The client keeps waiting for a response:

```
processing | 381.1s
```
kklots commented 1 month ago

With version 0.4.2, following the same steps and code, the InternVL-Chat-V1-5-AWQ service works fine.

irexyc commented 1 month ago

Please start the server with --log-level INFO and post the logs, covering both server startup and the incoming request.
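
That is, the launch command from the report with logging enabled (only --log-level is added; everything else is unchanged):

```bash
lmdeploy serve api_server /data2/model_zoo/InternVL-Chat-V1-5-AWQ \
    --backend turbomind --model-format awq --tp 4 \
    --server-name 0.0.0.0 --server-port 8086 \
    --log-level INFO
```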

kklots commented 1 month ago

0.5.0 server startup: it is all warnings like the following

```
...
[TM][WARNING] Device 1 peer access Device 2 is not available.
[TM][WARNING] Device 1 peer access Device 3 is not available.
[TM][WARNING] Device 2 peer access Device 0 is not available.
[TM][WARNING] Device 2 peer access Device 1 is not available.
[TM][WARNING] Device 2 peer access Device 3 is not available.
[TM][WARNING] Device 3 peer access Device 0 is not available.
[TM][WARNING] Device 3 peer access Device 1 is not available.
[TM][WARNING] Device 3 peer access Device 2 is not available.
[TM][WARNING] Device 0 peer access Device 1 is not available.
[TM][WARNING] Device 0 peer access Device 2 is not available.
[TM][WARNING] Device 0 peer access Device 3 is not available.
[TM][WARNING] Device 1 peer access Device 0 is not available.
[TM][WARNING] Device 1 peer access Device 2 is not available.
[TM][WARNING] Device 1 peer access Device 3 is not available.
[TM][WARNING] Device 2 peer access Device 0 is not available.
[TM][WARNING] Device 2 peer access Device 1 is not available.
[TM][WARNING] Device 2 peer access Device 3 is not available.
[TM][WARNING] Device 3 peer access Device 0 is not available.
[TM][WARNING] Device 3 peer access Device 1 is not available.
[TM][WARNING] Device 3 peer access Device 2 is not available.
[TM][WARNING] Device 0 peer access Device 1 is not available.
[TM][WARNING] Device 0 peer access Device 2 is not available.
[TM][WARNING] Device 0 peer access Device 3 is not available.
[TM][WARNING] Device 1 peer access Device 0 is not available.
[TM][WARNING] Device 1 peer access Device 2 is not available.
[TM][WARNING] Device 1 peer access Device 3 is not available.
[TM][WARNING] Device 2 peer access Device 0 is not available.
[TM][WARNING] Device 2 peer access Device 1 is not available.
[TM][WARNING] Device 2 peer access Device 3 is not available.
[TM][WARNING] Device 3 peer access Device 0 is not available.
[TM][WARNING] Device 3 peer access Device 1 is not available.
[TM][WARNING] Device 3 peer access Device 2 is not available.
HINT: Please open http://0.0.0.0:8086 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:8086 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:8086 in a browser for detailed api usage!!!
INFO:     Started server process [10633]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8086 (Press CTRL+C to quit)
```

The server log when the request arrives:

```
INFO:     192.168.102.18:59546 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     192.168.102.18:59560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-07-11 04:00:08,818 - lmdeploy - INFO - start ImageEncoder._forward_loop
2024-07-11 04:00:08,818 - lmdeploy - INFO - ImageEncoder received 1 images, left 1 images.
2024-07-11 04:00:08,818 - lmdeploy - INFO - ImageEncoder process 1 images, left 0 images.
```

0.4.2: server startup, again with warnings like the following

```
...
[TM][INFO] Set logger level by INFO
[TM][WARNING] Device 1 peer access Device 0 is not available.
[TM][WARNING] Device 1 peer access Device 2 is not available.
[TM][WARNING] Device 1 peer access Device 3 is not available.
[TM][INFO] Set logger level by INFO
[TM][WARNING] Device 2 peer access Device 0 is not available.
[TM][WARNING] Device 2 peer access Device 1 is not available.
[TM][WARNING] Device 2 peer access Device 3 is not available.
[TM][INFO] Set logger level by INFO
[TM][WARNING] Device 3 peer access Device 0 is not available.
[TM][WARNING] Device 3 peer access Device 1 is not available.
[TM][WARNING] Device 3 peer access Device 2 is not available.
HINT: Please open http://0.0.0.0:8086 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:8086 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:8086 in a browser for detailed api usage!!!
INFO:     Started server process [11513]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8086 (Press CTRL+C to quit)
```

The server log when the request arrives (the long run of 0s in prompt_token_id, which are image placeholder tokens, is elided below; the rest is verbatim):

```
INFO:     192.168.102.18:51190 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     192.168.102.18:51190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-07-11 04:04:07,163 - lmdeploy - INFO - ImageEncoder received 1 images, left 1 images.
2024-07-11 04:04:07,163 - lmdeploy - INFO - ImageEncoder process 1 images, left 0 images.
/data2/lixuan/miniconda3/envs/lmdeploy/lib/python3.9/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/data2/lixuan/miniconda3/envs/lmdeploy/lib/python3.9/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
2024-07-11 04:04:08,692 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 1.529s
2024-07-11 04:04:08,692 - lmdeploy - INFO - ImageEncoder done 1 images, left 0 images.
2024-07-11 04:04:08,699 - lmdeploy - INFO - prompt='<|im_start|>system\nYou are an AI assistant whose name is InternLM (书生·浦语).<|im_end|>\n<|im_start|>user\n\n请告诉我图片中的内容,并以中文回答<|im_end|>\n<|im_start|>assistant\n', gen_config=EngineGenerationConfig(n=1, max_new_tokens=1024, top_p=0.8, top_k=40, temperature=0.8, repetition_penalty=1.0, ignore_eos=False, random_seed=15936302334360235144, stop_words=[92542, 92540], bad_words=None, min_new_tokens=None, skip_special_tokens=True, logprobs=None), prompt_token_id=[1, 92543, 9081, 364, 2770, 657, 589, 15358, 17993, 6843, 963, 505, 4576, 11146, 451, 60628, 60384, 60721, 62442, 60752, 699, 92542, 364, 92543, 1008, 364, 92544, 0, 0, 0, ..., 0, 0, 0, 92545, 364, 60836, 71404, 68467, 68322, 68341, 60353, 81771, 69093, 68855, 92542, 364, 92543, 525, 11353, 364], adapter_name=None.
2024-07-11 04:04:08,699 - lmdeploy - INFO - session_id=1, history_tokens=0, input_tokens=812, max_new_tokens=1024, seq_start=True, seq_end=True, step=0, prep=True
[TM][INFO] Set logger level by INFO
[TM][INFO] Set logger level by INFO
[TM][INFO] Set logger level by INFO
[TM][INFO] Set logger level by INFO
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] Set logger level by INFO
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] Set logger level by INFO
[TM][INFO] Set logger level by INFO
[TM][INFO] Set logger level by INFO
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 812, max_q = 812, max_k = 812
[TM][INFO] Set logger level by INFO
[TM][INFO] ------------------------- step = 820 -------------------------
[TM][INFO] ------------------------- step = 830 -------------------------
[TM][INFO] ------------------------- step = 840 -------------------------
[TM][INFO] ------------------------- step = 850 -------------------------
[TM][INFO] ------------------------- step = 860 -------------------------
[TM][INFO] ------------------------- step = 870 -------------------------
[TM][INFO] ------------------------- step = 880 -------------------------
[TM][INFO] ------------------------- step = 890 -------------------------
[TM][INFO] ------------------------- step = 900 -------------------------
[TM][INFO] [Interrupt] slot = 0, id = 1
[TM][INFO] [forward] Request complete for 1, code 0
```

parasol-ry commented 1 month ago

InternVL2 runs into the same problem.

kong1414 commented 1 month ago

It seems some of the default parameters in 0.5.0 differ from 0.4.2. Try adding these flags to the launch command: --model-name internvl-internlm2 --session-len 32776 --num-tokens-per-iter 8192

InternVL2 is also supported; I have tried both 8B and 26B-AWQ and they work.
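
Combining this with the launch command from the report, the suggested invocation would look like this (flags as proposed above; nothing else changed):

```bash
lmdeploy serve api_server /data2/model_zoo/InternVL-Chat-V1-5-AWQ \
    --backend turbomind --model-format awq --tp 4 \
    --server-name 0.0.0.0 --server-port 8086 \
    --model-name internvl-internlm2 \
    --session-len 32776 --num-tokens-per-iter 8192
```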

irexyc commented 1 month ago

This looks similar to https://github.com/InternLM/lmdeploy/issues/1981. I don't know the exact cause yet; my guess is that it is related to the image encoder implementation switching from multithreading to coroutines (lmdeploy/vl/engine.py). If you hit this problem, you can try changing this line:

https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/serve/vl_async_engine.py#L66

to:

```python
features = await self.vl_encoder.infer(images)
```
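
To illustrate the suspected failure mode in isolation (a generic asyncio sketch, not lmdeploy code): a synchronous call inside a coroutine freezes the event loop, which looks exactly like a request that is accepted but never serviced, while awaiting the work keeps other coroutines alive.

```python
import asyncio
import time


def blocking_encode():
    # stands in for a synchronous image-encoder call
    time.sleep(2)
    return 'features'


async def handle_request(use_await: bool):
    if use_await:
        # cooperative: run the blocking work in a thread and await it,
        # so other coroutines keep making progress
        return await asyncio.to_thread(blocking_encode)
    # blocking: this freezes the whole event loop for 2 seconds
    return blocking_encode()


async def heartbeat():
    for _ in range(4):
        print('event loop alive')
        await asyncio.sleep(0.5)


async def main():
    # flip use_await to False to reproduce the "hang": the heartbeat
    # stops printing while the blocking call monopolizes the loop
    await asyncio.gather(heartbeat(), handle_request(use_await=True))


asyncio.run(main())
```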

parasol-ry commented 1 month ago

Adding --session-len 32776 --num-tokens-per-iter 8192 made it work for me.

kong1414 commented 1 month ago

I looked at the 0.5.0 startup log: session_len defaults to 90k+, while my GPU memory can only fit around 60k, so it blows up.

You can spot this by carefully comparing the logs printed at startup by 0.4.2 and 0.5.0. Whether any other defaults changed, I don't know.
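
For those using the Python API rather than the CLI, a hedged sketch of pinning these values explicitly (values from the suggestion above; whether num_tokens_per_iter is exposed on the config in your version is worth double-checking):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# cap the context length so the KV cache fits on 22G cards,
# instead of relying on the larger 0.5.0 default
engine_config = TurbomindEngineConfig(
    session_len=32776,
    num_tokens_per_iter=8192,
    tp=4,
    model_format='awq',
)
pipe = pipeline('/data2/model_zoo/InternVL-Chat-V1-5-AWQ',
                backend_config=engine_config)
```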

kklots commented 1 month ago

> It seems some of the default parameters in 0.5.0 differ from 0.4.2. Try adding these flags to the launch command: --model-name internvl-internlm2 --session-len 32776 --num-tokens-per-iter 8192
>
> InternVL2 is also supported; I have tried both 8B and 26B-AWQ and they work.

Where can InternVL 26B-AWQ be downloaded? Thanks.

lvhan028 commented 1 month ago

> I looked at the 0.5.0 startup log: session_len defaults to 90k+, while my GPU memory can only fit around 60k, so it blows up.
>
> You can spot this by carefully comparing the logs printed at startup by 0.4.2 and 0.5.0. Whether any other defaults changed, I don't know.

This was fixed in #2007; the v0.5.1 release goes out next week.

zhyncs commented 1 month ago

Hi @kklots Could you try this build: https://github.com/zhyncs/lmdeploy-build/releases/tag/aa07f92

nzomi commented 1 month ago

Hi @lvhan028 @zhyncs @irexyc Is it possible to use the same deployment method mentioned in this issue to infer a batch of prompts? Just like how we can input a list of [(prompt, image), ...] into the pipeline to further speed up the inference.

lvhan028 commented 1 month ago

> Hi @lvhan028 @zhyncs @irexyc Is it possible to use the same deployment method mentioned in this issue to infer a batch of prompts? Just like how we can input a list of [(prompt, image), ...] into the pipeline to further speed up the inference.

Yes, you can. Please refer to the guidance mentioned here; the messages can be a list of requests.
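
For reference, a minimal sketch of the batched form through the offline pipeline (the model path is taken from this thread; image paths are placeholders):

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('/data2/model_zoo/InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq', tp=4))

# one (prompt, image) tuple per request; responses come back in order
image_paths = ['a.jpg', 'b.jpg']  # placeholders
prompts = [('describe this image', load_image(p)) for p in image_paths]
responses = pipe(prompts)
for r in responses:
    print(r.text)
```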

nzomi commented 1 month ago

@lvhan028 Thanks for your response! I followed the guidance and multiplied the messages list by 10 using messages = messages * 10. I expected to get 10 responses but only received one. Did I do something wrong? [screenshot]

lvhan028 commented 1 month ago

Sorry, my bad. I found that in VLM scenarios lmdeploy doesn't support batch message input. Let me check with the team whether it is a bug.

irexyc commented 1 month ago

@nzomi

According to link1, link2 and link3, I don't think the OpenAI API supports batch inputs.

nzomi commented 1 month ago

@irexyc Thank you! Currently, do you know which APIs support batch inputs?

irexyc commented 1 month ago

@nzomi

LMDeploy's OpenAI-compatible API doesn't support batch inputs. But if you send multiple requests within a short period of time, the server will run them as a batch, which improves throughput.
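
As an illustration of that, a client-side sketch (reusing the client and model_name set up earlier in this thread; prompts are placeholders): fire the requests concurrently and let the server batch whatever is in flight.

```python
from concurrent.futures import ThreadPoolExecutor


def ask(prompt):
    # an ordinary single-prompt completion; batching happens server-side
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content


prompts = ['describe image A', 'describe image B', 'describe image C']
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    answers = list(pool.map(ask, prompts))
```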

stwrd commented 1 month ago

> It seems some of the default parameters in 0.5.0 differ from 0.4.2. Try adding these flags to the launch command: --model-name internvl-internlm2 --session-len 32776 --num-tokens-per-iter 8192
>
> InternVL2 is also supported; I have tried both 8B and 26B-AWQ and they work.

I hit the same problem with 0.5.0: after about two hours of running InternVL2-26B, the server hangs. I'll try this approach as well and see whether it works.