Empty response when serving internlm2_5-7B-chat with lmdeploy

wwwyfff commented 1 month ago

📚 The doc issue

Code:

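from lmdeploy.serve.openai.api_client import APIClient


def generate_QA():
    server_addr = "..."
    client = APIClient(server_addr, api_key=None)
    model_name = client.available_models[0]

    # Prompt, in English: "You are a bot that is good at summarizing text outlines.
    # Your task is to analyze the long text with the following steps and produce
    # output as required......" (the rest is elided in the original report)
    req = "你是一位善于总结文本大纲的机器人，你的任务是按照以下步骤分析长文本内容，并按照要求进行输出......"

    prompt = [{"role": "user", "content": req}]

    top_p = 0.9
    temperature = 0.7
    output_seqlen = 4096
    stream_output = False

    # Stays None if the server yields nothing, instead of raising UnboundLocalError.
    output = None
    for response in client.chat_completions_v1(model=model_name, messages=prompt, temperature=temperature, top_p=top_p, n=1, max_tokens=output_seqlen, stream=stream_output, ignore_eos=False):
        print(response)
        output = response["choices"][0]["message"]["content"]

    return output


print(generate_QA())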

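If my prompt is a fairly simple question, such as "introduce yourself", the response is not empty. If the prompt is more complex, the response comes back empty, as shown in the screenshot: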
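To tell an empty generation apart from a server-side failure, it can help to inspect each response before indexing into it. The following is a minimal sketch, assuming the server returns an OpenAI-style chat completion dict; `extract_content` is a hypothetical helper, not part of lmdeploy.

```python
def extract_content(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict.

    Raises ValueError carrying the raw payload when the text is missing or
    empty, so a server-side failure is not silently turned into an empty string.
    """
    choices = response.get("choices") or []
    if not choices:
        # Assumption: an error payload has no "choices" list.
        raise ValueError(f"no choices in response: {response!r}")
    first = choices[0]
    content = (first.get("message") or {}).get("content")
    if not content:
        reason = first.get("finish_reason")
        raise ValueError(f"empty content (finish_reason={reason!r}): {response!r}")
    return content
```

Inside the loop above, `output = extract_content(response)` would then surface the full payload whenever the content is empty, which is useful to attach to a report alongside the server log.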

Suggest a potential alternative/fix

No response

lvhan028 commented 1 month ago

The log shows "CUDA runtime error: out of memory". Please run the command "lmdeploy check_env" and paste the environment information here.

wwwyfff commented 1 month ago

Hello! Here is my environment information. Thanks for your help.

sys.platform: linux
Python: 3.9.19 (main, May 6 2024, 19:43:03) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: NVIDIA L4
CUDA_HOME: None
GCC: gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:

GCC 9.3
C++ Version: 201703
Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX512
CUDA Runtime 12.1
NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
CuDNN 8.9.2
Magma 2.6.1
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.0+
transformers: 4.42.3
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.2.0

lvhan028 commented 1 month ago

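I looked up the L4: it has 24 GB of memory, so the inference engine's default configuration is not a good fit. Please provide the following information so that we can suggest a suitable configuration.

What is the context length?
How much concurrency does the server need to support?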

wwwyfff commented 1 month ago

A single machine with 16 GPUs, a context length of 128k, and a concurrency of 1.

Thanks.
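For a sense of scale, the KV cache needed for one 128k-token sequence can be estimated with simple arithmetic. This is a rough sketch: the model dimensions below (32 layers, 8 key/value heads, head dimension 128, fp16 cache) are assumptions about internlm2_5-7b-chat that should be checked against the model's config.json, and the estimate ignores weights, activations, and runtime buffers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to hold keys and values for `batch` sequences of `seq_len` tokens."""
    # The factor 2 accounts for storing both K and V in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len * batch


GIB = 1024 ** 3

# Assumed dimensions; verify against the model's config.json.
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       seq_len=128 * 1024)

# With --tp 2, assume the KV heads are split evenly across the two GPUs.
per_gpu = total / 2

print(f"{total / GIB:.1f} GiB total, {per_gpu / GIB:.1f} GiB per GPU")
# prints "16.0 GiB total, 8.0 GiB per GPU"
```

Under these assumptions, a single full-length sequence takes about a third of each 24 GB L4 for the cache alone, before the fp16 weights and prefill buffers are counted.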

lvhan028 commented 1 month ago

What command did you use to start the server?

wwwyfff commented 1 month ago

CUDA_VISIBLE_DEVICES=6,9 lmdeploy serve api_server ./workspace --server-port 35010 --tp 2

CUDA_VISIBLE_DEVICES=6,9 lmdeploy serve api_server internlm2_5-7b-chat-1m-4bit/ --server-port 35010 --tp 2 --model-format awq

It should be the first one. I plan to try the quantized version later.
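Neither command above sets the session length or the KV cache budget explicitly. As a hedged sketch of what an explicit configuration might look like, the snippet below assembles a launch command using the `--session-len` and `--cache-max-entry-count` options of `lmdeploy serve api_server`. The helper `build_server_cmd` is hypothetical, and the value 0.5 is only an illustrative starting point, not a setting recommended in this thread.

```python
import shlex


def build_server_cmd(model_path, port, tp, session_len,
                     cache_max_entry_count, model_format=None):
    """Assemble an `lmdeploy serve api_server` command line as a list of arguments."""
    cmd = [
        "lmdeploy", "serve", "api_server", model_path,
        "--server-port", str(port),
        "--tp", str(tp),
        # Maximum sequence length (prompt plus generated tokens) per session.
        "--session-len", str(session_len),
        # KV cache budget; 0.5 below is an illustrative guess.
        "--cache-max-entry-count", str(cache_max_entry_count),
    ]
    if model_format:
        cmd += ["--model-format", model_format]
    return cmd


cmd = build_server_cmd("./workspace", port=35010, tp=2,
                       session_len=131072, cache_max_entry_count=0.5)
print("CUDA_VISIBLE_DEVICES=6,9 " + shlex.join(cmd))
```

Lowering the cache budget leaves more GPU memory for the buffers used while processing long prompts, at the cost of fewer cached tokens. Whether that resolves the out-of-memory error here would need to be confirmed against the server log.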
