InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Model output is empty #1659

Closed GZL11 closed 5 months ago

GZL11 commented 5 months ago

Checklist

Describe the bug

With both the pytorch and turbomind backends, the model output is empty: [Response(text='', generate_token_len=0, input_token_len=742, session_id=0, finish_reason='length', token_ids=[], logprobs=None)]

Reproduction

""" inference test """

import os import json import torch import requests import time import numpy as np

from tqdm import tqdm from peft import PeftModel from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig import lmdeploy from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig

from src.utils.build_prompt import build_prompt_map

os instructions

os.system("nvidia-smi")

def memory_use_compute(model, type_size: int = 2):
    # model.parameters() yields all of the model's weight parameters
    para = sum([np.prod(list(p.size())) for p in model.parameters()])
    # type_size is 4 when the parameters are float32, i.e. 4 bytes each
    print('Model {} : params: {:4f}G'.format(model._get_name(), para * type_size / 1000 / 1000 / 1000))

def main():
    resp_label_list = ["A", "B", "C"]
    train_model_adapter_path = "/mnt/bn/internlm218b/checkpoint-15000"
    base_model_path = "/mnt/bn/base_model/internlm218b"
    eval_local_path = "/mnt/bn/label_data/train.jsonl"

    # pytorch backend
    backend_config = PytorchEngineConfig(
        session_len=2048,
        adapters=dict(lora_name_1=train_model_adapter_path),
    )

    # # Lower the k/v cache memory ratio to 20% of total GPU memory
    # backend_config = TurbomindEngineConfig(
    #     session_len=2048,
    #     cache_max_entry_count=0.2,
    #     adapters=dict(lora_name_1=train_model_adapter_path),
    # )

    gen_config = GenerationConfig(
        top_p=0.7,
        top_k=50,
        temperature=0.8,
        max_new_tokens=2048,
    )

    pipe = pipeline(
        model_path=base_model_path,
        model_name="internlm218b",
        backend_config=backend_config,
    )

    def batch_infer(prompt_text):
        if isinstance(prompt_text, str):
            inputs_texts = [prompt_text]
        else:
            inputs_texts = prompt_text
        response = pipe(
            inputs_texts,
            gen_config=gen_config,
            adapter_name="lora_name_1",
        )
        print(response)

    with open(eval_local_path, "r", encoding="utf-8") as file, open(
            "/root/demo/test.jsonl", "w", encoding="utf-8") as write_file:
        max_samples = 100
        start_time = time.time()
        counter = 0
        prompt_queue = []
        for index, line in tqdm(enumerate(file)):  # len: 11790
            if max_samples and index > max_samples:
                break
            if counter < 1:
                line = line.strip().strip("\n")
                obj = json.loads(line)
                prompt_queue.append([
                    {"role": "user", "content": obj["prompt"]}
                ])
                counter += 1
            else:
                batch_infer(prompt_queue)
                counter = 0
                prompt_queue.clear()

        print(f"cost time: {time.time() - start_time}s")

if __name__ == "__main__":
    main()
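As an editorial side note (not part of the original report): the prompt length that later shows up as input_token_len can be estimated with the model's tokenizer before calling the pipeline, for example with transformers' AutoTokenizer on the same base model path (trust_remote_code is assumed to be required for InternLM2 models):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/mnt/bn/base_model/internlm218b", trust_remote_code=True)
n_input_token = len(tokenizer.encode("...your prompt here..."))
print(n_input_token)  # compare against session_len minus the intended max_new_tokens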

Environment

sys.platform: linux
Python: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.3.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

LMDeploy: 0.4.1+
transformers: 4.41.1
gradio: 4.21.0
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

Model output: [Response(text='', generate_token_len=0, input_token_len=742, session_id=0, finish_reason='length', token_ids=[], logprobs=None)]
lvhan028 commented 5 months ago

You set the maximum context length session_len to 2048, and at request time you ask for up to 2048 generated tokens.

That means n_input_token + max_new_tokens (2048) > session_len (2048). In that case, an empty response is returned directly.
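Plugging in the numbers from the traceback above makes this concrete (an illustrative check; 742 is the input_token_len reported in the empty Response):

session_len = 2048
n_input_token = 742       # input_token_len from the empty Response
max_new_tokens = 2048     # max_new_tokens from GenerationConfig

print(n_input_token + max_new_tokens > session_len)  # True -> request cannot fit, empty output
print(session_len - n_input_token)                   # 1306 tokens would actually be available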

You can use a smaller max_new_tokens, or set max_new_tokens to None and let lmdeploy work out how many tokens can be generated at most.
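A minimal sketch of that adjustment, reusing the paths and pipeline setup from the reproduction script (the value 1024 is illustrative, and accepting max_new_tokens=None follows the suggestion above and may depend on the lmdeploy version):

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

base_model_path = "/mnt/bn/base_model/internlm218b"
train_model_adapter_path = "/mnt/bn/internlm218b/checkpoint-15000"

backend_config = PytorchEngineConfig(
    session_len=2048,
    adapters=dict(lora_name_1=train_model_adapter_path),
)

# Option 1: leave headroom for the ~742-token prompt inside the 2048-token window
gen_config = GenerationConfig(top_p=0.7, top_k=50, temperature=0.8, max_new_tokens=1024)

# Option 2 (per the suggestion above): max_new_tokens=None lets lmdeploy
# compute how much can still be generated for each request
# gen_config = GenerationConfig(top_p=0.7, top_k=50, temperature=0.8, max_new_tokens=None)

pipe = pipeline(base_model_path, model_name="internlm218b", backend_config=backend_config)
print(pipe(["test prompt"], gen_config=gen_config, adapter_name="lora_name_1"))

Either way, the constraint to satisfy is input_token_len + max_new_tokens <= session_len.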