InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Model output is empty #1659

Closed GZL11 closed 5 months ago

GZL11 commented 5 months ago

Checklist

Describe the bug

With both the pytorch and turbomind backends, the model output is empty: [Response(text='', generate_token_len=0, input_token_len=742, session_id=0, finish_reason='length', token_ids=[], logprobs=None)]

Reproduction

""" inference test """

import os import json import torch import requests import time import numpy as np

from tqdm import tqdm from peft import PeftModel from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig import lmdeploy from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig

from src.utils.build_prompt import build_prompt_map

os instructions

os.system("nvidia-smi")

def memory_use_compute(model, type_size: int = 2):
    # model.parameters() yields all of the model's weight parameters
    para = sum([np.prod(list(p.size())) for p in model.parameters()])
    # type_size is 4 when the parameters are float32, i.e. 4 bytes each
    print('Model {} : params: {:4f}G'.format(model._get_name(), para * type_size / 1000 / 1000 / 1000))

def main():
    resp_label_list = ["A", "B", "C"]
    train_model_adapter_path = "/mnt/bn/internlm218b/checkpoint-15000"
    base_model_path = "/mnt/bn/base_model/internlm218b"
    eval_local_path = "/mnt/bn/label_data/train.jsonl"

    # pytorch backend
    backend_config = PytorchEngineConfig(
        session_len=2048,
        adapters=dict(lora_name_1=train_model_adapter_path),
    )

    # # Lower the k/v cache memory ratio to 20% of total GPU memory
    # backend_config = TurbomindEngineConfig(
    #     session_len=2048,
    #     cache_max_entry_count=0.2,
    #     adapters=dict(lora_name_1=train_model_adapter_path),
    # )

    gen_config = GenerationConfig(
        top_p=0.7,
        top_k=50,
        temperature=0.8,
        max_new_tokens=2048,
    )

    pipe = pipeline(
        model_path=base_model_path,
        model_name="internlm218b",
        backend_config=backend_config,
    )

    def batch_infer(prompt_text):
        if isinstance(prompt_text, str):
            inputs_texts = [prompt_text]
        else:
            inputs_texts = prompt_text
        response = pipe(
            inputs_texts,
            gen_config=gen_config,
            adapter_name="lora_name_1",
        )
        print(response)

    with open(eval_local_path, "r", encoding="utf-8") as file, open(
            "/root/demo/test.jsonl", "w", encoding="utf-8") as write_file:
        max_samples = 100
        start_time = time.time()
        counter = 0
        prompt_queue = []
        for index, line in tqdm(enumerate(file)):  # len: 11790
            if max_samples and index > max_samples:
                break
            if counter < 1:
                line = line.strip().strip("\n")
                obj = json.loads(line)
                prompt_queue.append([
                    {"role": "user", "content": obj["prompt"]}
                ])
                counter += 1
            else:
                batch_infer(prompt_queue)
                counter = 0
                prompt_queue.clear()

        print(f"cost time: {time.time() - start_time}s")

if __name__ == "__main__":
    main()
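As an editorial side note (not part of the original report): the prompt length that later shows up as input_token_len can be estimated with the model's tokenizer before calling the pipeline, for example with transformers' AutoTokenizer on the same base model path (trust_remote_code is assumed to be required for InternLM2 models):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/mnt/bn/base_model/internlm218b", trust_remote_code=True)
n_input_token = len(tokenizer.encode("...your prompt here..."))
print(n_input_token)  # compare against session_len minus the intended max_new_tokens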

Environment

sys.platform: linux
Python: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.3.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

LMDeploy: 0.4.1+
transformers: 4.41.1
gradio: 4.21.0
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

Model output: [Response(text='', generate_token_len=0, input_token_len=742, session_id=0, finish_reason='length', token_ids=[], logprobs=None)]
lvhan028 commented 5 months ago

You set the maximum context length session_len to 2048, and at request time you ask for up to 2048 generated tokens.

That means n_input_token + max_new_tokens (2048) > session_len (2048). In that case, an empty response is returned directly.
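Plugging in the numbers from the traceback above makes this concrete (an illustrative check; 742 is the input_token_len reported in the empty Response):

session_len = 2048
n_input_token = 742       # input_token_len from the empty Response
max_new_tokens = 2048     # max_new_tokens from GenerationConfig

print(n_input_token + max_new_tokens > session_len)  # True -> request cannot fit, empty output
print(session_len - n_input_token)                   # 1306 tokens would actually be available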

You can use a smaller max_new_tokens, or set max_new_tokens to None and let lmdeploy work out how many tokens can be generated at most.
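A minimal sketch of that adjustment, reusing the paths and pipeline setup from the reproduction script (the value 1024 is illustrative, and accepting max_new_tokens=None follows the suggestion above and may depend on the lmdeploy version):

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

base_model_path = "/mnt/bn/base_model/internlm218b"
train_model_adapter_path = "/mnt/bn/internlm218b/checkpoint-15000"

backend_config = PytorchEngineConfig(
    session_len=2048,
    adapters=dict(lora_name_1=train_model_adapter_path),
)

# Option 1: leave headroom for the ~742-token prompt inside the 2048-token window
gen_config = GenerationConfig(top_p=0.7, top_k=50, temperature=0.8, max_new_tokens=1024)

# Option 2 (per the suggestion above): max_new_tokens=None lets lmdeploy
# compute how much can still be generated for each request
# gen_config = GenerationConfig(top_p=0.7, top_k=50, temperature=0.8, max_new_tokens=None)

pipe = pipeline(base_model_path, model_name="internlm218b", backend_config=backend_config)
print(pipe(["test prompt"], gen_config=gen_config, adapter_name="lora_name_1"))

Either way, the constraint to satisfy is input_token_len + max_new_tokens <= session_len.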