InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] finish_reason='length' once setting max_tokens with openai API #1589

Closed: Hzzone closed this issue 5 months ago

Hzzone commented 5 months ago

Describe the bug

With the latest lmdeploy (0.4.1), the bug is as described in the title: once max_tokens is set, the response is an empty string with finish_reason='length'. It works well without max_tokens. More details are shown in the code below.

Reproduction

Run the server as

lmdeploy serve api_server liuhaotian/llava-v1.6-34b --server-port 12345 --tp 1 --session-len 4096
lmdeploy serve api_server /root/data/zhizhonghuang/checkpoints/llama3-70B-Instruct-hf --server-port 21121 --tp 2 --session-len 8192

Reproduction code:

import base64
import io
import json
import os
import random
import re

import requests
import torch
import torch.nn.functional as F
import torchvision
import tqdm
from PIL import Image
from openai import OpenAI, AzureOpenAI, AsyncAzureOpenAI, AsyncOpenAI

torch.cuda.set_device(5)
torch.autograd.set_grad_enabled(False)

base_urls = {
    'liuhaotian/llava-v1.6-34b': 'http://0.0.0.0:12345/v1',
    'llama3-70B': 'http://0.0.0.0:12347/v1',
}
llm_clients = {}

for name in base_urls:
    client = OpenAI(api_key='YOUR_API_KEY', base_url=base_urls[name])
    llm_clients[name] = client

model_name = 'liuhaotian/llava-v1.6-34b'
client = llm_clients[model_name]
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{'role': 'user', 'content': [
        {'type': 'text', 'text': 'who are you'}, 
        ]}],
    max_tokens=8192,
    # temperature=.5,
    temperature=0,
    top_p=1.0,
)
response
ChatCompletion(id='10', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='', role='assistant', function_call=None, tool_calls=None))], created=1715656169, model='llava-chatml', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=0, prompt_tokens=23, total_tokens=23))
model_name = 'llama3-70B'
client = llm_clients[model_name]
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{'role': 'user', 'content': [
        {'type': 'text', 'text': 'who are you'}, 
        ]}],
    # max_tokens=8192,
    # temperature=.5,
    temperature=0,
    top_p=1.0,
)
response
ChatCompletion(id='3', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', role='assistant', function_call=None, tool_calls=None))], created=1715656197, model='vicuna', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=183, prompt_tokens=45, total_tokens=228))
model_name = 'llama3-70B'
client = llm_clients[model_name]
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{'role': 'user', 'content': [
        {'type': 'text', 'text': 'who are you'}, 
        ]}],
    max_tokens=8192,
    # temperature=.5,
    temperature=0,
    top_p=1.0,
)
response
ChatCompletion(id='2', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='', role='assistant', function_call=None, tool_calls=None))], created=1715656188, model='vicuna', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=0, prompt_tokens=45, total_tokens=45))

Environment

sys.platform: linux
Python: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.2.1+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu118
LMDeploy: 0.4.1+5cf213e
transformers: 4.41.0.dev0
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

No response

irexyc commented 5 months ago

If you start the server with --session-len, you can't set max_tokens to the same value as session_len.

The reason is this logic: if the number of input tokens plus max_tokens is greater than session_len, the engine will just return an empty output with finish_reason='length'.

In contrast, if you don't pass max_tokens, the potential output length is session_len minus the history context length.
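For reference, here is a minimal sketch of that length check (the helper name and exact behaviour are assumptions inferred from the explanation above, not the actual lmdeploy source):

def effective_max_tokens(prompt_tokens, session_len, max_tokens=None):
    # Budget left in the session after the prompt (and any history).
    budget = max(session_len - prompt_tokens, 0)
    if max_tokens is None:
        # No cap requested: the engine can use whatever budget remains.
        return budget
    if prompt_tokens + max_tokens > session_len:
        # Over budget: empty output with finish_reason='length'.
        return 0
    return max_tokens

# Matches the results reported above:
print(effective_max_tokens(45, 8192, 8192))  # 0    -> empty llama3-70B response
print(effective_max_tokens(45, 8192, None))  # 8147 -> normal generation
print(effective_max_tokens(23, 4096, 8192))  # 0    -> empty llava response

A practical workaround is therefore to cap max_tokens at session_len minus the prompt length, or to omit it and let the engine use the remaining budget.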

Hzzone commented 5 months ago

Thanks for your answer.