InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Model returns empty output after deploying with lmdeploy #2006

Open ChingKwanCheung opened 1 month ago

ChingKwanCheung commented 1 month ago

The problem is similar to https://github.com/InternLM/lmdeploy/issues/1991#issue-2402071158: short prompts produce normal output, but longer prompts (over 10,000 characters, still within the configured session-len) return an empty result.
Model: qwen1.5-7b-chat
Launch command: lmdeploy serve api_server <path to the qwen1half-7b-chat model> --server-name 0.0.0.0 --server-port 6002 --tp 1 --cache-max-entry-count 0.2 --rope-scaling-factor 0.2 --session-len 32000
Log error: lmdeploy - ERROR - Truncate max_new_tokens to 19663
What could be causing this?
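For reference, a minimal way to send a long prompt to the server above (a sketch assuming the OpenAI-compatible /v1/chat/completions route that api_server exposes; the model name below is a placeholder and should match whatever GET /v1/models reports):

# check the served model name first (host/port taken from the launch command above)
curl http://0.0.0.0:6002/v1/models
# send the long prompt; "qwen1half-7b-chat" is an assumed name, substitute the reported one
curl http://0.0.0.0:6002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen1half-7b-chat", "messages": [{"role": "user", "content": "<long prompt here>"}], "max_tokens": 1024}'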

lvhan028 commented 1 month ago

lmdeploy - ERROR - Truncate max_new_tokens to 19663: this message means that at most 19663 tokens will be generated. The limit is computed from session_len and the number of input prompt tokens. Could you run lmdeploy check_env and share your environment information?
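Concretely (assuming the cap is simply session_len minus the prompt length in tokens): 32000 - 19663 = 12337, i.e. the prompt in this case appears to occupy about 12,337 tokens, leaving 19,663 tokens for generation within the 32000-token session.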

lvhan028 commented 1 month ago

Also, when running the lmdeploy serve command, please add the --log-level INFO option and share the startup log.

ChingKwanCheung commented 1 month ago

lmdeploy - ERROR - Truncate max_new_tokens to 19663: this message means that at most 19663 tokens will be generated. The limit is computed from session_len and the number of input prompt tokens. Could you run lmdeploy check_env and share your environment information?

With the same question, the internlm2-7b model that was offline-converted to turbomind does produce normal output. My environment info is:
sys.platform: linux
Python: 3.11.4 (main, Jul 5 2023, 14:15:25) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: Tesla V100-PCIE-32GB
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (GCC) 7.3.0
PyTorch: 2.0.1+cu117
PyTorch compiling details: PyTorch built with:

TorchVision: 0.15.2+cu117
LMDeploy: 0.5.0+
transformers: 4.38.2
gradio: 4.16.0
fastapi: 0.109.0
pydantic: 2.5.2
triton: 2.0.0

ChingKwanCheung commented 1 month ago

Also, when running the lmdeploy serve command, please add the --log-level INFO option and share the startup log.

After adding this option I get an error:
usage: lmdeploy [-h] [-v] {lite,serve,convert,list,check_env,chat} ...
lmdeploy: error: unrecognized arguments:
What am I doing wrong?

lvhan028 commented 1 month ago

Also, when running the lmdeploy serve command, please add the --log-level INFO option and share the startup log.

After adding this option I get an error:
usage: lmdeploy [-h] [-v] {lite,serve,convert,list,check_env,chat} ...
lmdeploy: error: unrecognized arguments:
What am I doing wrong?

lmdeploy serve api_server <path to the qwen1half-7b-chat model> --server-name 0.0.0.0 --server-port 6002 --tp 1 --cache-max-entry-count 0.2 --rope-scaling-factor 0.2 --session-len 32000 --log-level INFO

ChingKwanCheung commented 1 month ago

Also, when running the lmdeploy serve command, please add the --log-level INFO option and share the startup log.

After adding this option I get an error:
usage: lmdeploy [-h] [-v] {lite,serve,convert,list,check_env,chat} ...
lmdeploy: error: unrecognized arguments:
What am I doing wrong?

lmdeploy serve api_server <path to the qwen1half-7b-chat model> --server-name 0.0.0.0 --server-port 6002 --tp 1 --cache-max-entry-count 0.2 --rope-scaling-factor 0.2 --session-len 32000 --log-level INFO

The log is as follows (note the 'INFO\r' in the first error; see the note after the log):

usage: lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT] [--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]] [--allow-credentials] [--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]] [--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]] [--qos-config-path QOS_CONFIG_PATH] [--backend {pytorch,turbomind}] [--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}] [--api-keys [API_KEYS ...]] [--ssl] [--meta-instruction META_INSTRUCTION] [--chat-template CHAT_TEMPLATE] [--cap {completion,infilling,chat,python}] [--revision REVISION] [--download-dir DOWNLOAD_DIR] [--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME] [--session-len SESSION_LEN] [--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT] [--cache-block-seq-len CACHE_BLOCK_SEQ_LEN] [--enable-prefix-caching] [--model-format {hf,llama,awq}] [--quant-policy {0,4,8}] [--rope-scaling-factor ROPE_SCALING_FACTOR] [--num-tokens-per-iter NUM_TOKENS_PER_ITER] [--max-prefill-iters MAX_PREFILL_ITERS] [--vision-max-batch-size VISION_MAX_BATCH_SIZE] model_path
lmdeploy serve api_server: error: argument --log-level: invalid choice: 'INFO\r' (choose from 'CRITICAL', 'FATAL', 'ERROR', 'WARN', 'WARNING', 'INFO', 'DEBUG', 'NOTSET')

(lmdeploy) [z00854892@aiservice-01 InternLM-main]$ sh lmdeploy_start.sh
2024-07-12 16:46:42,189 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name=None, model_format=None, tp=1, session_len=32000, max_batch_size=128, cache_max_entry_count=0.2, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.2, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-07-12 16:46:42,189 - lmdeploy - INFO - input chat_template_config=None
2024-07-12 16:46:42,379 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-12 16:46:42,379 - lmdeploy - INFO - model_source: ModelSource.HF_MODEL
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device does not support bfloat16. Set float16 forcefully
2024-07-12 16:46:42,857 - lmdeploy - INFO - model_config:

[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 151936
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 32000
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.2
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 4
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =

[TM][WARNING] [LlamaTritonModel] max_context_token_num = 32000.
2024-07-12 16:46:44,801 - lmdeploy - WARNING - get 291 model params
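A note on the earlier --log-level failure: the invalid choice 'INFO\r' shows a trailing carriage return, which usually means lmdeploy_start.sh was saved with Windows (CRLF) line endings. If that is the case, stripping the carriage returns before rerunning should clear that particular error, e.g.:

sed -i 's/\r$//' lmdeploy_start.sh   # or: dos2unix lmdeploy_start.sh
sh lmdeploy_start.sh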