InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Model returns empty output after deploying with lmdeploy #2006

Open ChingKwanCheung opened 1 month ago

ChingKwanCheung commented 1 month ago

The problem is similar to https://github.com/InternLM/lmdeploy/issues/1991#issue-2402071158: short prompts produce normal output, but longer prompts (over 10,000 characters, still within the configured session-len) return an empty result.
Model: qwen1.5-7b-chat
Launch command: lmdeploy serve api_server <path to the qwen1half-7b-chat model> --server-name 0.0.0.0 --server-port 6002 --tp 1 --cache-max-entry-count 0.2 --rope-scaling-factor 0.2 --session-len 32000
Log error: lmdeploy - ERROR - Truncate max_new_tokens to 19663
What could be causing this?
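For reference, a minimal way to send a long prompt to the server above (a sketch assuming the OpenAI-compatible /v1/chat/completions route that api_server exposes; the model name below is a placeholder and should match whatever GET /v1/models reports):

# check the served model name first (host/port taken from the launch command above)
curl http://0.0.0.0:6002/v1/models
# send the long prompt; "qwen1half-7b-chat" is an assumed name, substitute the reported one
curl http://0.0.0.0:6002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen1half-7b-chat", "messages": [{"role": "user", "content": "<long prompt here>"}], "max_tokens": 1024}'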

lvhan028 commented 1 month ago

lmdeploy - ERROR - Truncate max_new_tokens to 19663: this message means that at most 19663 tokens will be generated. The limit is computed from session_len and the number of input prompt tokens. Could you run lmdeploy check_env and share your environment information?
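Concretely (assuming the cap is simply session_len minus the prompt length in tokens): 32000 - 19663 = 12337, i.e. the prompt in this case appears to occupy about 12,337 tokens, leaving 19,663 tokens for generation within the 32000-token session.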

lvhan028 commented 1 month ago

Also, when running the lmdeploy serve command, please add the --log-level INFO option and share the startup log.

ChingKwanCheung commented 1 month ago

lmdeploy - ERROR - Truncate max_new_tokens to 19663: this message means that at most 19663 tokens will be generated. The limit is computed from session_len and the number of input prompt tokens. Could you run lmdeploy check_env and share your environment information?

With the same question, the internlm2-7b model that was offline-converted to turbomind does produce normal output. My environment info is:
sys.platform: linux
Python: 3.11.4 (main, Jul 5 2023, 14:15:25) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: Tesla V100-PCIE-32GB
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (GCC) 7.3.0
PyTorch: 2.0.1+cu117
PyTorch compiling details: PyTorch built with:

TorchVision: 0.15.2+cu117
LMDeploy: 0.5.0+
transformers: 4.38.2
gradio: 4.16.0
fastapi: 0.109.0
pydantic: 2.5.2
triton: 2.0.0

ChingKwanCheung commented 1 month ago

Also, when running the lmdeploy serve command, please add the --log-level INFO option and share the startup log.

After adding this option I get an error:
usage: lmdeploy [-h] [-v] {lite,serve,convert,list,check_env,chat} ...
lmdeploy: error: unrecognized arguments:
What am I doing wrong?

lvhan028 commented 1 month ago

Also, when running the lmdeploy serve command, please add the --log-level INFO option and share the startup log.

After adding this option I get an error:
usage: lmdeploy [-h] [-v] {lite,serve,convert,list,check_env,chat} ...
lmdeploy: error: unrecognized arguments:
What am I doing wrong?

lmdeploy serve api_server <path to the qwen1half-7b-chat model> --server-name 0.0.0.0 --server-port 6002 --tp 1 --cache-max-entry-count 0.2 --rope-scaling-factor 0.2 --session-len 32000 --log-level INFO

ChingKwanCheung commented 1 month ago

Also, when running the lmdeploy serve command, please add the --log-level INFO option and share the startup log.

After adding this option I get an error:
usage: lmdeploy [-h] [-v] {lite,serve,convert,list,check_env,chat} ...
lmdeploy: error: unrecognized arguments:
What am I doing wrong?

lmdeploy serve api_server <path to the qwen1half-7b-chat model> --server-name 0.0.0.0 --server-port 6002 --tp 1 --cache-max-entry-count 0.2 --rope-scaling-factor 0.2 --session-len 32000 --log-level INFO

The log is as follows (note the 'INFO\r' in the first error; see the note after the log):

usage: lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT] [--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]] [--allow-credentials] [--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]] [--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]] [--qos-config-path QOS_CONFIG_PATH] [--backend {pytorch,turbomind}] [--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}] [--api-keys [API_KEYS ...]] [--ssl] [--meta-instruction META_INSTRUCTION] [--chat-template CHAT_TEMPLATE] [--cap {completion,infilling,chat,python}] [--revision REVISION] [--download-dir DOWNLOAD_DIR] [--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME] [--session-len SESSION_LEN] [--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT] [--cache-block-seq-len CACHE_BLOCK_SEQ_LEN] [--enable-prefix-caching] [--model-format {hf,llama,awq}] [--quant-policy {0,4,8}] [--rope-scaling-factor ROPE_SCALING_FACTOR] [--num-tokens-per-iter NUM_TOKENS_PER_ITER] [--max-prefill-iters MAX_PREFILL_ITERS] [--vision-max-batch-size VISION_MAX_BATCH_SIZE] model_path
lmdeploy serve api_server: error: argument --log-level: invalid choice: 'INFO\r' (choose from 'CRITICAL', 'FATAL', 'ERROR', 'WARN', 'WARNING', 'INFO', 'DEBUG', 'NOTSET')

(lmdeploy) [z00854892@aiservice-01 InternLM-main]$ sh lmdeploy_start.sh
2024-07-12 16:46:42,189 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name=None, model_format=None, tp=1, session_len=32000, max_batch_size=128, cache_max_entry_count=0.2, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.2, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-07-12 16:46:42,189 - lmdeploy - INFO - input chat_template_config=None
2024-07-12 16:46:42,379 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-07-12 16:46:42,379 - lmdeploy - INFO - model_source: ModelSource.HF_MODEL
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device does not support bfloat16. Set float16 forcefully
2024-07-12 16:46:42,857 - lmdeploy - INFO - model_config:

[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 151936
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 32000
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.2
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 4
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =

[TM][WARNING] [LlamaTritonModel] max_context_token_num = 32000.
2024-07-12 16:46:44,801 - lmdeploy - WARNING - get 291 model params
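A note on the earlier --log-level failure: the invalid choice 'INFO\r' shows a trailing carriage return, which usually means lmdeploy_start.sh was saved with Windows (CRLF) line endings. If that is the case, stripping the carriage returns before rerunning should clear that particular error, e.g.:

sed -i 's/\r$//' lmdeploy_start.sh   # or: dos2unix lmdeploy_start.sh
sh lmdeploy_start.sh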