QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Apache License 2.0

vLLM inference error: cannot get the factor field from rope_scaling #96

Closed: Potato-wll closed this issue 1 month ago

Potato-wll commented 2 months ago

Here is the command I ran:

python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /home/wangll/llm/model_download_demo/models/Qwen/Qwen2-VL-7B-Instruct

Here is the error message:

INFO 09-03 18:48:04 api_server.py:440] vLLM API server version 0.5.5
INFO 09-03 18:48:04 api_server.py:441] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/wangll/llm/model_download_demo/models/Qwen/Qwen2-VL-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2-VL-7B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 476, in <module>
    asyncio.run(run_server(args))
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 443, in run_server
    async with build_async_engine_client(args) as async_engine_client:
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client
    if (model_is_embedding(args.model, args.trust_remote_code,
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 71, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/config.py", line 214, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/home/wangll/.conda/envs/Qwen2vl/lib/python3.10/site-packages/vllm/config.py", line 1650, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError

I checked the model's config.json, and its rope_scaling indeed has no factor field:

"rope_scaling": {
  "type": "mrope",
  "mrope_section": [ 16, 24, 24 ]
},
"vocab_size": 152064
}

fyabc commented 2 months ago

@Potato-wll Hi, this is caused by a mismatched vLLM version; see #35 for details.

Potato-wll commented 2 months ago

After starting with vLLM I get the error "FlashAttention only supports Ampere GPUs or newer." My GPU is a T4, so I can't use FlashAttention. How and where do I turn it off?

fyabc commented 2 months ago

After starting with vLLM I get the error "FlashAttention only supports Ampere GPUs or newer." My GPU is a T4, so I can't use FlashAttention. How and where do I turn it off?

@Potato-wll Hi, we have updated the vLLM code and the corresponding Docker image to use xformers for inference when flash-attn is not supported. Please update to the latest code/image and try again.

xyfZzz commented 2 months ago

After starting with vLLM I get the error "FlashAttention only supports Ampere GPUs or newer." My GPU is a T4, so I can't use FlashAttention. How and where do I turn it off?

@Potato-wll Hi, we have updated the vLLM code and the corresponding Docker image to use xformers for inference when flash-attn is not supported. Please update to the latest code/image and try again.

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling
fyabc commented 2 months ago

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling

Hi, judging from the error message, this does not look like the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check that the git commit id is correct.

xyfZzz commented 2 months ago

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling

Hi, judging from the error message, this does not look like the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check that the git commit id is correct.

OK, thanks. I've switched to that branch and am reinstalling now. With this vLLM version, does Qwen2-VL support multiple images in a single request?

xyfZzz commented 2 months ago

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling

Hi, judging from the error message, this does not look like the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check that the git commit id is correct.

Hi, after switching to that branch I still get the same error, only the line numbers are different. Please take a look:

File "/mnt/xie/libs/vllm_qwen2vl/vllm/entrypoints/openai/api_server.py", line 73, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 227, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1747, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError
fyabc commented 2 months ago

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling

Hi, judging from the error message, this does not look like the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check that the git commit id is correct.

OK, thanks. I've switched to that branch and am reinstalling now. With this vLLM version, does Qwen2-VL support multiple images in a single request?

Qwen2-VL supports multiple images in a single request; please refer to here for the specific way to call it.
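For reference, a minimal sketch of such a multi-image request against the OpenAI-compatible server started earlier in this thread; the client setup, example URLs, and the note about --limit-mm-per-prompt are assumptions rather than content from the linked documentation:

from openai import OpenAI

# Assumes the api_server from earlier in this thread is running locally on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            # Two images in a single request, followed by a text instruction.
            {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
            {"type": "text", "text": "What are the differences between these two images?"},
        ],
    }],
)
print(response.choices[0].message.content)

Depending on the vLLM build, the server may also need to be started with a higher per-prompt image limit (the --limit-mm-per-prompt option visible in the argument dumps above) before multi-image requests are accepted.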

fyabc commented 2 months ago

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling

Hi, judging from the error message, this does not look like the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check that the git commit id is correct.

Hi, after switching to that branch I still get the same error, only the line numbers are different. Please take a look:

File "/mnt/xie/libs/vllm_qwen2vl/vllm/entrypoints/openai/api_server.py", line 73, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 227, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1747, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError

Could you share the contents of the config.json in the model files you downloaded? It looks like the error occurs when it is read here.

xyfZzz commented 2 months ago

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling

Hi, judging from the error message, this does not look like the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check that the git commit id is correct.

Hi, after switching to that branch I still get the same error, only the line numbers are different. Please take a look:

File "/mnt/xie/libs/vllm_qwen2vl/vllm/entrypoints/openai/api_server.py", line 73, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 227, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1747, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError

Could you share the contents of the config.json in the model files you downloaded? It looks like the error occurs when it is read here. The config is as follows:

{
  "architectures": [
    "Qwen2VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "vision_start_token_id": 151652,
  "vision_end_token_id": 151653,
  "vision_token_id": 151654,
  "image_token_id": 151655,
  "video_token_id": 151656,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vision_config": {
    "depth": 32,
    "embed_dim": 1280,
    "mlp_ratio": 4,
    "num_heads": 16,
    "in_chans": 3,
    "hidden_size": 3584,
    "patch_size": 14,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "temporal_patch_size": 2
  },
  "rope_scaling": {
    "type": "mrope",
    "mrope_section": [
      16,
      24,
      24
    ]
  },
  "vocab_size": 152064
}
xyfZzz commented 2 months ago

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling

Hi, judging from the error message, this does not look like the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check that the git commit id is correct.

My installation steps are as follows; is there anything wrong with them?

git clone https://github.com/fyabc/vllm.git
cd vllm
git checkout origin/add_qwen2_vl_new
pip install -e . --extra-index-url https://download.pytorch.org/whl/cu121
docShen commented 2 months ago

After installing from your latest vLLM code, I still get this error:

  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/mnt/xie/libs/vllm_qwen2vl/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling

Hi, judging from the error message, this does not look like the latest version matching https://github.com/fyabc/vllm/tree/add_qwen2_vl_new. Please check that the git commit id is correct.

My installation steps are as follows; is there anything wrong with them?

git clone https://github.com/fyabc/vllm.git
cd vllm
git checkout origin/add_qwen2_vl_new
pip install -e . --extra-index-url https://download.pytorch.org/whl/cu121

Same here. I installed from the officially forked vLLM version, but I still get this error.

fyabc commented 2 months ago

@docShen @xyfZzz Could you share the transformers version you are currently using (pip list | grep transformers)?

xyfZzz commented 2 months ago

@docShen @xyfZzz Could you share the transformers version you are currently using (pip list | grep transformers)?

4.45.0.dev0

xyfZzz commented 2 months ago

@docShen @xyfZzz Could you share the transformers version you are currently using (pip list | grep transformers)?

Could this be caused by the issue described here: https://github.com/vllm-project/vllm/pull/7905#issuecomment-2338409524 ?

SiyangJ commented 2 months ago

+1, exact same problem. I followed your steps exactly, @fyabc, and ran into this when starting the vLLM OpenAI server:

root@3f75a56c8be9:/vllm-workspace# python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /weights/Qwen2-VL-7B-Instruct
INFO 09-10 06:32:14 api_server.py:495] vLLM API server version 0.6.0
INFO 09-10 06:32:14 api_server.py:496] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/weights/Qwen2-VL-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2-VL-7B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 531, in <module>
    asyncio.run(run_server(args))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 498, in run_server
    async with build_async_engine_client(args) as async_engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 110, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 132, in build_async_engine_client_from_engine_args
    if (model_is_embedding(engine_args.model, engine_args.trust_remote_code,
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 73, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 224, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1740, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError

fyabc commented 2 months ago

@xyfZzz @docShen @Potato-wll Hi, this should be a bug in the latest version of transformers. I have filed an issue about it; for now, please install a bug-free version as follows:

pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
lilin-git commented 2 months ago

Add a factor field to rope_scaling in the config file; alternatively, add the line if rope_scaling and rope_scaling.get('type','default') == 'default': rope_scaling['type'] = 'mrope' just before https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175. Either works around the problem for now.
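For orientation, a minimal sketch of where that one-liner would sit; apart from the if statement quoted above, the surrounding names are assumed for illustration and are not taken from the linked file:

# Hypothetical context around qwen2.py#L175 (illustrative only).
rope_scaling = getattr(config, "rope_scaling", None)

# Workaround from the comment above: the transformers bug discussed in this thread
# can rewrite the rope type to "default", so restore "mrope" before the rotary
# embedding is constructed.
if rope_scaling and rope_scaling.get('type', 'default') == 'default':
    rope_scaling['type'] = 'mrope'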

fyabc commented 2 months ago

Add a factor field to rope_scaling in the config file; alternatively, add the line if rope_scaling and rope_scaling.get('type','default') == 'default': rope_scaling['type'] = 'mrope' just before https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175. Either works around the problem for now.

@lilin-git Thanks for the explanation. The first method mentioned here works; the second is not recommended (rope_scaling['type'] is also used outside model initialization, so changing only this spot would introduce bugs).

@xyfZzz @Potato-wll @docShen If reinstalling transformers is inconvenient, you can also use the method mentioned above.

xyfZzz commented 2 months ago

Add a factor field to rope_scaling in the config file; alternatively, add the line if rope_scaling and rope_scaling.get('type','default') == 'default': rope_scaling['type'] = 'mrope' just before https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175. Either works around the problem for now.

@lilin-git Thanks for the explanation. The first method mentioned here works; the second is not recommended (rope_scaling['type'] is also used outside model initialization, so changing only this spot would introduce bugs).

@xyfZzz @Potato-wll @docShen If reinstalling transformers is inconvenient, you can also use the method mentioned above.

It runs normally now, thanks!

wuzhizhige commented 2 months ago

After adding factor and modifying qwen2.py (https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175), running python -m vllm.entrypoints.openai.api_server --model /data/Qwen2-VL-7B-Instruct --served-model-name Qwen2-VL-7B --port 10000 fails with:

  File "/data/vllm/vllm/model_executor/models/__init__.py", line 170, in resolve_model_cls
    raise ValueError(
ValueError: Model architectures ['Qwen2VLForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'UltravoxModel', 'QWenLMHeadModel', 'BartModel', 'BartForConditionalGeneration']
ERROR 09-10 19:42:19 api_server.py:188] RPCServer process died before responding to readiness probe

fyabc commented 2 months ago

After adding factor and modifying qwen2.py (https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175), running python -m vllm.entrypoints.openai.api_server --model /data/Qwen2-VL-7B-Instruct --served-model-name Qwen2-VL-7B --port 10000 fails with:

  File "/data/vllm/vllm/model_executor/models/__init__.py", line 170, in resolve_model_cls
    raise ValueError(
ValueError: Model architectures ['Qwen2VLForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'UltravoxModel', 'QWenLMHeadModel', 'BartModel', 'BartForConditionalGeneration']
ERROR 09-10 19:42:19 api_server.py:188] RPCServer process died before responding to readiness probe

Hi, please check the vLLM version you are using; it does not appear to be the correct one.

wuzhizhige commented 2 months ago

After adding factor and modifying qwen2.py (https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175), running python -m vllm.entrypoints.openai.api_server --model /data/Qwen2-VL-7B-Instruct --served-model-name Qwen2-VL-7B --port 10000 fails with:

  File "/data/vllm/vllm/model_executor/models/__init__.py", line 170, in resolve_model_cls
    raise ValueError(
ValueError: Model architectures ['Qwen2VLForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'UltravoxModel', 'QWenLMHeadModel', 'BartModel', 'BartForConditionalGeneration']
ERROR 09-10 19:42:19 api_server.py:188] RPCServer process died before responding to readiness probe

Hi, please check the vLLM version you are using; it does not appear to be the correct one.

Both 0.6.0 and 0.5.5 give this error.

lilin-git commented 2 months ago

After adding factor and modifying qwen2.py (https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175), running python -m vllm.entrypoints.openai.api_server --model /data/Qwen2-VL-7B-Instruct --served-model-name Qwen2-VL-7B --port 10000 fails with:

  File "/data/vllm/vllm/model_executor/models/__init__.py", line 170, in resolve_model_cls
    raise ValueError(
ValueError: Model architectures ['Qwen2VLForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'UltravoxModel', 'QWenLMHeadModel', 'BartModel', 'BartForConditionalGeneration']
ERROR 09-10 19:42:19 api_server.py:188] RPCServer process died before responding to readiness probe

Hi, please check the vLLM version you are using; it does not appear to be the correct one.

Both 0.6.0 and 0.5.5 give this error.

Er, look at the link. That is not the official version; the official version does not support it yet. Use this project: https://github.com/fyabc/vllm/tree/add_qwen2_vl_new

fyabc commented 2 months ago

Both 0.6.0 and 0.5.5 give this error.

@wuzhizhige Qwen2-VL support in vLLM has not been merged upstream yet; please use this version: https://github.com/fyabc/vllm/tree/add_qwen2_vl_new

azuercici commented 2 months ago

@xyfZzz @docShen @Potato-wll Hi, this should be a bug in the latest version of transformers. I have filed an issue about it; for now, please install a bug-free version as follows:

pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

Hi, I installed the transformers version you specified and still have this problem. How exactly should factor be added to config.json? My config has "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "type": "mrope" — where should it be added?

fyabc commented 2 months ago

@xyfZzz @docShen @Potato-wll Hi, this should be a bug in the latest version of transformers. I have filed an issue about it; for now, please install a bug-free version as follows:

pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

Hi, I installed the transformers version you specified and still have this problem. How exactly should factor be added to config.json? My config has "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "type": "mrope" — where should it be added?

@azuercici You can modify it as follows:

{
  ...
  "rope_scaling": {
    "type": "mrope",
    "factor": 1,
    "mrope_section": [
      16,
      24,
      24
    ]
  },
}
EricHuiK commented 1 month ago

I modified both the config file and vLLM, but I still get an error: NameError: name 'rod_scaling' is not defined

fyabc commented 1 month ago

I modified both the config file and vLLM, but I still get an error: NameError: name 'rod_scaling' is not defined

Note that the name should be 'rope_scaling', not 'rod_scaling'.

zhangfan-algo commented 1 month ago

@xyfZzz @docShen @Potato-wll Hi, this should be a bug in the latest version of transformers. I have filed an issue about it; for now, please install a bug-free version as follows:

pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

Hi, I installed the transformers version you specified and still have this problem. How exactly should factor be added to config.json? My config has "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "type": "mrope" — where should it be added?

Using that version of transformers still gives the original error.

zhangfan-algo commented 1 month ago

[screenshot of the error]

imkero commented 1 month ago

Passing a rope_scaling argument when initializing vLLM, overriding the original config, can work around this for now.

llm = LLM(
    model=model_dir,
    rope_scaling={
        "type": "mrope",
        "mrope_section": [
            16,
            24,
            24
        ],
    },
)
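For the OpenAI-compatible server used earlier in this thread, the argument dumps above show a matching rope_scaling engine option, so the same override can presumably also be passed as a JSON value when launching the server; the exact flag syntax depends on the vLLM version.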
namelyuu commented 1 week ago

Add a factor field to rope_scaling in the config file; alternatively, add the line if rope_scaling and rope_scaling.get('type','default') == 'default': rope_scaling['type'] = 'mrope' just before https://github.com/fyabc/vllm/blob/6f3116c9dad0537b2858af703938aa9bf6c25bcf/vllm/model_executor/models/qwen2.py#L175. Either works around the problem for now.

@lilin-git Thanks for the explanation. The first method mentioned here works; the second is not recommended (rope_scaling['type'] is also used outside model initialization, so changing only this spot would introduce bugs).

@xyfZzz @Potato-wll @docShen If reinstalling transformers is inconvenient, you can also use the method mentioned above.

@fyabc It looks like even the latest transformers version has not merged this change. Are there any other options?