InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] Support openbmb/MiniCPM-V-2_6 #2349

Closed HSIAOKUOWEI closed 2 months ago

HSIAOKUOWEI commented 2 months ago

Motivation

I cannot deploy openbmb/MiniCPM-V-2_6 at the moment. Can you support it?

Related resources

(lmdeploy) C:\Users\mi_ap>lmdeploy serve api_server --server-port 8000 --log-level INFO --backend turbomind --cache-max-entry-count 0.2 --model-name MiniCPM-V-2-5 --vision-max-batch-size 2 D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin, please note cuda version should >= 11.3 when compiled with cuda 11
2024-08-21 16:55:36,332 - lmdeploy - INFO - matching vision model: MiniCPMVModel
D:\miniconda3\envs\lmdeploy\lib\site-packages\transformers\models\auto\image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-21 16:55:47,863 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='MiniCPM-V-2-5', model_format=None, tp=1, session_len=None, max_batch_size=128, cache_max_entry_count=0.2, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-21 16:55:47,863 - lmdeploy - INFO - input chat_template_config=None
2024-08-21 16:55:47,863 - lmdeploy - WARNING - Did not find a chat template matching D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6.
2024-08-21 16:55:47,877 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='base', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-08-21 16:55:47,877 - lmdeploy - INFO - model_source: hf_model
2024-08-21 16:55:47,877 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-21 16:55:48,348 - lmdeploy - INFO - model_config:

[llama]
model_name = base
model_arch = MiniCPMV
tensor_para_size = 1
head_num = 28
kv_head_num = 4
vocab_size = 151666
num_layer = 28
inter_size = 18944
norm_eps = 1e-06
attn_bias = 1
start_id = 151644
end_id = 151645
session_len = 32776
weight_type = bf16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_prefill_token_num = 8192
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.2
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 5
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
original_max_position_embeddings = 0
rope_scaling_type =
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
low_freq_factor = 1.0
high_freq_factor = 1.0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =

[TM][WARNING] [LlamaTritonModel] max_context_token_num = 32776.
[TM][INFO] Barrier(1)
[TM][INFO] Model:
head_num: 28
kv_head_num: 4
size_per_head: 128
inter_size: 18944
num_layer: 28
vocab_size: 151666
attn_bias: 1
max_batch_size: 128
max_prefill_token_num: 8192
max_context_token_num: 32776
session_len: 32776
step_length: 1
cache_max_entry_count: 0.2
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
use_context_fmha: 1
start_id: 151644
tensor_para_size: 1
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name: base
model_dir:
quant_policy: 0
group_size: 0

2024-08-21 16:55:48,490 - lmdeploy - WARNING - get 255 model params
Convert to turbomind format:   0%|          | 0/28 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\miniconda3\envs\lmdeploy\Scripts\lmdeploy.exe\__main__.py", line 7, in <module>
    sys.exit(run())
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\cli\entrypoint.py", line 36, in run
    args.run(args)
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\cli\serve.py", line 298, in api_server
    run_api_server(args.model_path,
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\openai\api_server.py", line 1285, in serve
    VariableInterface.async_engine = pipeline_class(
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\vl_async_engine.py", line 24, in __init__
    super().__init__(model_path, **kwargs)
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\async_engine.py", line 190, in __init__
    self._build_turbomind(model_path=model_path,
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\async_engine.py", line 235, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 340, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 144, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 257, in _from_hf
    output_model.export()
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\base.py", line 289, in export
    self.export_transformer_block(bin, i)
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 65, in export_transformer_block
    qb, kb, vb, ob = transpose_tensor([qb, kb, vb, ob])
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 13, in transpose_tensor
    output = [x.cuda().t() for x in input]
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 13, in <listcomp>
    output = [x.cuda().t() for x in input]
AttributeError: 'NoneType' object has no attribute 'cuda'
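For context on the traceback: transpose_tensor moves every tensor in the list to the GPU unconditionally, so a single missing bias kills the export. A plausible reading (an assumption, not confirmed in this thread) is that MiniCPM-V-2_6's Qwen2-style attention ships q/k/v biases but no o_proj bias, so ob arrives as None. A minimal sketch of the failure mode; the None-aware variant at the end is purely illustrative, not lmdeploy's actual patch:

```python
# Sketch of the crash in lmdeploy/turbomind/deploy/target_model/fp.py:13.
# Assumption (not confirmed here): the o_proj bias is absent in
# MiniCPM-V-2_6's Qwen2-style checkpoint, so `ob` is None during export.
import torch

def transpose_tensor(input: list):
    # Mirrors lmdeploy's helper: move each tensor to GPU, then transpose.
    return [x.cuda().t() for x in input]  # fails on any None entry

qb, kb, vb = (torch.zeros(512) for _ in range(3))  # stand-in bias tensors
ob = None  # the missing output-projection bias

# transpose_tensor([qb, kb, vb, ob])
# -> AttributeError: 'NoneType' object has no attribute 'cuda'

def transpose_tensor_skipping_none(input: list):
    # Illustrative None-aware variant, not lmdeploy's actual fix.
    return [x.cuda().t() if x is not None else None for x in input]
```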

Additional context

No response

irexyc commented 2 months ago

Should be supported in the latest code. You can download it from here:

https://github.com/zhyncs/lmdeploy-build/releases/tag/3ffb0c4

HSIAOKUOWEI commented 2 months ago

Same error, bro:

(lmdeploy) C:\Users\mi_ap>pip install D:\lmdeploy-0.5.3+cu121+c1923f4-cp310-cp310-win_amd64.whl --force-reinstall --no-deps
Defaulting to user installation because normal site-packages is not writeable
Processing d:\lmdeploy-0.5.3+cu121+c1923f4-cp310-cp310-win_amd64.whl
Installing collected packages: lmdeploy
  Attempting uninstall: lmdeploy
    Found existing installation: lmdeploy 0.5.3
    Uninstalling lmdeploy-0.5.3:
      Successfully uninstalled lmdeploy-0.5.3
Successfully installed lmdeploy-0.5.3
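Since pip fell back to a user-site install here ("normal site-packages is not writeable") and the nightly wheels all report version 0.5.3, a quick check of which copy Python actually imports can rule out a stale install shadowing the new one; a minimal sketch:

```python
# Confirm which lmdeploy installation is actually imported. Note these
# nightly builds all report 0.5.3, so the commit (c1923f4 in the wheel
# filename) cannot be recovered from the version string alone.
from importlib.metadata import version
import lmdeploy

print(version('lmdeploy'))  # -> 0.5.3 for both the old and the new wheel
print(lmdeploy.__file__)    # shows whether the user-site copy won on sys.path
```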

(lmdeploy) C:\Users\mi_ap>lmdeploy serve api_server --server-name 0.0.0.0 --server-port 8000 --model-name MiniCPM-V-2_6 --cache-max-entry-count 0.4 --vision-max-batch-size 3 --log-level INFO D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin, please note cuda version should >= 11.3 when compiled with cuda 11
2024-08-23 11:37:44,635 - lmdeploy - INFO - matching vision model: MiniCPMVModel
D:\miniconda3\envs\lmdeploy\lib\site-packages\transformers\models\auto\image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-23 11:39:00,626 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_format=None, tp=1, session_len=None, max_batch_size=128, cache_max_entry_count=0.4, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-23 11:39:00,626 - lmdeploy - INFO - input chat_template_config=None
2024-08-23 11:39:00,627 - lmdeploy - WARNING - Did not find a chat template matching D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6.
2024-08-23 11:39:00,683 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='base', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-08-23 11:39:00,683 - lmdeploy - INFO - model_source: hf_model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-23 11:39:01,185 - lmdeploy - INFO - model_config:

[llama]
model_name =
chat_template =
model_arch = MiniCPMV
tensor_para_size = 1
head_num = 28
kv_head_num = 4
vocab_size = 151666
num_layer = 28
inter_size = 18944
norm_eps = 1e-06
attn_bias = 1
start_id = 151644
end_id = 151645
session_len = 32776
weight_type = bf16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_prefill_token_num = 8192
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.4
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 5
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
original_max_position_embeddings = 0
rope_scaling_type =
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
low_freq_factor = 1.0
high_freq_factor = 1.0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =

[TM][WARNING] [LlamaTritonModel] max_context_token_num = 32776.
[TM][INFO] Barrier(1)
[TM][INFO] Model:
head_num: 28
kv_head_num: 4
size_per_head: 128
inter_size: 18944
num_layer: 28
vocab_size: 151666
attn_bias: 1
max_batch_size: 128
max_prefill_token_num: 8192
max_context_token_num: 32776
session_len: 32776
step_length: 1
cache_max_entry_count: 0.4
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
use_context_fmha: 1
start_id: 151644
tensor_para_size: 1
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name:
model_dir:
quant_policy: 0
group_size: 0

2024-08-23 11:39:01,327 - lmdeploy - WARNING - get 255 model params
Convert to turbomind format:   0%|          | 0/28 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\miniconda3\envs\lmdeploy\Scripts\lmdeploy.exe\__main__.py", line 7, in <module>
    sys.exit(run())
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\cli\entrypoint.py", line 36, in run
    args.run(args)
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\cli\serve.py", line 273, in api_server
    run_api_server(args.model_path,
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\openai\api_server.py", line 923, in serve
    VariableInterface.async_engine = pipeline_class(
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\vl_async_engine.py", line 24, in __init__
    super().__init__(model_path, **kwargs)
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\async_engine.py", line 155, in __init__
    self._build_turbomind(model_path=model_path,
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\async_engine.py", line 198, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 282, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 102, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 201, in _from_hf
    tm_model.export()
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\base.py", line 289, in export
    self.export_transformer_block(bin, i)
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 64, in export_transformer_block
    qb, kb, vb, ob = transpose_tensor([qb, kb, vb, ob])
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 13, in transpose_tensor
    output = [x.cuda().t() for x in input]
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 13, in <listcomp>
    output = [x.cuda().t() for x in input]
AttributeError: 'NoneType' object has no attribute 'cuda'

irexyc commented 2 months ago

@HSIAOKUOWEI

Sorry, it should be this commit https://github.com/zhyncs/lmdeploy-build/releases/tag/3ffb0c4

HSIAOKUOWEI commented 2 months ago

Bro, starting MiniCPM-V-2_6 requires setting --rope-scaling-factor, because the default value is None. [screenshot]

Can run:

(lmdeploy) C:\Users\mi_ap>lmdeploy serve api_server --server-name 0.0.0.0 --server-port 8000 --model-name MiniCPM-V-2_6 --cache-max-entry-count 0.4 --vision-max-batch-size 3 --log-level INFO --rope-scaling-factor 1.0 D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin, please note cuda version should >= 11.3 when compiled with cuda 11
2024-08-23 13:21:41,214 - lmdeploy - INFO - matching vision model: MiniCPMVModel
2024-08-23 13:21:51,986 - lmdeploy - INFO - using _forward_v2_6
D:\miniconda3\envs\lmdeploy\lib\site-packages\transformers\models\auto\image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead

Cannot run:

(lmdeploy) C:\Users\mi_ap>lmdeploy serve api_server --server-name 0.0.0.0 --server-port 8000 --model-name MiniCPM-V-2_6 --cache-max-entry-count 0.4 --vision-max-batch-size 3 --log-level INFO D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin, please note cuda version should >= 11.3 when compiled with cuda 11
Traceback (most recent call last):
  File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\miniconda3\envs\lmdeploy\Scripts\lmdeploy.exe\__main__.py", line 7, in <module>
    sys.exit(run())
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\cli\entrypoint.py", line 36, in run
    args.run(args)
  File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\cli\serve.py", line 266, in api_server
    backend_config = TurbomindEngineConfig(
  File "D:\miniconda3\envs\lmdeploy\lib\site-packages\pydantic\_internal\_dataclasses.py", line 141, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
pydantic_core._pydantic_core.ValidationError: 1 validation error for TurbomindEngineConfig
rope_scaling_factor
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/float_type
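The ValidationError itself is easy to reproduce: TurbomindEngineConfig is a pydantic dataclass whose rope_scaling_factor field is a plain float, so a None coming from the CLI default is rejected before the server even starts. A minimal sketch with a hypothetical stand-in class:

```python
# Reproduction sketch of the float_type error. EngineConfigSketch is a
# hypothetical stand-in; lmdeploy's real TurbomindEngineConfig has many
# more fields but the same float-typed rope_scaling_factor.
from pydantic.dataclasses import dataclass

@dataclass
class EngineConfigSketch:
    rope_scaling_factor: float = 0.0

EngineConfigSketch(rope_scaling_factor=1.0)  # accepted, as the user found
try:
    EngineConfigSketch(rope_scaling_factor=None)  # what the CLI default passed
except Exception as err:
    print(err)  # 1 validation error ... Input should be a valid number
```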

irexyc commented 2 months ago

@HSIAOKUOWEI

It seems to be a bug and will be fixed in our next release: https://github.com/InternLM/lmdeploy/pull/2362
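Until that fix ships, one hedged workaround is to construct the engine from Python, where the dataclass default of rope_scaling_factor=0.0 applies and no None ever reaches pydantic. A sketch using lmdeploy's documented VLM pipeline API (the image URL is the example from lmdeploy's docs):

```python
# Workaround sketch via the Python API instead of the CLI: the
# TurbomindEngineConfig dataclass defaults rope_scaling_factor to 0.0,
# sidestepping the CLI's None default until the fix is released.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    r'D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6',
    backend_config=TurbomindEngineConfig(cache_max_entry_count=0.4),
)
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
print(pipe(('describe this image', image)))
```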

lvhan028 commented 2 months ago

Please try the latest version, v0.6.0a0.