Closed: HSIAOKUOWEI closed this issue 2 months ago.
Should be supported in the latest code; you can download it from here:
https://github.com/zhyncs/lmdeploy-build/releases/tag/3ffb0c4
Still the same, bro.
(lmdeploy) C:\Users\mi_ap>pip install D:\lmdeploy-0.5.3+cu121+c1923f4-cp310-cp310-win_amd64.whl --force-reinstall --no-deps
Defaulting to user installation because normal site-packages is not writeable
Processing d:\lmdeploy-0.5.3+cu121+c1923f4-cp310-cp310-win_amd64.whl
Installing collected packages: lmdeploy
Attempting uninstall: lmdeploy
Found existing installation: lmdeploy 0.5.3
Uninstalling lmdeploy-0.5.3:
Successfully uninstalled lmdeploy-0.5.3
Successfully installed lmdeploy-0.5.3
(lmdeploy) C:\Users\mi_ap>lmdeploy serve api_server --server-name 0.0.0.0 --server-port 8000 --model-name MiniCPM-V-2_6 --cache-max-entry-count 0.4 --vision-max-batch-size 3 --log-level INFO D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin, please note cuda version should >= 11.3 when compiled with cuda 11
2024-08-23 11:37:44,635 - lmdeploy - INFO - matching vision model: MiniCPMVModel
D:\miniconda3\envs\lmdeploy\lib\site-packages\transformers\models\auto\image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-23 11:39:00,626 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_format=None, tp=1, session_len=None, max_batch_size=128, cache_max_entry_count=0.4, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-23 11:39:00,626 - lmdeploy - INFO - input chat_template_config=None
2024-08-23 11:39:00,627 - lmdeploy - WARNING - Did not find a chat template matching D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6.
2024-08-23 11:39:00,683 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='base', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-08-23 11:39:00,683 - lmdeploy - INFO - model_source: hf_model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-23 11:39:01,185 - lmdeploy - INFO - model_config:
[llama]
model_name =
chat_template =
model_arch = MiniCPMV
tensor_para_size = 1
head_num = 28
kv_head_num = 4
vocab_size = 151666
num_layer = 28
inter_size = 18944
norm_eps = 1e-06
attn_bias = 1
start_id = 151644
end_id = 151645
session_len = 32776
weight_type = bf16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_prefill_token_num = 8192
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.4
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 5
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
original_max_position_embeddings = 0
rope_scaling_type =
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
low_freq_factor = 1.0
high_freq_factor = 1.0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] max_context_token_num = 32776.
[TM][INFO] Barrier(1)
[TM][INFO] Model:
head_num: 28
kv_head_num: 4
size_per_head: 128
inter_size: 18944
num_layer: 28
vocab_size: 151666
attn_bias: 1
max_batch_size: 128
max_prefill_token_num: 8192
max_context_token_num: 32776
session_len: 32776
step_length: 1
cache_max_entry_count: 0.4
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
use_context_fmha: 1
start_id: 151644
tensor_para_size: 1
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name:
model_dir:
quant_policy: 0
group_size: 0
2024-08-23 11:39:01,327 - lmdeploy - WARNING - get 255 model params
Convert to turbomind format: 0%| | 0/28 [00:00<?, ?it/s]Traceback (most recent call last):
File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\miniconda3\envs\lmdeploy\Scripts\lmdeploy.exe__main.py", line 7, in
@HSIAOKUOWEI
Sorry, it should be this commit https://github.com/zhyncs/lmdeploy-build/releases/tag/3ffb0c4
Bro, starting MiniCPM-V-2_6 requires setting --rope-scaling-factor, because the default value is None (a Python-API sketch of the same workaround follows after the logs below).
Can run
(lmdeploy) C:\Users\mi_ap>lmdeploy serve api_server --server-name 0.0.0.0 --server-port 8000 --model-name MiniCPM-V-2_6 --cache-max-entry-count 0.4 --vision-max-batch-size 3 --log-level INFO --rope-scaling-factor 1.0 D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin, please note cuda version should >= 11.3 when compiled with cuda 11
2024-08-23 13:21:41,214 - lmdeploy - INFO - matching vision model: MiniCPMVModel
2024-08-23 13:21:51,986 - lmdeploy - INFO - using _forward_v2_6
D:\miniconda3\envs\lmdeploy\lib\site-packages\transformers\models\auto\image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
Cannot run
(lmdeploy) C:\Users\mi_ap>lmdeploy serve api_server --server-name 0.0.0.0 --server-port 8000 --model-name MiniCPM-V-2_6 --cache-max-entry-count 0.4 --vision-max-batch-size 3 --log-level INFO D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin, please note cuda version should >= 11.3 when compiled with cuda 11
Traceback (most recent call last):
File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\miniconda3\envs\lmdeploy\Scripts\lmdeploy.exe__main.py", line 7, in
@HSIAOKUOWEI
It seems to be a bug and will be fixed in our next release. https://github.com/InternLM/lmdeploy/pull/2362
Please try the latest version v0.6.0a0
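After upgrading, a quick hedged check that MiniCPM-V-2_6 now starts without the --rope-scaling-factor workaround (the install command, e.g. pip install lmdeploy==0.6.0a0, and the availability of a matching Windows wheel are assumptions; the model path is the one used in this thread):

```python
# Hedged sketch: verify the pre-release fixes the startup failure.
# Assumed upgrade step beforehand, e.g.: pip install lmdeploy==0.6.0a0
from importlib.metadata import version
from lmdeploy import pipeline

print('lmdeploy version:', version('lmdeploy'))

# No rope_scaling_factor override: with the fix, the default engine
# config should be enough to load the model.
pipe = pipeline(r'D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6')
print(pipe('Hello, who are you?').text)
```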
Motivation
I cannot deploy openbmb/MiniCPM-V-2_6 right now. Can you support it?
Related resources
(lmdeploy) C:\Users\mi_ap>lmdeploy serve api_server --server-port 8000 --log-level INFO --backend turbomind --cache-max-entry-count 0.2 --model-name MiniCPM-V-2-5 --vision-max-batch-size 2 D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin, please note cuda version should >= 11.3 when compiled with cuda 11
2024-08-21 16:55:36,332 - lmdeploy - INFO - matching vision model: MiniCPMVModel
D:\miniconda3\envs\lmdeploy\lib\site-packages\transformers\models\auto\image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-21 16:55:47,863 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='MiniCPM-V-2-5', model_format=None, tp=1, session_len=None, max_batch_size=128, cache_max_entry_count=0.2, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-21 16:55:47,863 - lmdeploy - INFO - input chat_template_config=None
2024-08-21 16:55:47,863 - lmdeploy - WARNING - Did not find a chat template matching D:\LLM_Project\Baseline_Multimodal_Model\minicpm-v2.6\MiniCPM-V-2_6.
2024-08-21 16:55:47,877 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='base', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-08-21 16:55:47,877 - lmdeploy - INFO - model_source: hf_model
2024-08-21 16:55:47,877 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-21 16:55:48,348 - lmdeploy - INFO - model_config:
[llama]
model_name = base
model_arch = MiniCPMV
tensor_para_size = 1
head_num = 28
kv_head_num = 4
vocab_size = 151666
num_layer = 28
inter_size = 18944
norm_eps = 1e-06
attn_bias = 1
start_id = 151644
end_id = 151645
session_len = 32776
weight_type = bf16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_prefill_token_num = 8192
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.2
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 5
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
original_max_position_embeddings = 0
rope_scaling_type =
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
low_freq_factor = 1.0
high_freq_factor = 1.0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] max_context_token_num = 32776.
[TM][INFO] Barrier(1)
[TM][INFO] Model:
head_num: 28
kv_head_num: 4
size_per_head: 128
inter_size: 18944
num_layer: 28
vocab_size: 151666
attn_bias: 1
max_batch_size: 128
max_prefill_token_num: 8192
max_context_token_num: 32776
session_len: 32776
step_length: 1
cache_max_entry_count: 0.2
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
use_context_fmha: 1
start_id: 151644
tensor_para_size: 1
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name: base
model_dir:
quant_policy: 0
group_size: 0
2024-08-21 16:55:48,490 - lmdeploy - WARNING - get 255 model params
Convert to turbomind format: 0%| | 0/28 [00:00<?, ?it/s]Traceback (most recent call last):
File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\miniconda3\envs\lmdeploy\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\miniconda3\envs\lmdeploy\Scripts\lmdeploy.exe\__main__.py", line 7, in <module>
sys.exit(run())
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\cli\entrypoint.py", line 36, in run
args.run(args)
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\cli\serve.py", line 298, in api_server
run_api_server(args.model_path,
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\openai\api_server.py", line 1285, in serve
VariableInterface.async_engine = pipeline_class(
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\vl_async_engine.py", line 24, in init
super().init(model_path, **kwargs)
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\async_engine.py", line 190, in init__
self._build_turbomind(model_path=model_path,
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\serve\async_engine.py", line 235, in _build_turbomind
self.engine = tm.TurboMind.from_pretrained(
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 340, in from_pretrained
return cls(model_path=pretrained_model_name_or_path,
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 144, in init
self.model_comm = self._from_hf(model_source=model_source,
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\turbomind.py", line 257, in _from_hf
output_model.export()
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\base.py", line 289, in export
self.export_transformer_block(bin, i)
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 65, in export_transformer_block
qb, kb, vb, ob = transpose_tensor([qb, kb, vb, ob])
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 13, in transpose_tensor
output = [x.cuda().t() for x in input]
File "C:\Users\mi_ap\AppData\Roaming\Python\Python310\site-packages\lmdeploy\turbomind\deploy\target_model\fp.py", line 13, in
output = [x.cuda().t() for x in input]
AttributeError: 'NoneType' object has no attribute 'cuda'
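For context, the failing line in the traceback is a plain list comprehension that moves each tensor to the GPU and transposes it, so it raises as soon as one entry in the q/k/v/o bias list is None. A stand-alone illustration of the pattern (not lmdeploy's actual fix, and the guess that the o-projection bias is the missing one is mine; a CUDA device is needed to reproduce it exactly):

```python
import torch

def transpose_tensor(input_tensors):
    # Same shape of code as the failing line in fp.py: every element is
    # assumed to be a real tensor, so a None bias raises AttributeError.
    return [x.cuda().t() for x in input_tensors]

qb, kb, vb = (torch.zeros(4, 4) for _ in range(3))
ob = None  # hypothetical: a projection without a bias tensor
transpose_tensor([qb, kb, vb, ob])
# AttributeError: 'NoneType' object has no attribute 'cuda'
```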