Meituan-AutoML / MobileVLM

Strong and Open Vision Language Assistant for Mobile Devices
Apache License 2.0

Time comparison between LDP and direct MLP encoding in the CLIP part #23

Closed JennieGao-njust closed 4 months ago

JennieGao-njust commented 5 months ago

My questions are the same as those raised in https://github.com/Meituan-AutoML/MobileVLM/issues/13:

1. The speed of encode_image: encode_image_with_clip: image encoded in 20666.23 ms by CLIP ( 143.52 ms per image patch). This is MobileVLM's image-encoding time on a Snapdragon 8 Gen 1, and it is consistent with https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/MobileVLM-README.md#some-result-on-android-with-snapdragon-888-chip. However, it does not match the 6-8 ms/patch (Snapdragon 888) reported in the paper (screenshot: https://private-user-images.githubusercontent.com/96759404/301446311-7a451e58-8fa2-4c45-8edc-7d998686bb62.png). What causes the difference between these two results?

2. I compared the CLIP-stage timing in the same Linux environment. With MobileVLM: encode_image_with_clip: image encoded in 1319.52 ms by CLIP ( 9.16 ms per image patch). With TinyLLaMA: encode_image_with_clip: image encoded in 1225.62 ms by CLIP ( 2.13 ms per image patch). Although the LDP structure reduces the number of tokens, the average time to process each patch becomes longer, and the total time stays about the same as without LDP. In my understanding, LDP should be faster than the MLP?

3. According to the paper, LDP has 20M parameters, but the llava.projector file I converted following the instructions is 73 MB. Should the actual llava.projector be 20M?

I am quite confused about this and look forward to your answer.

YangYang-DLUT commented 4 months ago

A1: The log of the official llama.cpp with LLaVA-v1.5-TinyLLaMA 1.1B on a Realme GT with a Snapdragon 888 has been provided here. If you need the MobileVLM log, or even a screen recording, I can provide it. The result on that GitHub page only reflects inference on that specific device and has nothing to do with the results in the paper. I think the performance depends on the interaction between the specific device and llama.cpp, which is beyond the scope of our paper. I do not know the reason for now.

A2 (About the vision encoding speed): As I mentioned before, there is a small shortcoming in our customized llama.cpp: the reported encoding speed (9.16 ms per image patch) should be divided by 4 (9.16 / 4 = 2.29 ms per image patch). The original LLaVA projector is two MLPs, which convert the 576 image patches into 576 visual prompts and feed them into the following LLM as part of the input tokens. LDP adds extra blocks after the two MLPs to convert the original 576 image patches into 144 visual tokens, 1/4 of the original projector's output, which means the following LLM only needs to process 1/4 of the visual tokens and gets better performance in both inference speed and conversation quality, since the LLM accounts for most of the time consumption. The conclusion is that LDP is slower than the 2x MLP, but the difference is insignificant compared with the benefit. Note that encode_image_with_clip is the total time for CLIP to encode the input image into 576 patches plus the time for the projector to convert those patches; it is not the projector time alone. You can see that LDP adds only 0.16 ms per patch over the 2x MLP (2.29 vs. 2.13), which means LDP is a very efficient structure.
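To make the arithmetic above concrete, here is a minimal sketch using the numbers quoted in this thread; the per-token prompt-eval cost at the end is a placeholder assumption (it varies by device), not a measurement.

```python
# Back-of-the-envelope sketch of the point above, using the timings quoted in this thread.

N_PATCHES = 576          # CLIP ViT-L/14-336 output patches
N_TOKENS_LDP = 144       # visual tokens after LDP (4x reduction)
N_TOKENS_MLP = 576       # visual tokens after the plain 2x MLP projector

# llama.cpp divides the total encode time by the number of *output* tokens,
# so for LDP the printed per-patch time is 4x too large.
ldp_total_ms = 1319.52
mlp_total_ms = 1225.62
print(ldp_total_ms / N_PATCHES)   # ~2.29 ms per patch actually processed
print(mlp_total_ms / N_PATCHES)   # ~2.13 ms per patch

# The real win is downstream: the LLM sees 144 visual tokens instead of 576.
# Placeholder per-token prompt-eval cost (device dependent, purely illustrative).
ms_per_prompt_token = 65.0
print((N_TOKENS_MLP - N_TOKENS_LDP) * ms_per_prompt_token)  # ~28 s saved per image
```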

A3 (About the parameter scale of LDP): The parameter scale of LDP is 20M for sure. The size of the llava.projector file converted by llama.cpp is not the same concept as the parameter scale.
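For a rough sense of why 20M parameters can still become a file tens of megabytes in size, here is a small sketch assuming the projector tensors are stored unquantized in f16 or f32; the actual llava.projector file also depends on which tensors and metadata end up in it, so its size will not match these numbers exactly.

```python
# Rough illustration of parameter count vs. on-disk size.
# Assumes unquantized storage; real files include extra tensors and metadata.

params = 20_000_000                 # ~20M LDP parameters reported in the paper
print(f"f16: {params * 2 / 1024**2:.1f} MiB")   # ~38 MiB
print(f"f32: {params * 4 / 1024**2:.1f} MiB")   # ~76 MiB
```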

YangYang-DLUT commented 4 months ago

The inference log of MobileVLM v2 3B with llama.cpp on Realme GT with Snapdragon 888:

~/mobile_v2_infer $ ./llama.cpp/llava-cli -m mobilvlm_v2_3b/ggml-model-mvlmv2_3b-q4_k.gguf --mmproj mobilvlm_v2_3b/mmproj-model-f16.gguf --image ../Boston-Terrier.jpg
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    379
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 379 tensors from mobilvlm_v2_3b/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = peg
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  236 tensors
clip_model_load: - type  f16:  143 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     573.07 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: params backend buffer size =  573.07 MB (379 tensors)
clip_model_load: compute allocated memory: 36.18 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from mobilvlm_v2_3b/ggml-model-mvlmv2_3b-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = ..
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2560
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 6912
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 80
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 14
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  217 tensors
llama_model_loader: - type q5_K:    8 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 80
llm_load_print_meta: n_embd_head_k    = 80
llm_load_print_meta: n_embd_head_v    = 80
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2560
llm_load_print_meta: n_embd_v_gqa     = 2560
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 6912
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 2.70 B
llm_load_print_meta: model size       = 1.45 GiB (4.60 BPW)
llm_load_print_meta: general.name     = ..
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1481.48 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:        CPU compute buffer size =   157.30 MiB
llama_new_context_with_model: graph splits (measure): 1

encode_image_with_clip: image encoded in  4493.99 ms by CLIP (   31.21 ms per image patch)

 In the image, a black and white Boston Terrier dog is the main subject. The dog's head is resting on its paws, giving an impression of relaxation or perhaps contentment. It's wearing a yellow bandana adorned with green polka dots, adding a touch of color to its appearance.

The dog is comfortably nestled in a red bone-shaped pet bed, which contrasts with the white wooden chair it's sitting on. The chair is positioned next to a pile of blue and yellow stuffed animals, suggesting a playful or festive atmosphere. The background is blurred, drawing focus to the dog and its immediate surroundings. There's no text present in the image.

The relative positions of the objects are such that the dog is in front of the pet bed on the chair, which is next to a pile of stuffed animals. The overall scene suggests a cozy and comfortable environment for the Boston Terrier.

Please note that this description is based on the visible elements in the image and does not include any speculative or imaginary content.

Image Details: The dog appears to be the only living creature in the image

llama_print_timings:        load time =    9146.31 ms
llama_print_timings:      sample time =      61.20 ms /   256 runs   (    0.24 ms per token,  4182.94 tokens per second)
llama_print_timings: prompt eval time =   12064.78 ms /   184 tokens (   65.57 ms per token,    15.25 tokens per second)
llama_print_timings:        eval time =   18764.22 ms /   256 runs   (   73.30 ms per token,    13.64 tokens per second)
llama_print_timings:       total time =   36539.63 ms /   440 tokens

encode_image_with_clip: image encoded in 4493.99 ms by CLIP ( 31.21 ms per image patch) — the 4493.99 ms is the time for CLIP to encode the image into 576 image patches plus the time for the projector to convert those 576 patches into 144 visual tokens. The speed, however, is calculated as 4493.99 / 144 = 31.21 ms, which should be divided by 4, i.e. 31.21 / 4 = 7.80 ms per image patch, since the visual encoder actually processed 576 image patches.
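The same correction, expressed as a quick sketch with the figures from the log above:

```python
total_ms = 4493.99        # encode_image_with_clip total time from the log
printed  = total_ms / 144 # llama.cpp divides by the 144 LDP output tokens -> ~31.21 ms
actual   = total_ms / 576 # the encoder really processed 576 patches       -> ~7.80 ms
print(printed, actual)
```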

er-muyue commented 4 months ago

Hi, we are closing this issue due to inactivity. We hope your question has been resolved. If you have any further concerns, please feel free to re-open it or open a new issue. Thanks!