JennieGao-njust closed this issue 6 months ago.
While measuring the latency of LLaVA-v1.5-TinyLLaMA-1.1B, all llama.cpp settings were left at their defaults, meaning no additional performance-affecting arguments were set and no specific optimization was applied. The device is a Realme GT with 8 GB RAM and a Snapdragon 888 chip. I first installed termux on the phone and then deployed the inference as described here. You could set -t to adjust the thread count, but we did not set it.
Maybe you can provide more information, but first, try our MobileVLM with its deployment on Android devices! https://github.com/Meituan-AutoML/MobileVLM#-deployment-on-mobile-devices-
~/mobile_llama/llama.cpp $ ./llava-cli -m ~/ggml-model-tinyllama-1b-q4_k.gguf --mmproj ~/mmproj-model-tinyllama-1b-f16.gguf --image ~/Boston-Terrier.jpg -p "what is in the picture?"
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 377 tensors from /data/data/com.termux/files/home/mmproj-model-tinyllama-1b-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.vision.image_size u32 = 336
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 14: clip.vision.block_count u32 = 23
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 17: clip.use_gelu bool = false
clip_model_load: - type f32: 235 tensors
clip_model_load: - type f16: 142 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 567.52 MB
clip_model_load: metadata size: 0.14 MB
clip_model_load: params backend buffer size = 567.52 MB (377 tensors)
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 21 key-value pairs and 201 tensors from /data/data/com.termux/files/home/ggml-model-tinyllama-1b-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llava_models
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 14
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q4_K: 147 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 612.28 MiB (4.67 BPW)
llm_load_print_meta: general.name = llava_models
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 '
encode_image_with_clip: image encoded in 4453.33 ms by CLIP ( 7.73 ms per image patch)
A dog is wearing a hat and a scarf.
llama_print_timings: load time = 5595.88 ms
llama_print_timings: sample time = 3.22 ms / 14 runs ( 0.23 ms per token, 4349.18 tokens per second)
llama_print_timings: prompt eval time = 14747.20 ms / 622 tokens ( 23.71 ms per token, 42.18 tokens per second)
llama_print_timings: eval time = 583.14 ms / 14 runs ( 41.65 ms per token, 24.01 tokens per second)
llama_print_timings: total time = 20149.02 ms
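As a quick sanity check when comparing runs, the rates printed by llama_print_timings can be reproduced from the raw times and counts in the block above (this is just arithmetic on the logged numbers, not anything from the llama.cpp source):

```python
# Reproduce the printed tokens-per-second figures from the raw log values above.
prompt_ms, prompt_tokens = 14747.20, 622   # prompt eval time / token count
eval_ms, eval_runs = 583.14, 14            # eval time / generated tokens

prompt_tps = prompt_tokens / prompt_ms * 1000  # prefill throughput
eval_tps = eval_runs / eval_ms * 1000          # generation throughput

print(round(prompt_tps, 2))  # 42.18, matching the log
print(round(eval_tps, 2))    # 24.01, matching the log
```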
Which phone are you using? Does it need to be rooted?
Samsung S22 Ultra, no root needed; I ran it with termux.
This is my timing log from running MobileVLM on a Snapdragon 8 Gen 1; what still differs a lot is CLIP's average per-patch time. I'm quite confused here: is this caused by differences in device performance?
I ran it in termux with this command:
./llava-cli \
-m ./models/ggml-model-q4_k.gguf \
--mmproj ./models/mmproj-model-f16.gguf \
--image /data/local/tmp/cat.jpeg \
-p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 397
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 397 tensors from models/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = ldp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 247 tensors
clip_model_load: - type f16: 150 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 591.67 MB
clip_model_load: metadata size: 0.15 MB
clip_model_load: params backend buffer size = 591.67 MB (397 tensors)
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 219 tensors from models/ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = VLM
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 24
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 16
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 16
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 49 tensors
llama_model_loader: - type q4_K: 162 tensors
llama_model_loader: - type q5_K: 7 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 1.36 B
llm_load_print_meta: model size = 754.43 MiB (4.64 BPW)
llm_load_print_meta: general.name = VLM
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 '
encode_image_with_clip: image encoded in 22244.86 ms by CLIP ( 154.48 ms per image patch)
The image features a woman wearing glasses and holding an open hat in her hand. She is standing next to the camera, with flowers visible behind them both as background elements on top of red wallpaper or magazine pages covering part 2/4ths off from either side (from left).
llama_print_timings: load time = 24942.07 ms
llama_print_timings: sample time = 15.84 ms / 61 runs ( 0.26 ms per token, 3850.04 tokens per second)
llama_print_timings: prompt eval time = 11478.69 ms / 232 tokens ( 49.48 ms per token, 20.21 tokens per second)
llama_print_timings: eval time = 3691.69 ms / 61 runs ( 60.52 ms per token, 16.52 tokens per second)
llama_print_timings: total time = 38495.57 ms / 293 tokens
~/llama.cpp-MobileVLM $ vim run.sh
~/llama.cpp-MobileVLM $ sh run.sh
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 397
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 397 tensors from models/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = ldp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 247 tensors
clip_model_load: - type f16: 150 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 591.67 MB
clip_model_load: metadata size: 0.15 MB
clip_model_load: params backend buffer size = 591.67 MB (397 tensors)
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 219 tensors from models/ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = VLM
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 24
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 16
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 16
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 49 tensors
llama_model_loader: - type q4_K: 162 tensors
llama_model_loader: - type q5_K: 7 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 1.36 B
llm_load_print_meta: model size = 754.43 MiB (4.64 BPW)
llm_load_print_meta: general.name = VLM
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 '
encode_image_with_clip: image encoded in 20666.23 ms by CLIP ( 143.52 ms per image patch)
The image features a woman with long hair, dressed in an elegant hat and holding up her head. She is posing for the camera while smiling or wearing sunglasses over one eye which adds to this picture's overall style of vintage charm.
The latency of CLIP here should be divided by 4, due to a limitation of the customized llama.cpp; that makes it almost equal to your previous result with the official llama.cpp. I was wondering whether the time was consumed by loading the CLIP weights or by the encoding process itself. Maybe you could add some timers in the code to check the specific time cost. It is weird; maybe there is some hardware limitation...
After verification, "encode_image_with_clip: image encoded in 20666.23 ms by CLIP ( 143.52 ms per image patch)" is indeed MobileVLM's image-encode time on the Snapdragon 8 Gen 1, and it is consistent with https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/MobileVLM-README.md#some-result-on-android-with-snapdragon-888-chip; but it does not match the 6-8 ms/patch in the paper. What causes the difference between these two?
Also, for easier testing, I compared the CLIP stage timing in the same Linux environment. With MobileVLM: encode_image_with_clip: image encoded in 1319.52 ms by CLIP ( 9.16 ms per image patch). With TinyLLaMA: encode_image_with_clip: image encoded in 1225.62 ms by CLIP ( 2.13 ms per image patch). Although the LDP structure reduces the number of tokens, the average time to process each patch became longer, and the total time is still the same as before adopting it. I'm still confused about this and look forward to your reply.
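One possible explanation for the comparison above (this is an assumption about how the number is computed, not verified against the llama.cpp source): if the "ms per image patch" figure is the total encode time divided by the number of output embedding tokens, then LDP's 4x token reduction (576 down to 144) inflates the per-"patch" figure by 4x even though the total ViT work is similar. The logged numbers are consistent with that reading:

```python
# Hypothesis: "ms per image patch" = total_ms / n_output_tokens.
# LDP compresses the 576 ViT patch tokens into 144 output tokens (4x fewer),
# so the per-"patch" figure is 4x larger even at a similar total time.
vit_patches = (336 // 14) ** 2   # 576 patches for a 336x336 image, 14x14 patch size
ldp_tokens = vit_patches // 4    # 144 tokens after the LDP projector

mobilevlm_ms, tinyllama_ms = 1319.52, 1225.62  # totals from the logs above
print(round(mobilevlm_ms / ldp_tokens, 2))   # 9.16, matches the MobileVLM log
print(round(tinyllama_ms / vit_patches, 2))  # 2.13, matches the TinyLLaMA log
```

Under this reading, the near-equal totals (1319.52 ms vs 1225.62 ms) are the meaningful comparison, and the per-patch gap is an artifact of the divisor.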
Could it be related to the size of the converted LDP? According to the paper, the total LDP parameter count is 20M, but the llava.projector I converted following the instructions is 73M. Should the actual llava.projector be 20M?
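A rough size check on the question above (assuming, which is not confirmed here, that the llava.projector file stores fp32 weights): ~20M parameters at 4 bytes each is about 76 MiB, so a ~73M file is roughly consistent with 20M parameters rather than contradicting it:

```python
# fp32 assumption: 4 bytes per weight.
params = 20e6
size_mib = params * 4 / (1024 ** 2)    # file size implied by 20M fp32 params
print(round(size_mib, 1))              # 76.3 MiB

# Conversely: parameters implied by a 73 MiB fp32 file.
params_from_file = 73 * 1024 ** 2 / 4
print(round(params_from_file / 1e6, 1))  # 19.1 (million), close to the paper's 20M
```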
This issue has been answered in another issue, #23; please refer to it for more details.
@JennieGao-njust I ran MobileVLM:
8gen3 encode_image_with_clip: image embedding created: 144 tokens encode_image_with_clip: image encoded in 7298.78 ms by CLIP ( 50.69 ms per image patch)
888 encode_image_with_clip: image embedding created: 144 tokens encode_image_with_clip: image encoded in 9322.85 ms by CLIP ( 64.74 ms per image patch)
Regarding "9322.85 ms by CLIP ( 64.74 ms per image patch)": back then I still never figured out where my discrepancy came from, and with TinyLLaMA, although there are more tokens, the average per-patch time is faster. I'm no longer working on this.
The timing log below is from running on the Snapdragon 8 Gen 1 after replacing the LLM part with TinyLLaMA, while the other parts keep the LLaVA structure: encode_image_with_clip: image encoded in 18963.98 ms by CLIP ( 32.92 ms per image patch)
The image on your post is a woman in a hat and a blue frill with feathers.
llama_print_timings: load time = 19968.38 ms
llama_print_timings: sample time = 4.61 ms / 21 runs ( 0.22 ms per token, 4554.33 tokens per second)
llama_print_timings: prompt eval time = 29440.51 ms / 616 tokens ( 47.79 ms per token, 20.92 tokens per second)
llama_print_timings: eval time = 1051.03 ms / 21 runs ( 50.05 ms per token, 19.98 tokens per second)
llama_print_timings: total time = 49528.13 ms
In actual use, since the ViT part produces at least 576 = (336/14)^2 tokens, the wait on the first run feels far too long; even the 1.4B model in MobileVLM at 34.93 token/s would need about 17 s to process 576 tokens. Is there room for further improvement here?
Also, the vision processing in the MobileVLM paper is 6~8 ms/patch, while my llama.cpp encode is 32.92 ms/patch. Was backend parallel optimization applied there? Any guidance would be appreciated.
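The first-run wait mentioned above can be estimated directly from the token count and the quoted prefill rate (a back-of-the-envelope sketch using only the numbers already in this thread):

```python
# Estimated prefill wait for the full ViT token sequence, no LDP compression.
vit_tokens = (336 // 14) ** 2   # 576 image tokens for a 336x336 input
prefill_tps = 34.93             # tokens/s quoted above for MobileVLM 1.4B

wait_s = vit_tokens / prefill_tps
print(round(wait_s, 1))  # 16.5 s, i.e. the roughly 17 s first-response wait
```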