JennieGao-njust closed this issue 6 months ago.
While measuring the latency of LLaVA-v1.5-TinyLLaMA-1.1B, all llama.cpp settings were left at their defaults, meaning no additional performance-affecting arguments were set and no specific optimization was applied. The device is a Realme GT with 8 GB RAM and a Snapdragon 888 chip. I first installed termux on the phone and then deployed the inference as described here. You could set -t to adjust the thread count, but we did not set it.
Maybe you can provide more information, but first, try our MobileVLM with its deployment on Android devices! https://github.com/Meituan-AutoML/MobileVLM#-deployment-on-mobile-devices-
~/mobile_llama/llama.cpp $ ./llava-cli -m ~/ggml-model-tinyllama-1b-q4_k.gguf --mmproj ~/mmproj-model-tinyllama-1b-f16.gguf --image ~/Boston-Terrier.jpg -p "what is in the picture?"
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 377 tensors from /data/data/com.termux/files/home/mmproj-model-tinyllama-1b-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.vision.image_size u32 = 336
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 14: clip.vision.block_count u32 = 23
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 17: clip.use_gelu bool = false
clip_model_load: - type f32: 235 tensors
clip_model_load: - type f16: 142 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 567.52 MB
clip_model_load: metadata size: 0.14 MB
clip_model_load: params backend buffer size = 567.52 MB (377 tensors)
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 21 key-value pairs and 201 tensors from /data/data/com.termux/files/home/ggml-model-tinyllama-1b-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = llava_models
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 14
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q4_K: 147 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 612.28 MiB (4.67 BPW)
llm_load_print_meta: general.name = llava_models
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 '
encode_image_with_clip: image encoded in 4453.33 ms by CLIP ( 7.73 ms per image patch)
A dog is wearing a hat and a scarf.
llama_print_timings: load time = 5595.88 ms
llama_print_timings: sample time = 3.22 ms / 14 runs ( 0.23 ms per token, 4349.18 tokens per second)
llama_print_timings: prompt eval time = 14747.20 ms / 622 tokens ( 23.71 ms per token, 42.18 tokens per second)
llama_print_timings: eval time = 583.14 ms / 14 runs ( 41.65 ms per token, 24.01 tokens per second)
llama_print_timings: total time = 20149.02 ms
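As a quick sanity check when comparing runs, the rates printed by llama_print_timings can be reproduced from the raw times and counts in the block above (this is just arithmetic on the logged numbers, not anything from the llama.cpp source):

```python
# Reproduce the printed tokens-per-second figures from the raw log values above.
prompt_ms, prompt_tokens = 14747.20, 622   # prompt eval time / token count
eval_ms, eval_runs = 583.14, 14            # eval time / generated tokens

prompt_tps = prompt_tokens / prompt_ms * 1000  # prefill throughput
eval_tps = eval_runs / eval_ms * 1000          # generation throughput

print(round(prompt_tps, 2))  # 42.18, matching the log
print(round(eval_tps, 2))    # 24.01, matching the log
```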
Which phone are you using? Does it need to be rooted?
Samsung S22 Ultra, no root needed; I ran it with termux.
This is my timing log from running MobileVLM on a Snapdragon 8 Gen 1; what still differs a lot is CLIP's average per-patch time. I'm quite confused here: is this caused by differences in device performance?
I ran it in termux with this command:
./llava-cli \
-m ./models/ggml-model-q4_k.gguf \
--mmproj ./models/mmproj-model-f16.gguf \
--image /data/local/tmp/cat.jpeg \
-p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 397
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 397 tensors from models/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = ldp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 247 tensors
clip_model_load: - type f16: 150 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 591.67 MB
clip_model_load: metadata size: 0.15 MB
clip_model_load: params backend buffer size = 591.67 MB (397 tensors)
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 219 tensors from models/ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = VLM
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 24
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 16
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 16
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 49 tensors
llama_model_loader: - type q4_K: 162 tensors
llama_model_loader: - type q5_K: 7 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 1.36 B
llm_load_print_meta: model size = 754.43 MiB (4.64 BPW)
llm_load_print_meta: general.name = VLM
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 '
encode_image_with_clip: image encoded in 22244.86 ms by CLIP ( 154.48 ms per image patch)
The image features a woman wearing glasses and holding an open hat in her hand. She is standing next to the camera, with flowers visible behind them both as background elements on top of red wallpaper or magazine pages covering part 2/4ths off from either side (from left).
llama_print_timings: load time = 24942.07 ms
llama_print_timings: sample time = 15.84 ms / 61 runs ( 0.26 ms per token, 3850.04 tokens per second)
llama_print_timings: prompt eval time = 11478.69 ms / 232 tokens ( 49.48 ms per token, 20.21 tokens per second)
llama_print_timings: eval time = 3691.69 ms / 61 runs ( 60.52 ms per token, 16.52 tokens per second)
llama_print_timings: total time = 38495.57 ms / 293 tokens
~/llama.cpp-MobileVLM $ vim run.sh
~/llama.cpp-MobileVLM $ sh run.sh
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 397
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 397 tensors from models/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = ldp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 247 tensors
clip_model_load: - type f16: 150 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 591.67 MB
clip_model_load: metadata size: 0.15 MB
clip_model_load: params backend buffer size = 591.67 MB (397 tensors)
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 219 tensors from models/ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = VLM
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 24
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 16
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 16
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 49 tensors
llama_model_loader: - type q4_K: 162 tensors
llama_model_loader: - type q5_K: 7 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 1.36 B
llm_load_print_meta: model size = 754.43 MiB (4.64 BPW)
llm_load_print_meta: general.name = VLM
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 '
encode_image_with_clip: image encoded in 20666.23 ms by CLIP ( 143.52 ms per image patch)
The image features a woman with long hair, dressed in an elegant hat and holding up her head. She is posing for the camera while smiling or wearing sunglasses over one eye which adds to this picture's overall style of vintage charm.
The latency of CLIP here should be divided by 4, due to a limitation of the customized llama.cpp; that makes it almost equal to your previous result with the official llama.cpp. I was wondering whether the time was consumed by loading the CLIP weights or by the encoding process itself. Maybe you could add some timers in the code to check the specific time cost. It is weird; maybe there is some hardware limitation...
After verification, "encode_image_with_clip: image encoded in 20666.23 ms by CLIP ( 143.52 ms per image patch)" is indeed MobileVLM's image-encode time on the Snapdragon 8 Gen 1, and it is consistent with https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/MobileVLM-README.md#some-result-on-android-with-snapdragon-888-chip; but it does not match the 6-8 ms/patch in the paper. What causes the difference between these two?
Also, for easier testing, I compared the CLIP stage timing in the same Linux environment. With MobileVLM: encode_image_with_clip: image encoded in 1319.52 ms by CLIP ( 9.16 ms per image patch). With TinyLLaMA: encode_image_with_clip: image encoded in 1225.62 ms by CLIP ( 2.13 ms per image patch). Although the LDP structure reduces the number of tokens, the average time to process each patch became longer, and the total time is still the same as before adopting it. I'm still confused about this and look forward to your reply.
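One possible explanation for the comparison above (this is an assumption about how the number is computed, not verified against the llama.cpp source): if the "ms per image patch" figure is the total encode time divided by the number of output embedding tokens, then LDP's 4x token reduction (576 down to 144) inflates the per-"patch" figure by 4x even though the total ViT work is similar. The logged numbers are consistent with that reading:

```python
# Hypothesis: "ms per image patch" = total_ms / n_output_tokens.
# LDP compresses the 576 ViT patch tokens into 144 output tokens (4x fewer),
# so the per-"patch" figure is 4x larger even at a similar total time.
vit_patches = (336 // 14) ** 2   # 576 patches for a 336x336 image, 14x14 patch size
ldp_tokens = vit_patches // 4    # 144 tokens after the LDP projector

mobilevlm_ms, tinyllama_ms = 1319.52, 1225.62  # totals from the logs above
print(round(mobilevlm_ms / ldp_tokens, 2))   # 9.16, matches the MobileVLM log
print(round(tinyllama_ms / vit_patches, 2))  # 2.13, matches the TinyLLaMA log
```

Under this reading, the near-equal totals (1319.52 ms vs 1225.62 ms) are the meaningful comparison, and the per-patch gap is an artifact of the divisor.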
Could it be related to the size of the converted LDP? According to the paper, the total LDP parameter count is 20M, but the llava.projector I converted following the instructions is 73M. Should the actual llava.projector be 20M?
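A rough size check on the question above (assuming, which is not confirmed here, that the llava.projector file stores fp32 weights): ~20M parameters at 4 bytes each is about 76 MiB, so a ~73M file is roughly consistent with 20M parameters rather than contradicting it:

```python
# fp32 assumption: 4 bytes per weight.
params = 20e6
size_mib = params * 4 / (1024 ** 2)    # file size implied by 20M fp32 params
print(round(size_mib, 1))              # 76.3 MiB

# Conversely: parameters implied by a 73 MiB fp32 file.
params_from_file = 73 * 1024 ** 2 / 4
print(round(params_from_file / 1e6, 1))  # 19.1 (million), close to the paper's 20M
```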
This issue has been answered in another issue, #23; please refer to it for more details.
@JennieGao-njust I ran MobileVLM:
8gen3 encode_image_with_clip: image embedding created: 144 tokens encode_image_with_clip: image encoded in 7298.78 ms by CLIP ( 50.69 ms per image patch)
888 encode_image_with_clip: image embedding created: 144 tokens encode_image_with_clip: image encoded in 9322.85 ms by CLIP ( 64.74 ms per image patch)
Regarding "9322.85 ms by CLIP ( 64.74 ms per image patch)": back then I still never figured out where my discrepancy came from, and with TinyLLaMA, although there are more tokens, the average per-patch time is faster. I'm no longer working on this.
The timing log below is from running on the Snapdragon 8 Gen 1 after replacing the LLM part with TinyLLaMA, while the other parts keep the LLaVA structure: encode_image_with_clip: image encoded in 18963.98 ms by CLIP ( 32.92 ms per image patch)
The image on your post is a woman in a hat and a blue frill with feathers.
llama_print_timings: load time = 19968.38 ms
llama_print_timings: sample time = 4.61 ms / 21 runs ( 0.22 ms per token, 4554.33 tokens per second)
llama_print_timings: prompt eval time = 29440.51 ms / 616 tokens ( 47.79 ms per token, 20.92 tokens per second)
llama_print_timings: eval time = 1051.03 ms / 21 runs ( 50.05 ms per token, 19.98 tokens per second)
llama_print_timings: total time = 49528.13 ms
In actual use, since the ViT part produces at least 576 = (336/14)^2 tokens, the wait on the first run feels far too long; even the 1.4B model in MobileVLM at 34.93 token/s would need about 17 s to process 576 tokens. Is there room for further improvement here?
Also, the vision processing in the MobileVLM paper is 6~8 ms/patch, while my llama.cpp encode is 32.92 ms/patch. Was backend parallel optimization applied there? Any guidance would be appreciated.
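The first-run wait mentioned above can be estimated directly from the token count and the quoted prefill rate (a back-of-the-envelope sketch using only the numbers already in this thread):

```python
# Estimated prefill wait for the full ViT token sequence, no LDP compression.
vit_tokens = (336 // 14) ** 2   # 576 image tokens for a 336x336 input
prefill_tps = 34.93             # tokens/s quoted above for MobileVLM 1.4B

wait_s = vit_tokens / prefill_tps
print(round(wait_s, 1))  # 16.5 s, i.e. the roughly 17 s first-response wait
```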