Meituan-AutoML / MobileVLM

Strong and Open Vision Language Assistant for Mobile Devices
Apache License 2.0

MobileVLM_V2-3B doesn't work on llama.cpp? #39

Open weizhou1991 opened 6 months ago

weizhou1991 commented 6 months ago

Hi, thank you so much for your wonderful work. I have a question I'd like to check: I was trying to compile and run this on my x86 machine with the code from this repo, but I got the errors shown below:

./test.sh
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    379
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 379 tensors from MobileVLM_V2-3B/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0: general.architecture str = clip
clip_model_load: - kv   1: clip.has_text_encoder bool = false
clip_model_load: - kv   2: clip.has_vision_encoder bool = true
clip_model_load: - kv   3: clip.has_llava_projector bool = true
clip_model_load: - kv   4: general.file_type u32 = 1
clip_model_load: - kv   5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6: general.description str = image encoder for LLaVA
clip_model_load: - kv   7: clip.projector_type str = peg
clip_model_load: - kv   8: clip.vision.image_size u32 = 336
clip_model_load: - kv   9: clip.vision.patch_size u32 = 14
clip_model_load: - kv  10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv  11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv  12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv  13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv  14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv  15: clip.vision.block_count u32 = 23
clip_model_load: - kv  16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18: clip.use_gelu bool = false
clip_model_load: - type  f32: 236 tensors
clip_model_load: - type  f16: 143 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder:    0
clip_model_load: vision_encoder:  1
clip_model_load: llava_projector: 1
clip_model_load: model size:     573.03 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: params backend buffer size = 573.03 MB (379 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
terminate called after throwing an instance of 'std::runtime_error'
  what():  get_tensor: unable to find tensor mm.mlp.0.weight
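For anyone debugging this, the tensors that actually ended up in the mmproj file can be listed with the gguf-dump helper that ships with llama.cpp. A minimal sketch, assuming gguf-py/scripts/gguf-dump.py is present in the checkout and the gguf Python package is installed (the script name and location may differ between revisions):

```sh
# Dump the key/value metadata and tensor names stored in the projector GGUF,
# to confirm which mm.* tensors the converter actually produced.
python ./gguf-py/scripts/gguf-dump.py MobileVLM_V2-3B/mmproj-model-f16.gguf
```

If the projector weights show up under a prefix other than mm.mlp.*, the converter and clip.cpp disagree about which projector type the file contains, which is what the get_tensor failure above suggests.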

It seems like some weights are missing. I think it is associated with this command: "python ./examples/llava/convert-image-encoder-to-gguf \ -m path/to/clip-vit-large-patch14-336 \ --llava-projector path/to/MobileVLM-1.7B/llava.projector \ --output-dir path/to/MobileVLM-1.7B \ --projector-type ldp". I changed the projector type to peg, switched the paths to the V2-3B folder, and used the repo you posted in another issue.
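Concretely, the adapted command I ran looks roughly like this (path/to/... are placeholders; only the projector type and the model folder differ from the 1.7B example):

```sh
# Conversion step adapted for MobileVLM_V2-3B: same script as in the 1.7B
# instructions, with --projector-type switched from ldp to peg and the
# paths pointed at the V2-3B folder.
python ./examples/llava/convert-image-encoder-to-gguf \
    -m path/to/clip-vit-large-patch14-336 \
    --llava-projector path/to/MobileVLM_V2-3B/llava.projector \
    --output-dir path/to/MobileVLM_V2-3B \
    --projector-type peg
```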

Could you tell me what I can do to fix this?

Best,

sunzhe09 commented 5 months ago

I met the same problem; my model is MobileVLM_V2-1.7B.