ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: MiniCPM-V-2.6 commit d565bb2fd5a2a58b9924a7a34e77a87c78c52137 causing crash in moondream #9066

Closed: saket424 closed this issue 2 months ago

saket424 commented 2 months ago

What happened?

export LLAMA_CUDA=1   # only for NVIDIA CUDA
export CUDA_DOCKER_ARCH=compute_86
make -j$(nproc) NVCC=/usr/local/cuda/bin/nvcc

./llama-llava-cli -m ./m2/moondream2-text-model-f16.gguf --mmproj ./m2/moondream2-mmproj-f16.gguf --image ./assets/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048

Running the command above results in a core dump.

Before this commit there was no crash.

Since MiniCPM-V-2.6 has a completely separate CLI, I did not expect it to affect llama-llava-cli, which moondream uses.

The crash is only observed on Linux with CUDA, not on the Mac.

Name and Version

Crash with version 3598.

No crash with version 3597:

./llama-cli --version
version: 3597 (ee2984bd) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

anand@nitro17:~/moondream-stuff/llama.cpp$ ./llama-llava-cli -m ./m2/moondream2-text-model-f16.gguf --mmproj ./m2/moondream2-mmproj-f16.gguf  --image ./assets/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048
Log start
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from ./m2/moondream2-text-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = moondream2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = moondream2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =  2706.27 MiB
................................................................................
clip_model_load: model name:   vikhyatk/moondream2
clip_model_load: description:  image encoder for vikhyatk/moondream2
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    457
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from ./m2/moondream2-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  172 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     867.61 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 50.10 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.20 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 921
llama_new_context_with_model: graph splits = 294
encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in   167.45 ms by CLIP (    0.23 ms per image patch)

 The image shows a computer server rack with multiple computer boards and components on it. The rack is placed on a carpeted floor, and there is a chair nearby. The computer boards are connected to the rack using wires, and the rack is positioned in a room with a brick wall in the background.

llama_print_timings:        load time =    1776.52 ms
llama_print_timings:      sample time =       1.56 ms /    61 runs   (    0.03 ms per token, 39203.08 tokens per second)
llama_print_timings: prompt eval time =     963.07 ms /   770 tokens (    1.25 ms per token,   799.52 tokens per second)
llama_print_timings:        eval time =    3473.04 ms /    60 runs   (   57.88 ms per token,    17.28 tokens per second)
llama_print_timings:       total time =    5310.63 ms /   830 tokens
anand@nitro17:~/moondream-stuff/llama.cpp$

anand@nitro17:~/moondream-stuff/llama.cpp$ ./llama-llava-cli -m ./m2/moondream2-text-model-f16.gguf --mmproj ./m2/moondream2-mmproj-f16.gguf  --image ./assets/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048
Log start
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from ./m2/moondream2-text-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = moondream2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = moondream2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =  2706.27 MiB
................................................................................
clip_model_load: model name:   vikhyatk/moondream2
clip_model_load: description:  image encoder for vikhyatk/moondream2
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    457
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from ./m2/moondream2-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  172 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     867.61 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 50.10 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.20 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 921
llama_new_context_with_model: graph splits = 294
Segmentation fault (core dumped)

arch-btw commented 2 months ago

version: 3597 (https://github.com/ggerganov/llama.cpp/commit/ee2984bdaf10c14d440ad873a049bcc09b786d9b)

I think that was the commit before MiniCPM-V-2.6 got merged. So it might be something else.

saket424 commented 2 months ago

Version 3597 works and version 3598 bombs. I narrowed it down; it should be easy enough for someone to reproduce this.

LostRuins commented 2 months ago

Can confirm it's broken for llava. It works only intermittently, probably due to some out-of-bounds memory access.

fairydreaming commented 2 months ago

It crashes on this assert in the GGML get-rows operation:

(gdb) 
#7  0x0000555555632eb2 in ggml_compute_forward_get_rows_f32 (params=0x7ffedd080ce0, dst=0x555555ce80b0)
    at ggml/src/ggml.c:13345
13345           assert(i01 >= 0 && i01 < ne01);
(gdb) print i01
$1 = 729
(gdb) print ne01
$2 = 729

The direct cause is that the index passed to the get-rows operation is outside the valid range. I noticed that dst->src[1] is named patches, so I think it's the tensor created here:

https://github.com/ggerganov/llama.cpp/blob/554b049068de24201d19dde2fa83e35389d4585d/examples/llava/clip.cpp#L2418-L2426

Note the i + 1 in this loop. I made the following change:

diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 342042ff..224db9b5 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -2419,7 +2419,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
             struct ggml_tensor * patches = ggml_graph_get_tensor(gf, "patches");
             int* patches_data = (int*)malloc(ggml_nbytes(patches));
             for (int i = 0; i < num_patches; i++) {
-                patches_data[i] = i + 1;
+                patches_data[i] = i;
             }
             ggml_backend_tensor_set(patches, patches_data, 0, ggml_nbytes(patches));
             free(patches_data);

And it no longer crashes:

(base) phm@epyc:~/projects/llama.cpp$ ./llama-llava-cli --numa distribute -t 32 -m /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-text-model-f16.gguf --mmproj /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf --image ~/Downloads/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048
Log start
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-text-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = moondream2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = moondream2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  2706.27 MiB
................................................................................
clip_model_load: model name:   vikhyatk/moondream2
clip_model_load: description:  image encoder for vikhyatk/moondream2
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    457
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  172 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     867.61 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 50.10 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.20 MiB
llama_new_context_with_model:        CPU compute buffer size =   160.01 MiB
llama_new_context_with_model: graph nodes  = 921
llama_new_context_with_model: graph splits = 1
encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in   422.27 ms by CLIP (    0.58 ms per image patch)

 The image shows a computer cooling rack with several computer parts on it. The rack is placed on a carpeted floor, and there is a couch in the background. The computer parts include a large black computer tower, multiple computer fans, and various other components. The rack is filled with these parts, indicating that it is likely being used for assembling or disassembling computer systems.

llama_print_timings:        load time =    4348.98 ms
llama_print_timings:      sample time =       2.08 ms /    77 runs   (    0.03 ms per token, 36930.46 tokens per second)
llama_print_timings: prompt eval time =    3340.20 ms /   770 tokens (    4.34 ms per token,   230.53 tokens per second)
llama_print_timings:        eval time =    1033.80 ms /    76 runs   (   13.60 ms per token,    73.51 tokens per second)
llama_print_timings:       total time =    5400.74 ms /   846 tokens

I guess the question remains why it worked before and now it doesn't? I have no idea yet :/

LostRuins commented 2 months ago

That does seem to fix it, although I can't be sure. At first glance, llava 1.5 no longer crashes.

The crash was very inconsistent, probably because sometimes this off-by-one access didn't actually land in out-of-bounds memory (maybe due to padding?).

It was extra weird because adding a simple printf before calls to clip_is_minicpmv would prevent it from crashing as well. I suspect this issue was already present for quite some time.

LostRuins commented 2 months ago

Edit: nope, I think this does not solve the issue. I am still getting intermittent segfaults.

fairydreaming commented 2 months ago

@LostRuins I just tried release builds; in my case only the debug builds (LLAMA_DEBUG=1) crash on this assert, while release builds work without problems. So this may be an entirely unrelated problem after all, since I can't reproduce the crashes in release builds.

LostRuins commented 2 months ago

I'm not hitting any assert; I am getting a segmentation fault:

exception: access violation reading 0x0000657669736E65

Adding the above-mentioned print statements before every call to clip_is_minicpmv (temporarily) resolves this, but that's not a proper solution; there's definitely still some out-of-bounds access going on.

fairydreaming commented 2 months ago

It finally crashed. I guess the important part is LLAMA_CUDA=1.

slaren commented 2 months ago

> I guess the question remains why it worked before and now it doesn't? I have no idea yet :/

This assert was added fairly recently (in #6210), so previously this wouldn't be noticed even in debug builds. It would cause wrong data to be returned, but since more tensors are allocated in the same buffer, it is not likely to cause a crash with an invalid access. It looks like a logic error in the clip implementation, and it may have affected the quality of the generation.
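
As a toy illustration of why such an out-of-range row index tends to return garbage rather than fault: when several tensors share one backend buffer, reading "one row past the end" of the embeddings tensor usually lands in a neighbouring tensor's data. The sketch below is a made-up, self-contained example, not llama.cpp/ggml code; all names are hypothetical.

#include <cstdio>
#include <vector>

// Toy stand-in for a row-major f32 tensor view into a shared buffer.
struct toy_tensor {
    float * data;   // points into a larger backend buffer
    int     ne0;    // row length
    int     ne1;    // number of rows
};

// Simplified "get rows" without a bounds check, like ggml before the assert was added.
static float get_row_first_elem(const toy_tensor & t, int row) {
    return t.data[(long) row * t.ne0];  // row == t.ne1 silently reads past the tensor
}

int main() {
    // One buffer holding two tensors back to back, as a backend buffer might.
    std::vector<float> buffer(4*3 + 4*2);
    toy_tensor embeddings = { buffer.data(),      4, 3 };  // valid rows: 0..2
    toy_tensor neighbour  = { buffer.data() + 12, 4, 2 };
    for (int i = 0;  i < 12; i++) buffer[i] = 1.0f;  // embeddings content
    for (int i = 12; i < 20; i++) buffer[i] = 9.0f;  // neighbour content

    // A row index equal to ne1 (like i01 == ne01 == 729 above) does not fault:
    // it just returns the neighbour's data, i.e. silently wrong results.
    printf("row 2: %f\n", get_row_first_elem(embeddings, 2));  // 1.0, correct
    printf("row 3: %f\n", get_row_first_elem(embeddings, 3));  // 9.0, garbage
    return 0;
}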

fairydreaming commented 2 months ago

Try this:

diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 342042ff..8ce4add1 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -1108,7 +1108,7 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
         }
     }

-    clip_ctx * new_clip = new clip_ctx;
+    clip_ctx * new_clip = new clip_ctx{};

     // update projector type
     {

I noticed that the default constructor of clip_ctx didn't initialize the fields, so they were basically all filled with garbage:

Thread 1 "llama-llava-cli" hit Breakpoint 6, clip_model_load (fname=0x55556334d210 "/mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf", verbosity=1) at examples/llava/clip.cpp:1111
1111        clip_ctx * new_clip = new clip_ctx;
(gdb) n
1115            int idx = gguf_find_key(ctx, KEY_PROJ_TYPE);
(gdb) print *new_clip
$10 = {has_text_encoder = false, has_vision_encoder = false, has_llava_projector = false, has_minicpmv_projector = false, 
  minicpmv_version = 2, vision_model = {hparams = {image_size = -283352320, patch_size = 32767, hidden_size = 1664890336, 
      n_intermediate = 21845, projection_dim = 1664890336, n_head = 21845, n_layer = 779485184, eps = 5.3529077e-11, 
      mm_patch_merge_type = "flat", '\000' <repeats 27 times>, image_grid_pinpoints = {0, 1, 854052736, 0, 20, 0, 1818373750, 
        909258347, 1953784110, 779509614, 1935763810, 1, 1152, 0, 0, 856706944, 0, 24, 0, 1818373750, 909258347, 1953784110, 
        1970233198, 1702309492, 1952999273, 2, 1152, 0, 1152, 0, 1, 856711552}, image_crop_resolution = 0}, 
    class_embedding = 0x6c622e7600000000, patch_embeddings = 0x7474612e36322e6b, patch_bias = 0x69622e74756f5f6e, 
    position_embeddings = 0x480000000017361, pre_ln_w = 0x0, pre_ln_b = 0x3338e1800000, 
    layers = std::vector of length 0, capacity 0, post_ln_w = 0x17468676965, post_ln_b = 0x48000, 
    projection = 0x38f3800000000000, mm_0_w = 0x0, mm_0_b = 0x0, mm_2_w = 0x0, mm_2_b = 0x0, image_newline = 0x0, 
    mm_1_w = 0x0, mm_1_b = 0x0, mm_3_w = 0x0, mm_3_b = 0x0, mm_4_w = 0x0, mm_4_b = 0x0, mm_model_mlp_1_w = 0x4800000, 
    mm_model_mlp_1_b = 0x10d00000, mm_model_mlp_3_w = 0x1780000000010000, mm_model_mlp_3_b = 0x16000000003339, 
    mm_model_block_1_block_0_0_w = 0x2e76000000000000, mm_model_block_1_block_0_1_w = 0x662e36322e6b6c62, 
    mm_model_block_1_block_0_1_b = 0x2e6e776f645f6e66, mm_model_block_1_block_1_fc1_w = 0x173616962, 
    mm_model_block_1_block_1_fc1_b = 0x10d0, mm_model_block_1_block_1_fc2_w = 0x33d0678000000000, 
    mm_model_block_1_block_1_fc2_b = 0x1600000000, mm_model_block_1_block_2_0_w = 0x6c622e7600000000, 
    mm_model_block_1_block_2_1_w = 0x6e66662e36322e6b, mm_model_block_1_block_2_1_b = 0x676965772e70755f, 
    mm_model_block_2_block_0_0_w = 0x10d0000000027468, mm_model_block_2_block_0_1_w = 0x480000000000000, 
    mm_model_block_2_block_0_1_b = 0x1000000000000, mm_model_block_2_block_1_fc1_w = 0x33d0aac00000, 
    mm_model_block_2_block_1_fc1_b = 0x140000, mm_model_block_2_block_1_fc2_w = 0x2e6b6c622e760000, 
    mm_model_block_2_block_1_fc2_b = 0x755f6e66662e3632, mm_model_block_2_block_2_0_w = 0x1736169622e70, 
    mm_model_block_2_block_2_1_w = 0x4800000, mm_model_block_2_block_2_1_b = 0xfac0000000000000, 
    mm_model_mlp_0_w = 0x13000000003467, mm_model_mlp_0_b = 0x2e76000000000000, mm_model_mlp_2_w = 0x6c2e36322e6b6c62, 
    mm_model_mlp_2_b = 0x68676965772e326e, mm_model_peg_0_w = 0x4800000000174, mm_model_peg_0_b = 0x0, 
    mm_model_pos_embed_k = 0x34680cc000, mm_model_query = 0x1100, mm_model_proj = 0x322e6b6c622e7600, 
    mm_model_kv_proj = 0x69622e326e6c2e36, mm_model_attn_q_w = 0x480000000017361, mm_model_attn_q_b = 0x0, 
    mm_model_attn_k_w = 0x34681ec00000, mm_model_attn_k_b = 0x160000, mm_model_attn_v_w = 0x2e6b6c622e760000, 
    mm_model_attn_v_b = 0x5f6e7474612e3732, mm_model_attn_o_w = 0x7468676965772e71, mm_model_attn_o_b = 0x48000000002, 
    mm_model_ln_q_w = 0x48000000000, mm_model_ln_q_b = 0x100000000, mm_model_ln_kv_w = 0x346830c0, mm_model_ln_kv_b = 0x14, 
    mm_model_ln_post_w = 0x37322e6b6c622e76, mm_model_ln_post_b = 0x2e715f6e7474612e}, proj_type = PROJECTOR_TYPE_MLP, 
  image_mean = {1.40129846e-45, 1.61429583e-42, 0}, image_std = {0, 2.69506927e-07, 0}, use_gelu = false, ftype = 1, 
--Type <RET> for more, q to quit, c to continue without paging--
  has_class_embedding = true, has_pre_norm = true, has_post_norm = false, has_patch_bias = false, 
  ctx_gguf = 0x7474612e37322e6b, ctx_data = 0x676965772e6b5f6e, buf_compute_meta = std::vector of length 0, capacity 0, 
  params_buffer = 0x0, backend = 0x0, compute_alloc = 0x0, load_image_size = 0x5f6e7474612e3732}

After the change:

Thread 1 "llama-llava-cli" hit Breakpoint 6, clip_model_load (fname=0x55556334d210 "/mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf", verbosity=1) at examples/llava/clip.cpp:1111
1111        clip_ctx * new_clip = new clip_ctx{};
(gdb) n
1115            int idx = gguf_find_key(ctx, KEY_PROJ_TYPE);
(gdb) print *new_clip
$11 = {has_text_encoder = false, has_vision_encoder = false, has_llava_projector = false, has_minicpmv_projector = false, 
  minicpmv_version = 2, vision_model = {hparams = {image_size = 0, patch_size = 0, hidden_size = 0, n_intermediate = 0, 
      projection_dim = 0, n_head = 0, n_layer = 0, eps = 0, mm_patch_merge_type = "flat", '\000' <repeats 27 times>, 
      image_grid_pinpoints = {0 <repeats 32 times>}, image_crop_resolution = 0}, class_embedding = 0x0, 
    patch_embeddings = 0x0, patch_bias = 0x0, position_embeddings = 0x0, pre_ln_w = 0x0, pre_ln_b = 0x0, 
    layers = std::vector of length 0, capacity 0, post_ln_w = 0x0, post_ln_b = 0x0, projection = 0x0, mm_0_w = 0x0, 
    mm_0_b = 0x0, mm_2_w = 0x0, mm_2_b = 0x0, image_newline = 0x0, mm_1_w = 0x0, mm_1_b = 0x0, mm_3_w = 0x0, mm_3_b = 0x0, 
    mm_4_w = 0x0, mm_4_b = 0x0, mm_model_mlp_1_w = 0x0, mm_model_mlp_1_b = 0x0, mm_model_mlp_3_w = 0x0, 
    mm_model_mlp_3_b = 0x0, mm_model_block_1_block_0_0_w = 0x0, mm_model_block_1_block_0_1_w = 0x0, 
    mm_model_block_1_block_0_1_b = 0x0, mm_model_block_1_block_1_fc1_w = 0x0, mm_model_block_1_block_1_fc1_b = 0x0, 
    mm_model_block_1_block_1_fc2_w = 0x0, mm_model_block_1_block_1_fc2_b = 0x0, mm_model_block_1_block_2_0_w = 0x0, 
    mm_model_block_1_block_2_1_w = 0x0, mm_model_block_1_block_2_1_b = 0x0, mm_model_block_2_block_0_0_w = 0x0, 
    mm_model_block_2_block_0_1_w = 0x0, mm_model_block_2_block_0_1_b = 0x0, mm_model_block_2_block_1_fc1_w = 0x0, 
    mm_model_block_2_block_1_fc1_b = 0x0, mm_model_block_2_block_1_fc2_w = 0x0, mm_model_block_2_block_1_fc2_b = 0x0, 
    mm_model_block_2_block_2_0_w = 0x0, mm_model_block_2_block_2_1_w = 0x0, mm_model_block_2_block_2_1_b = 0x0, 
    mm_model_mlp_0_w = 0x0, mm_model_mlp_0_b = 0x0, mm_model_mlp_2_w = 0x0, mm_model_mlp_2_b = 0x0, mm_model_peg_0_w = 0x0, 
    mm_model_peg_0_b = 0x0, mm_model_pos_embed_k = 0x0, mm_model_query = 0x0, mm_model_proj = 0x0, mm_model_kv_proj = 0x0, 
    mm_model_attn_q_w = 0x0, mm_model_attn_q_b = 0x0, mm_model_attn_k_w = 0x0, mm_model_attn_k_b = 0x0, 
    mm_model_attn_v_w = 0x0, mm_model_attn_v_b = 0x0, mm_model_attn_o_w = 0x0, mm_model_attn_o_b = 0x0, 
    mm_model_ln_q_w = 0x0, mm_model_ln_q_b = 0x0, mm_model_ln_kv_w = 0x0, mm_model_ln_kv_b = 0x0, mm_model_ln_post_w = 0x0, 
    mm_model_ln_post_b = 0x0}, proj_type = PROJECTOR_TYPE_MLP, image_mean = {0, 0, 0}, image_std = {0, 0, 0}, 
  use_gelu = false, ftype = 1, has_class_embedding = true, has_pre_norm = true, has_post_norm = false, 
  has_patch_bias = false, ctx_gguf = 0x0, ctx_data = 0x0, buf_compute_meta = std::vector of length 0, capacity 0, 
  params_buffer = 0x0, backend = 0x0, compute_alloc = 0x0, load_image_size = 0x0}
saket424 commented 2 months ago

@fairydreaming That one line change fixed it

saket424 commented 2 months ago

@fairydreaming Not specifically related to your fix, I just noticed it is not offloading any layers to the GPU. Is this normal?

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =  2706.27 MiB

fairydreaming commented 2 months ago

@saket424 Yeah, I didn't use the -ngl option, so it didn't offload any layers.

fairydreaming commented 2 months ago

@monatis can you take a look at this code:

https://github.com/ggerganov/llama.cpp/blob/1b6ff90ff8301d9fe2027be2bb9fea26177d775e/examples/llava/clip.cpp#L2418-L2426

I think it's a rewritten form of your original llava 1.5 code:

        struct ggml_tensor * patches = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, num_patches);
        ggml_allocr_alloc(ctx->alloc, patches);
        if (!ggml_allocr_is_measure(ctx->alloc)) {
            for (int i = 0; i < num_patches; ++i) {
                ggml_set_i32_1d(patches, i, i+1);
            }
        }

Do you remember what the purpose of i + 1 is? Is it related to the vision feature select strategy? I found the following in the transformers library:

https://github.com/huggingface/transformers/blob/52cb4034ada381fe1ffe8d428a1076e5411a8026/src/transformers/models/llava/modeling_llava.py#L450-L456

(note selected_image_feature[:, 1:] when vision_feature_select_strategy is default)

Since i runs from 0 to num_patches - 1, i + 1 reaches num_patches on the last iteration, which is outside the valid range of the embeddings tensor dimension and causes the assertion failure in the GGML get-rows operation.
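
If the i + 1 was indeed meant to skip a leading class/CLS token (the same thing selected_image_feature[:, 1:] does in transformers), then the offset is only valid when the vision tower actually emits such a token; moondream's SigLIP-style encoder produces exactly num_patches rows (729 here), so the shift runs one row past the end. That is only my reading of the code. A sketch of the index computation under that assumption (hypothetical helper, not the real clip.cpp code):

#include <cassert>
#include <vector>

// Build the row indices for the get-rows lookup over the projected embeddings.
// n_rows is the number of rows actually present in the embeddings tensor.
std::vector<int> make_patch_indices(int num_patches, bool has_class_token) {
    // With a class token the tensor holds num_patches + 1 rows and rows
    // 1..num_patches are the patch features, so an offset of 1 is valid.
    // Without one (e.g. moondream), rows 0..num_patches-1 are all there is.
    const int offset = has_class_token ? 1 : 0;
    const int n_rows = num_patches + offset;

    std::vector<int> patches(num_patches);
    for (int i = 0; i < num_patches; i++) {
        patches[i] = i + offset;
        assert(patches[i] >= 0 && patches[i] < n_rows);  // mirrors the ggml get-rows assert
    }
    return patches;
}

int main() {
    make_patch_indices(729, /*has_class_token=*/false);  // indices 0..728, all valid
    make_patch_indices(576, /*has_class_token=*/true);   // indices 1..576 of 577 rows, all valid
    // Hard-coding "+ 1" while the tensor has only num_patches rows (the current
    // behaviour) would produce index 729 for a 729-row tensor and trip the assert.
    return 0;
}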