Closed saket424 closed 2 months ago
version: 3597 (https://github.com/ggerganov/llama.cpp/commit/ee2984bdaf10c14d440ad873a049bcc09b786d9b)
I think that was the commit before MiniCPM-V-2.6 got merged. So it might be something else.
Version 3597 works and version 3598 crashes. I narrowed it down, so it should be easy enough for someone to reproduce this.
Can confirm it's broken for llava. It seems to work intermittently, probably some out of bounds memory access.
It crashes in this assert located in GGML get rows operation:
(gdb)
#7 0x0000555555632eb2 in ggml_compute_forward_get_rows_f32 (params=0x7ffedd080ce0, dst=0x555555ce80b0)
at ggml/src/ggml.c:13345
13345 assert(i01 >= 0 && i01 < ne01);
(gdb) print i01
$1 = 729
(gdb) print ne01
$2 = 729
The direct cause is that the index in the get rows operation is outside the valid range. I noticed that dst->src[1] is named patches, so I think it's the one created here:
Note the i + 1 in this loop. I made the following change:
diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 342042ff..224db9b5 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -2419,7 +2419,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
struct ggml_tensor * patches = ggml_graph_get_tensor(gf, "patches");
int* patches_data = (int*)malloc(ggml_nbytes(patches));
for (int i = 0; i < num_patches; i++) {
- patches_data[i] = i + 1;
+ patches_data[i] = i;
}
ggml_backend_tensor_set(patches, patches_data, 0, ggml_nbytes(patches));
free(patches_data);
And it no longer crashes:
(base) phm@epyc:~/projects/llama.cpp$ ./llama-llava-cli --numa distribute -t 32 -m /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-text-model-f16.gguf --mmproj /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf --image ~/Downloads/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048
Log start
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-text-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi2
llama_model_loader: - kv 1: general.name str = moondream2
llama_model_loader: - kv 2: phi2.context_length u32 = 2048
llama_model_loader: - kv 3: phi2.embedding_length u32 = 2048
llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 8192
llama_model_loader: - kv 5: phi2.block_count u32 = 24
llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - type f32: 147 tensors
llama_model_loader: - type f16: 98 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.42 B
llm_load_print_meta: model size = 2.64 GiB (16.01 BPW)
llm_load_print_meta: general.name = moondream2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: CPU buffer size = 2706.27 MiB
................................................................................
clip_model_load: model name: vikhyatk/moondream2
clip_model_load: description: image encoder for vikhyatk/moondream2
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 457
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = vikhyatk/moondream2
clip_model_load: - kv 6: general.description str = image encoder for vikhyatk/moondream2
clip_model_load: - kv 7: clip.projector_type str = mlp
clip_model_load: - kv 8: clip.vision.image_size u32 = 378
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 2048
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 15: clip.vision.block_count u32 = 28
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 18: clip.use_gelu bool = true
clip_model_load: - type f32: 285 tensors
clip_model_load: - type f16: 172 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: minicpmv_projector: 0
clip_model_load: model size: 867.61 MB
clip_model_load: metadata size: 0.16 MB
clip_model_load: params backend buffer size = 867.61 MB (457 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 50.10 MB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.20 MiB
llama_new_context_with_model: CPU compute buffer size = 160.01 MiB
llama_new_context_with_model: graph nodes = 921
llama_new_context_with_model: graph splits = 1
encode_image_with_clip: image embedding created: 729 tokens
encode_image_with_clip: image encoded in 422.27 ms by CLIP ( 0.58 ms per image patch)
The image shows a computer cooling rack with several computer parts on it. The rack is placed on a carpeted floor, and there is a couch in the background. The computer parts include a large black computer tower, multiple computer fans, and various other components. The rack is filled with these parts, indicating that it is likely being used for assembling or disassembling computer systems.
llama_print_timings: load time = 4348.98 ms
llama_print_timings: sample time = 2.08 ms / 77 runs ( 0.03 ms per token, 36930.46 tokens per second)
llama_print_timings: prompt eval time = 3340.20 ms / 770 tokens ( 4.34 ms per token, 230.53 tokens per second)
llama_print_timings: eval time = 1033.80 ms / 76 runs ( 13.60 ms per token, 73.51 tokens per second)
llama_print_timings: total time = 5400.74 ms / 846 tokens
I guess the question remains why it worked before and now it doesn't? I have no idea yet :/
That does seem to fix it, although I can't be sure. On first glance llava 1.5 no longer crashes.
The crash was very inconsistent, probably because sometimes this off-by-one access wasn't actually out of bounds memory (maybe due to padding?).
It was extra weird because adding a simple printf before calls to clip_is_minicpmv would prevent it from crashing as well. I suspect this issue was already present for quite some time.
Edit: nope, I think this does not solve the issue. I am still getting intermittent segfaults
@LostRuins I just tried release builds, in my case only the debug builds (LLAMA_DEBUG=1) crashed on this assert, release build worked without problems. So this may be an entirely unrelated problem after all. I can't reproduce crashes in release builds.
I'm not hitting any assert. I am getting a segmentation fault
exception: access violation reading 0x0000657669736E65
Adding the abovementioned print statements before every call to clip_is_minicpmv (temporarily) resolves this, but that's not a proper solution - there's definitely still some out of bounds access going on.
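A side observation, offered as a sketch and not as a diagnosis: the faulting address in the access violation above is made up entirely of printable ASCII bytes. Decoding it in little-endian byte order (an assumption that holds on x86-64) yields a text fragment, which would be consistent with a pointer field holding leftover string data instead of a real address. The helper name decode_le_ascii is hypothetical and not part of llama.cpp.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: interpret a 64-bit value as 8 bytes in little-endian
// order and return the printable ASCII characters it contains.
// Zero bytes are skipped; other non-printable bytes are shown as '?'.
std::string decode_le_ascii(uint64_t value) {
    std::string out;
    for (int i = 0; i < 8; i++) {
        // Byte i is the i-th lowest byte, i.e. the i-th byte in memory on x86-64.
        unsigned char byte = static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
        if (byte == 0) {
            continue;
        }
        out += (byte >= 0x20 && byte < 0x7F) ? static_cast<char>(byte) : '?';
    }
    return out;
}
```

Calling decode_le_ascii(0x0000657669736E65ULL) returns the string "ensive". A real heap or stack address would not normally decode to readable text like this.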
It finally crashed. I guess the important part is LLAMA_CUDA=1.
I guess the question remains why it worked before and now it doesn't? I have no idea yet :/
This assert was added fairly recently (in #6210), so previously this wouldn't be noticed even in debug builds. It would cause wrong data to be returned, but since more tensors are allocated in the same buffer, it is not likely to cause it to crash with an invalid access. It looks like a logic error in the clip implementation, and it may have affected the quality of the generation.
Try this:
diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 342042ff..8ce4add1 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -1108,7 +1108,7 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
}
}
- clip_ctx * new_clip = new clip_ctx;
+ clip_ctx * new_clip = new clip_ctx{};
// update projector type
{
I noticed that the default constructor of clip_ctx didn't initialize the fields, so they were basically all filled with garbage:
Thread 1 "llama-llava-cli" hit Breakpoint 6, clip_model_load (fname=0x55556334d210 "/mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf", verbosity=1) at examples/llava/clip.cpp:1111
1111 clip_ctx * new_clip = new clip_ctx;
(gdb) n
1115 int idx = gguf_find_key(ctx, KEY_PROJ_TYPE);
(gdb) print *new_clip
$10 = {has_text_encoder = false, has_vision_encoder = false, has_llava_projector = false, has_minicpmv_projector = false,
minicpmv_version = 2, vision_model = {hparams = {image_size = -283352320, patch_size = 32767, hidden_size = 1664890336,
n_intermediate = 21845, projection_dim = 1664890336, n_head = 21845, n_layer = 779485184, eps = 5.3529077e-11,
mm_patch_merge_type = "flat", '\000' <repeats 27 times>, image_grid_pinpoints = {0, 1, 854052736, 0, 20, 0, 1818373750,
909258347, 1953784110, 779509614, 1935763810, 1, 1152, 0, 0, 856706944, 0, 24, 0, 1818373750, 909258347, 1953784110,
1970233198, 1702309492, 1952999273, 2, 1152, 0, 1152, 0, 1, 856711552}, image_crop_resolution = 0},
class_embedding = 0x6c622e7600000000, patch_embeddings = 0x7474612e36322e6b, patch_bias = 0x69622e74756f5f6e,
position_embeddings = 0x480000000017361, pre_ln_w = 0x0, pre_ln_b = 0x3338e1800000,
layers = std::vector of length 0, capacity 0, post_ln_w = 0x17468676965, post_ln_b = 0x48000,
projection = 0x38f3800000000000, mm_0_w = 0x0, mm_0_b = 0x0, mm_2_w = 0x0, mm_2_b = 0x0, image_newline = 0x0,
mm_1_w = 0x0, mm_1_b = 0x0, mm_3_w = 0x0, mm_3_b = 0x0, mm_4_w = 0x0, mm_4_b = 0x0, mm_model_mlp_1_w = 0x4800000,
mm_model_mlp_1_b = 0x10d00000, mm_model_mlp_3_w = 0x1780000000010000, mm_model_mlp_3_b = 0x16000000003339,
mm_model_block_1_block_0_0_w = 0x2e76000000000000, mm_model_block_1_block_0_1_w = 0x662e36322e6b6c62,
mm_model_block_1_block_0_1_b = 0x2e6e776f645f6e66, mm_model_block_1_block_1_fc1_w = 0x173616962,
mm_model_block_1_block_1_fc1_b = 0x10d0, mm_model_block_1_block_1_fc2_w = 0x33d0678000000000,
mm_model_block_1_block_1_fc2_b = 0x1600000000, mm_model_block_1_block_2_0_w = 0x6c622e7600000000,
mm_model_block_1_block_2_1_w = 0x6e66662e36322e6b, mm_model_block_1_block_2_1_b = 0x676965772e70755f,
mm_model_block_2_block_0_0_w = 0x10d0000000027468, mm_model_block_2_block_0_1_w = 0x480000000000000,
mm_model_block_2_block_0_1_b = 0x1000000000000, mm_model_block_2_block_1_fc1_w = 0x33d0aac00000,
mm_model_block_2_block_1_fc1_b = 0x140000, mm_model_block_2_block_1_fc2_w = 0x2e6b6c622e760000,
mm_model_block_2_block_1_fc2_b = 0x755f6e66662e3632, mm_model_block_2_block_2_0_w = 0x1736169622e70,
mm_model_block_2_block_2_1_w = 0x4800000, mm_model_block_2_block_2_1_b = 0xfac0000000000000,
mm_model_mlp_0_w = 0x13000000003467, mm_model_mlp_0_b = 0x2e76000000000000, mm_model_mlp_2_w = 0x6c2e36322e6b6c62,
mm_model_mlp_2_b = 0x68676965772e326e, mm_model_peg_0_w = 0x4800000000174, mm_model_peg_0_b = 0x0,
mm_model_pos_embed_k = 0x34680cc000, mm_model_query = 0x1100, mm_model_proj = 0x322e6b6c622e7600,
mm_model_kv_proj = 0x69622e326e6c2e36, mm_model_attn_q_w = 0x480000000017361, mm_model_attn_q_b = 0x0,
mm_model_attn_k_w = 0x34681ec00000, mm_model_attn_k_b = 0x160000, mm_model_attn_v_w = 0x2e6b6c622e760000,
mm_model_attn_v_b = 0x5f6e7474612e3732, mm_model_attn_o_w = 0x7468676965772e71, mm_model_attn_o_b = 0x48000000002,
mm_model_ln_q_w = 0x48000000000, mm_model_ln_q_b = 0x100000000, mm_model_ln_kv_w = 0x346830c0, mm_model_ln_kv_b = 0x14,
mm_model_ln_post_w = 0x37322e6b6c622e76, mm_model_ln_post_b = 0x2e715f6e7474612e}, proj_type = PROJECTOR_TYPE_MLP,
image_mean = {1.40129846e-45, 1.61429583e-42, 0}, image_std = {0, 2.69506927e-07, 0}, use_gelu = false, ftype = 1,
--Type <RET> for more, q to quit, c to continue without paging--
has_class_embedding = true, has_pre_norm = true, has_post_norm = false, has_patch_bias = false,
ctx_gguf = 0x7474612e37322e6b, ctx_data = 0x676965772e6b5f6e, buf_compute_meta = std::vector of length 0, capacity 0,
params_buffer = 0x0, backend = 0x0, compute_alloc = 0x0, load_image_size = 0x5f6e7474612e3732}
After the change:
Thread 1 "llama-llava-cli" hit Breakpoint 6, clip_model_load (fname=0x55556334d210 "/mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf", verbosity=1) at examples/llava/clip.cpp:1111
1111 clip_ctx * new_clip = new clip_ctx{};
(gdb) n
1115 int idx = gguf_find_key(ctx, KEY_PROJ_TYPE);
(gdb) print *new_clip
$11 = {has_text_encoder = false, has_vision_encoder = false, has_llava_projector = false, has_minicpmv_projector = false,
minicpmv_version = 2, vision_model = {hparams = {image_size = 0, patch_size = 0, hidden_size = 0, n_intermediate = 0,
projection_dim = 0, n_head = 0, n_layer = 0, eps = 0, mm_patch_merge_type = "flat", '\000' <repeats 27 times>,
image_grid_pinpoints = {0 <repeats 32 times>}, image_crop_resolution = 0}, class_embedding = 0x0,
patch_embeddings = 0x0, patch_bias = 0x0, position_embeddings = 0x0, pre_ln_w = 0x0, pre_ln_b = 0x0,
layers = std::vector of length 0, capacity 0, post_ln_w = 0x0, post_ln_b = 0x0, projection = 0x0, mm_0_w = 0x0,
mm_0_b = 0x0, mm_2_w = 0x0, mm_2_b = 0x0, image_newline = 0x0, mm_1_w = 0x0, mm_1_b = 0x0, mm_3_w = 0x0, mm_3_b = 0x0,
mm_4_w = 0x0, mm_4_b = 0x0, mm_model_mlp_1_w = 0x0, mm_model_mlp_1_b = 0x0, mm_model_mlp_3_w = 0x0,
mm_model_mlp_3_b = 0x0, mm_model_block_1_block_0_0_w = 0x0, mm_model_block_1_block_0_1_w = 0x0,
mm_model_block_1_block_0_1_b = 0x0, mm_model_block_1_block_1_fc1_w = 0x0, mm_model_block_1_block_1_fc1_b = 0x0,
mm_model_block_1_block_1_fc2_w = 0x0, mm_model_block_1_block_1_fc2_b = 0x0, mm_model_block_1_block_2_0_w = 0x0,
mm_model_block_1_block_2_1_w = 0x0, mm_model_block_1_block_2_1_b = 0x0, mm_model_block_2_block_0_0_w = 0x0,
mm_model_block_2_block_0_1_w = 0x0, mm_model_block_2_block_0_1_b = 0x0, mm_model_block_2_block_1_fc1_w = 0x0,
mm_model_block_2_block_1_fc1_b = 0x0, mm_model_block_2_block_1_fc2_w = 0x0, mm_model_block_2_block_1_fc2_b = 0x0,
mm_model_block_2_block_2_0_w = 0x0, mm_model_block_2_block_2_1_w = 0x0, mm_model_block_2_block_2_1_b = 0x0,
mm_model_mlp_0_w = 0x0, mm_model_mlp_0_b = 0x0, mm_model_mlp_2_w = 0x0, mm_model_mlp_2_b = 0x0, mm_model_peg_0_w = 0x0,
mm_model_peg_0_b = 0x0, mm_model_pos_embed_k = 0x0, mm_model_query = 0x0, mm_model_proj = 0x0, mm_model_kv_proj = 0x0,
mm_model_attn_q_w = 0x0, mm_model_attn_q_b = 0x0, mm_model_attn_k_w = 0x0, mm_model_attn_k_b = 0x0,
mm_model_attn_v_w = 0x0, mm_model_attn_v_b = 0x0, mm_model_attn_o_w = 0x0, mm_model_attn_o_b = 0x0,
mm_model_ln_q_w = 0x0, mm_model_ln_q_b = 0x0, mm_model_ln_kv_w = 0x0, mm_model_ln_kv_b = 0x0, mm_model_ln_post_w = 0x0,
mm_model_ln_post_b = 0x0}, proj_type = PROJECTOR_TYPE_MLP, image_mean = {0, 0, 0}, image_std = {0, 0, 0},
use_gelu = false, ftype = 1, has_class_embedding = true, has_pre_norm = true, has_post_norm = false,
has_patch_bias = false, ctx_gguf = 0x0, ctx_data = 0x0, buf_compute_meta = std::vector of length 0, capacity 0,
params_buffer = 0x0, backend = 0x0, compute_alloc = 0x0, load_image_size = 0x0}
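The difference between the two dumps comes down to default-initialization versus value-initialization in C++. The sketch below uses demo_ctx, a hypothetical and much smaller stand-in for clip_ctx (the field names are borrowed from the dump, the struct itself is an assumption), to show which members the braces affect.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, much smaller stand-in for clip_ctx (not the real definition).
struct demo_ctx {
    bool has_vision_encoder = false;    // default member initializer: always applied
    int  minicpmv_version   = 2;        // default member initializer: always applied
    int   image_size;                   // no initializer
    float eps;                          // no initializer
    void * backend;                     // no initializer
    std::vector<int> buf_compute_meta;  // class type: its constructor always runs
};

// Default-initialization (`new demo_ctx`) would leave image_size, eps and
// backend with indeterminate values, and reading them is undefined behavior,
// so that form is deliberately not exercised here.

// Value-initialization (`new demo_ctx{}`): members without an initializer
// are zeroed, members with a default member initializer keep that default.
demo_ctx * make_zeroed_ctx() {
    return new demo_ctx{};
}
```

This matches the pattern in the second dump: fields such as minicpmv_version keep their declared defaults, while every pointer and plain numeric field without an initializer reads as zero. The caller owns the returned object and must delete it.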
@fairydreaming That one line change fixed it
@fairydreaming Not specifically related to your fix, I just noticed it is not offloading any layers to the GPU. Is this normal?
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors: CPU buffer size = 2706.27 MiB
@saket424 yeah, I didn't use -ngl option, so it didn't offload any layers.
@monatis can you take a look at this code:
I think it's a rewritten form of your original llava 1.5 code:
struct ggml_tensor * patches = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, num_patches);
ggml_allocr_alloc(ctx->alloc, patches);
if (!ggml_allocr_is_measure(ctx->alloc)) {
for (int i = 0; i < num_patches; ++i) {
ggml_set_i32_1d(patches, i, i+1);
}
}
Do you remember what the purpose of i + 1 is? Is it related to the vision feature select strategy? I found the following in the transformers library:
(note selected_image_feature[:, 1:] when vision_feature_select_strategy is default)
Since i increases from 0 to num_patches - 1, i + 1 will have the value num_patches at the end, which is outside the valid range of the embeddings tensor dimension and causes the assertion failure in the GGML get rows operation.
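The off-by-one can be checked with the numbers from this thread. The sketch below is a standalone illustration, not code from clip.cpp: the helper names are hypothetical, and the patch-count formula (image size divided by patch size, squared) is an assumption that happens to reproduce the 729 seen in the gdb session for image_size 378 and patch_size 14.

```cpp
#include <cassert>
#include <vector>

// Assumed formula: a square image split into square patches.
int num_patches_for(int image_size, int patch_size) {
    const int per_side = image_size / patch_size;
    return per_side * per_side;
}

// Hypothetical helper: builds the row indices 0+offset .. num_patches-1+offset.
// offset = 1 corresponds to `i + 1`, offset = 0 to the patched `i`.
std::vector<int> make_patch_indices(int num_patches, int offset) {
    std::vector<int> indices(num_patches);
    for (int i = 0; i < num_patches; i++) {
        indices[i] = i + offset;
    }
    return indices;
}

// Same condition as the failing assert: every index i01 must satisfy
// 0 <= i01 < ne01, where ne01 is the number of rows in the source tensor.
bool indices_in_range(const std::vector<int> & indices, int ne01) {
    for (int i01 : indices) {
        if (i01 < 0 || i01 >= ne01) {
            return false;
        }
    }
    return true;
}
```

With 729 patches and a 729-row source tensor, make_patch_indices(729, 1) fails the range check because its last index is 729, while make_patch_indices(729, 0) passes. If the source tensor had one extra leading row (for example a class token, which is what the [:, 1:] slice in transformers skips), the i + 1 form would be in range for all 730 rows; whether that was the original intent is not confirmed in this thread.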
What happened?
export LLAMA_CUDA=1 # only for NVIDIA CUDA
export CUDA_DOCKER_ARCH=compute_86
make -j$(nproc) NVCC=/usr/local/cuda/bin/nvcc
./llama-llava-cli -m ./m2/moondream2-text-model-f16.gguf --mmproj ./m2/moondream2-mmproj-f16.gguf --image ./assets/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048
The result is a core dump.
Before this commit there was no crash.
Since MiniCPM-V-2.6 has a completely separate CLI, I did not expect it to affect llama-llava-cli, which moondream uses.
The crash is only observed on Linux with CUDA, not on Mac.
Name and Version
Crash with version 3598.
No crash with version 3597:
./llama-cli --version
version: 3597 (ee2984bd)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output