Closed · Luke100000 closed this issue 3 weeks ago
Bug: Assertion '__n < this->size()' failed.

What happened?

When using an embedding model via Ollama's API, llama.cpp aborts with an assertion error:

Bug: Assertion '__n < this->size()' failed.

I tried nomic-embed-text-v1.5 and all-minilm; both crash. Embedding works fine when the model runs 100% on the CPU, so the failure only appears with GPU offload.

#7592 could be related.
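To make the failing call concrete, here is a minimal reproduction sketch in Python (illustrative, not from the original report: it assumes a default local Ollama install listening on 127.0.0.1:11434 and that the model has been pulled as `nomic-embed-text`; the POST /api/embed route and the 500 response match the log below):

```python
# Hypothetical repro sketch: a single embedding request against Ollama's
# /api/embed endpoint. Assumes Ollama is running on the default
# 127.0.0.1:11434 and `ollama pull nomic-embed-text` has already been done.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/embed",
    json={"model": "nomic-embed-text", "input": "The quick brown fox"},
    timeout=60,
)
# With GPU offload enabled this returns HTTP 500 once the runner aborts
# (see the [GIN] line in the log below); on CPU-only it succeeds.
print(resp.status_code, resp.json())
```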
Name and Version

0.3.6 (ollama-cuda from the AUR); I was not able to find the llama.cpp version it bundles, though the runner log below reports build=3535, commit "1e6f6554a".
What operating system are you seeing the problem on?

Linux

Relevant log output
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.840+02:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=GPU-e919e64e-b05e-1b0e-79fe-4d6f163c34c8 parallel=4 available=11899699200 required="1.0 GiB"
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.840+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[11.1 GiB]" memory.required.full="1.0 GiB" memory.required.partial="1.0 GiB" memory.required.kv="96.0 MiB" memory.required.allocations="[1.0 GiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.842+02:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama140604727/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 4 --port 35395"
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.842+02:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.842+02:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
Sep 25 15:40:18 hostname ollama[268657]: time=2024-09-25T15:40:18.842+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
Sep 25 15:40:18 hostname ollama[269097]: INFO [main] build info | build=3535 commit="1e6f6554a" tid="140699372605440" timestamp=1727271618
Sep 25 15:40:18 hostname ollama[269097]: INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140699372605440" timestamp=1727271618 total_threads=12
Sep 25 15:40:18 hostname ollama[269097]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="35395" tid="140699372605440" timestamp=1727271618
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 0: general.architecture str = nomic-bert
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 8: general.file_type u32 = 1
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 22: tokenizer.ggml.cls_token_id u32 = 101
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - kv 23: tokenizer.ggml.mask_token_id u32 = 103
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - type f32: 51 tensors
Sep 25 15:40:18 hostname ollama[268657]: llama_model_loader: - type f16: 61 tensors
Sep 25 15:40:18 hostname ollama[268657]: llm_load_vocab: special tokens cache size = 5
Sep 25 15:40:18 hostname ollama[268657]: llm_load_vocab: token to piece cache size = 0.2032 MB
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: arch = nomic-bert
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: vocab type = WPM
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_vocab = 30522
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_merges = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: vocab_only = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_ctx_train = 2048
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd = 768
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_layer = 12
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_head = 12
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_head_kv = 12
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_rot = 64
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_swa = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd_head_k = 64
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd_head_v = 64
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_gqa = 1
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd_k_gqa = 768
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_embd_v_gqa = 768
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_norm_eps = 1.0e-12
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_norm_rms_eps = 0.0e+00
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_ff = 3072
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_expert = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_expert_used = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: causal attn = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: pooling type = 1
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: rope type = 2
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: rope scaling = linear
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: freq_base_train = 1000.0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: freq_scale_train = 1
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: n_ctx_orig_yarn = 2048
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: rope_finetuned = unknown
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: ssm_d_conv = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: ssm_d_inner = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: ssm_d_state = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: ssm_dt_rank = 0
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: model type = 137M
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: model ftype = F16
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: model params = 136.73 M
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: model size = 260.86 MiB (16.00 BPW)
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: general.name = nomic-embed-text-v1.5
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: BOS token = 101 '[CLS]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: EOS token = 102 '[SEP]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: UNK token = 100 '[UNK]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: SEP token = 102 '[SEP]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: PAD token = 0 '[PAD]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: CLS token = 101 '[CLS]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: MASK token = 103 '[MASK]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: LF token = 0 '[PAD]'
Sep 25 15:40:18 hostname ollama[268657]: llm_load_print_meta: max token length = 21
Sep 25 15:40:18 hostname ollama[268657]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 25 15:40:18 hostname ollama[268657]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 15:40:18 hostname ollama[268657]: ggml_cuda_init: found 1 CUDA devices:
Sep 25 15:40:18 hostname ollama[268657]: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Sep 25 15:40:18 hostname ollama[268657]: llm_load_tensors: ggml ctx size = 0.10 MiB
Sep 25 15:40:19 hostname ollama[268657]: time=2024-09-25T15:40:19.093+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors: offloading 12 repeating layers to GPU
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors: offloaded 13/13 layers to GPU
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors: CPU buffer size = 44.72 MiB
Sep 25 15:40:19 hostname ollama[268657]: llm_load_tensors: CUDA0 buffer size = 216.15 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: n_ctx = 32768
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: n_batch = 512
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: n_ubatch = 512
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: flash_attn = 0
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: freq_base = 1000.0
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: freq_scale = 1
Sep 25 15:40:19 hostname ollama[268657]: llama_kv_cache_init: CUDA0 KV buffer size = 1152.00 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: KV self size = 1152.00 MiB, K (f16): 576.00 MiB, V (f16): 576.00 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: CPU output buffer size = 0.00 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: CUDA0 compute buffer size = 22.01 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: CUDA_Host compute buffer size = 2.51 MiB
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: graph nodes = 453
Sep 25 15:40:19 hostname ollama[268657]: llama_new_context_with_model: graph splits = 2
Sep 25 15:40:19 hostname ollama[269097]: [1727271619] warming up the model with an empty run
Sep 25 15:40:19 hostname ollama[268657]: /usr/include/c++/14.2.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
Sep 25 15:40:20 hostname ollama[268657]: time=2024-09-25T15:40:20.297+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server not responding"
Sep 25 15:40:21 hostname ollama[268657]: time=2024-09-25T15:40:21.187+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
Sep 25 15:40:21 hostname ollama[268657]: time=2024-09-25T15:40:21.438+02:00 level=ERROR source=sched.go:451 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
Sep 25 15:40:21 hostname ollama[268657]: [GIN] 2024/09/25 - 15:40:21 | 500 | 2.676901285s | 127.0.0.1 | POST "/api/embed"
Sep 25 15:40:26 hostname ollama[268657]: time=2024-09-25T15:40:26.512+02:00 level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.07381477 model=/var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
Sep 25 15:40:26 hostname ollama[268657]: time=2024-09-25T15:40:26.761+02:00 level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.323291756 model=/var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
Sep 25 15:40:27 hostname ollama[268657]: time=2024-09-25T15:40:27.012+02:00 level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.573759559 model=/var/lib/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
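For comparison with the "works fine if 100% CPU" observation above, here is a sketch of the same request forced onto the CPU. This is an assumption, not something verified in the report: it presumes /api/embed forwards an "options" object (including num_gpu) to the runner the way /api/generate does, and that the response carries an "embeddings" array per Ollama's documented API shape.

```python
# Hypothetical CPU-only variant: load the model with zero GPU layers.
# The "options" pass-through and the "embeddings" response key are assumptions
# based on Ollama's documented API, not taken from the original report.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": "The quick brown fox",
        "options": {"num_gpu": 0},  # keep every layer on the CPU
    },
    timeout=60,
)
resp.raise_for_status()
print(len(resp.json()["embeddings"][0]))  # embedding width, e.g. 768
```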
This issue was closed because it has been inactive for 14 days since being marked as stale.