continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

Cannot index projects, embeddings errors 500 from ollama #1860

Open · Meister1593 opened this issue 4 months ago

Meister1593 commented 4 months ago

Before submitting your bug report

Relevant environment info

- OS: Archlinux
- Continue: 0.8.43
- IDE: VSCodium 1.89.1
- Model: Ollama 0.2.8, nomic-embed-text-v1.5
- config.json:

{
  "models": [
    {
      "title": "Llama 3",
      "provider": "ollama",
      "model": "llama3"
    },
    {
      "title": "Ollama",
      "provider": "ollama",
      "contextLength": 1024,
      "model": "AUTODETECT"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Starcoder 3b",
    "provider": "ollama",
    "model": "starcoder2:3b"
  },
  "allowAnonymousTelemetry": false,
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}

Description

When trying to index any project, indexing runs indefinitely and never finishes, causing major CPU/memory usage in the process

To reproduce

  1. Install the Continue release
  2. Open any folder with project files
  3. Indexing starts
  4. Indexing never finishes; it stays in the indexing state

Log output

Plugin logs:
/home/plyshka/Documents/Code/Kubernetes/plyshserver/coturn/configmap.yaml with provider: OllamaEmbeddingsProvider::nomic-embed-text: Error: Failed to embed chunk: {"error":"llama runner process has terminated: signal: aborted (core dumped)"}

Ollama logs:
time=2024-07-29T03:16:58.535+05:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
[GIN] 2024/07/29 - 03:16:58 | 500 |  3.413399187s |       127.0.0.1 | POST     "/api/embeddings"
time=2024-07-29T03:16:58.546+05:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=13 layers.offload=0 layers.split="" memory.available="[14.1 GiB]" memory.required.full="574.9 MiB" memory.required.partial="0 B" memory.required.kv="96.0 MiB" memory.required.allocations="[574.9 MiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"
time=2024-07-29T03:16:58.546+05:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama4031942035/runners/cpu_avx2/ollama_llama_server --model /home/plyshka/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 37837"
time=2024-07-29T03:16:58.546+05:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-29T03:16:58.546+05:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
time=2024-07-29T03:16:58.546+05:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3440 commit="d94c6e0cc" tid="139283458398016" timestamp=1722205018
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="139283458398016" timestamp=1722205018 total_threads=12
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="37837" tid="139283458398016" timestamp=1722205018
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /home/plyshka/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = nomic-bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 137M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 136.73 M
llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = nomic-embed-text-v1.5
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: max token length = 21
llm_load_tensors: ggml ctx size =    0.05 MiB
llm_load_tensors:        CPU buffer size =   260.86 MiB
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1152.00 MiB
llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
[1722205018] warming up the model with an empty run
llama_new_context_with_model:        CPU compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 1
/usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
time=2024-07-29T03:16:58.998+05:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server not responding"
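
For reference, the same embeddings endpoint the extension calls can be exercised directly against Ollama (a minimal sketch, assuming the default localhost:11434 address and the nomic-embed-text model from the config above) to check whether the crash reproduces without Continue involved:

# Assumption: Ollama listens on the default port; the model name is taken from config.json above.
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "hello world"}'
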
spew commented 4 months ago

This may be fixed with some of the improvements that have been done in the dev branch:

* [Simplify walkDir(...) and improve performance by roughly 10x in larger repos #1806](https://github.com/continuedev/continue/pull/1806)

* [Remove a copy / pasta section that duplicated the work done in ChunkCodebaseIndexer #1834](https://github.com/continuedev/continue/pull/1834)

* [Offload all token counting to worker processes as well as some optimizations to do more token counting in parallel #1842](https://github.com/continuedev/continue/pull/1842)

Can you try 0.8.44-vscode?

Meister1593 commented 4 months ago

This may be fixed with some of the improvements that have been done in the dev branch:

* [Simplify walkDir(...) and improve performance by roughly 10x in larger repos #1806](https://github.com/continuedev/continue/pull/1806)

* [Remove a copy / pasta section that duplicated the work done in ChunkCodebaseIndexer #1834](https://github.com/continuedev/continue/pull/1834)

* [Offload all token counting to worker processes as well as some optimizations to do more token counting in parallel #1842](https://github.com/continuedev/continue/pull/1842)

Can you try 0.8.44-vscode?

Just updated; no change. It repeatedly crashes with the same assert error, even when indexing is paused, for some reason...

Patrick-Erichsen commented 4 months ago

@Meister1593 - mind trying the latest pre-release as well? v0.9.191 (pre-release)

nayeemtby commented 3 months ago

Same issue. Manjaro, ollama-cuda 0.3.3, nomic-embed-text:latest. Extension version v0.8.46.

fry69 commented 3 months ago

This looks like a crashing Ollama server to me; maybe ask in the Ollama Discord or GitHub repo what this crash log means?

Meister1593 commented 3 months ago

@Meister1593 - mind trying the latest pre-release as well? v0.9.191 (pre-release)

Sorry for the wait, I did not notice this ping, but checking now on v0.9.196 results in the attached screenshot. By the looks of it, nothing has changed.

Meister1593 commented 3 months ago

This looks like a crashing Ollama server to me, maybe ask in the Ollama Discord or GitHub repo what this crash log means?

Will try to check with them!

nayeemtby commented 3 months ago

My Ollama was also crashing with nomic-embed. Using mxbai-embed-large fixed the embedding issues, and indexing works fine.

Meister1593 commented 3 months ago

mxbai-embed-large

Just tried to use it; it seems like it has started to embed just fine?

*Just needed to pull it after changing the config and switching projects around.
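
For anyone else hitting this, the workaround boils down to pulling the model first:

ollama pull mxbai-embed-large

and then pointing embeddingsProvider at it in config.json (a sketch based on the config above; the rest of the file stays unchanged):

"embeddingsProvider": {
  "provider": "ollama",
  "model": "mxbai-embed-large"
}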

nayeemtby commented 3 months ago

You can check Ollama's request log to see if it's working. Embeddings didn't seem to take long.
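
On a systemd-based install (Arch/Manjaro), one way to follow that log is something like this (a sketch, assuming Ollama runs as the standard systemd service):

# Successful embedding requests show up as POST /api/embeddings with status 200,
# failures with 500, as in the log above.
journalctl -u ollama -f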

Meister1593 commented 3 months ago

You can check the request log of ollama to see if it's working. Embeddings didn't seem to take long

Yes, embedding requests are frequently coming back with a 200 status code. I'm checking a pretty big codebase, so this might take a while...