continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

Embeddings Provider has typo in model configuration params #936

Open Pawe98 opened 8 months ago

Pawe98 commented 8 months ago


Relevant environment info

- OS: Windows 10
- Continue: 0.8.14
- IDE: VS Code 1.87.0

Description

I have the same Ollama model configured as the embeddingsProvider and as a chat model. I can see in the Ollama logs that it switches between the two and redeploys the same model, but with a different configuration each time. One difference I could easily identify (probably due to a typo) is the BOS token:

llm_load_print_meta: BOS token = 32013 '<｜begin▁of▁sentence｜>'

vs.

llm_load_print_meta: BOS token = 32013 '<｜begin▁of▁sentence｜>''

As you can see, there is one single quote too many at the end.

Model: deepseek-coder:6.7b
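
A small helper like the following can diff the metadata of the two loads at once (illustration only; it assumes the raw Ollama server log is saved to a file, so the llm_load_print_meta lines of each load are consecutive):

import sys

def meta_blocks(path):
    # Collect consecutive runs of llm_load_print_meta lines, one block per model load.
    blocks, current = [], []
    for line in open(path, encoding="utf-8", errors="replace"):
        if line.startswith("llm_load_print_meta:"):
            current.append(line.rstrip())
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

first, second = meta_blocks(sys.argv[1])[:2]
for a, b in zip(first, second):
    if a != b:
        print("-", a)
        print("+", b)

On the log below, the only metadata line it reports is the BOS token one.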

Config.json

{
  "models": [
    {
      "title": "GPT-4 Vision (Free Trial)",
      "provider": "free-trial",
      "model": "gpt-4-vision-preview"
    },
    {
      "title": "GPT-3.5-Turbo (Free Trial)",
      "provider": "free-trial",
      "model": "gpt-3.5-turbo"
    },
    {
      "title": "Gemini Pro (Free Trial)",
      "provider": "free-trial",
      "model": "gemini-pro"
    },
    {
      "title": "Codellama 70b (Free Trial)",
      "provider": "free-trial",
      "model": "codellama-70b"
    },
    {
        "model": "deepseek-coder:6.7b-instruct-q4_K_M",
        "title": "deepseek-coder:6.7b-instruct-q4_K_M",
        "completionOptions": {},
        "apiBase": "http://localhost:11434",
        "provider": "ollama"
    }
  ],
  "slashCommands": [
    {
      "name": "edit",
      "description": "Edit selected code"
    },
    {
      "name": "comment",
      "description": "Write comments for the selected code"
    },
    {
      "name": "share",
      "description": "Export this session as markdown"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "Write a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "contextProviders": [
    {
      "name": "open",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {
        "nRetrieve": 25,
        "nFinal": 5,
        "useReranking": true
      }
    }
  ],
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "deepseek-coder:6.7b-instruct-q4_K_M",
    "apiBase": "http://localhost:11434"
  }
}

To reproduce

Use the same Ollama model as the embeddings provider and as a chat model, then perform a query that uses embeddings, such as @Codebase.
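
The reload can also be triggered outside the IDE by calling the two Ollama endpoints directly. A rough sketch (model and apiBase taken from the config above; the num_ctx of 4096 on the chat call is an assumption based on the second load in the log, since Ollama reloads the model whenever such options change):

import requests

BASE = "http://localhost:11434"
MODEL = "deepseek-coder:6.7b-instruct-q4_K_M"

# Chat request, asking for a 4096-token context like the second load in the log.
chat = requests.post(f"{BASE}/api/chat", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello"}],
    "options": {"num_ctx": 4096},
    "stream": False,
})
print(chat.json()["message"]["content"])

# Embeddings request with default options (n_ctx 2048): the server logs
# "changing loaded model" and reloads the same weights.
emb = requests.post(f"{BASE}/api/embeddings", json={
    "model": MODEL,
    "prompt": "Hello",
})
print(len(emb.json()["embedding"]))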

Log output

This is the log of a program that loads the same model twice; do you know what the difference between the loads is?

log:

[GIN] 2024/03/06 - 12:59:27 | 200 |          2m5s |       127.0.0.1 | POST     "/api/chat"

time=2024-03-06T12:59:48.815+01:00 level=INFO source=routes.go:78 msg="changing loaded model"

time=2024-03-06T12:59:49.757+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

time=2024-03-06T12:59:49.760+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"

time=2024-03-06T12:59:49.760+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

time=2024-03-06T12:59:49.762+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"

time=2024-03-06T12:59:49.762+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

loading library C:\Users\pawerner\AppData\Local\Temp\ollama664810684\cuda_v11.3\ext_server.dll

time=2024-03-06T12:59:49.768+01:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\\Users\\pawerner\\AppData\\Local\\Temp\\ollama664810684\\cuda_v11.3\\ext_server.dll"

time=2024-03-06T12:59:49.768+01:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"

llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from C:\Users\pawerner\.ollama\models\blobs\sha256-8de39949f334605a7b8d7167723c9ccc926e506f38405f6d00bdf3df12e8dcf9 (version GGUF V3 (latest))

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.

llama_model_loader: - kv   0:                       general.architecture str              = llama

llama_model_loader: - kv   1:                               general.name str              = deepseek-ai

llama_model_loader: - kv   2:                       llama.context_length u32              = 16384

llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096

llama_model_loader: - kv   4:                          llama.block_count u32              = 32

llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008

llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128

llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32

llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32

llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001

llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 100000.000000

llama_model_loader: - kv  11:                    llama.rope.scaling.type str              = linear

llama_model_loader: - kv  12:                  llama.rope.scaling.factor f32              = 4.000000

llama_model_loader: - kv  13:                          general.file_type u32              = 15

llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2

llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32256]   = ["!", "\"", "#", "$", "%", "&", "'", ...

llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32256]   = [0.000000, 0.000000, 0.000000, 0.0000...

llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32256]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,31757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...

llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 32013

llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32021

llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32014

llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true

llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false

llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...

llama_model_loader: - kv  25:               general.quantization_version u32              = 2

llama_model_loader: - type  f32:   65 tensors

llama_model_loader: - type q4_K:  193 tensors

llama_model_loader: - type q6_K:   33 tensors

llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 256/32256 ).

llm_load_print_meta: format           = GGUF V3 (latest)

llm_load_print_meta: arch             = llama

llm_load_print_meta: vocab type       = BPE

llm_load_print_meta: n_vocab          = 32256

llm_load_print_meta: n_merges         = 31757

llm_load_print_meta: n_ctx_train      = 16384

llm_load_print_meta: n_embd           = 4096

llm_load_print_meta: n_head           = 32

llm_load_print_meta: n_head_kv        = 32

llm_load_print_meta: n_layer          = 32

llm_load_print_meta: n_rot            = 128

llm_load_print_meta: n_embd_head_k    = 128

llm_load_print_meta: n_embd_head_v    = 128

llm_load_print_meta: n_gqa            = 1

llm_load_print_meta: n_embd_k_gqa     = 4096

llm_load_print_meta: n_embd_v_gqa     = 4096

llm_load_print_meta: f_norm_eps       = 0.0e+00

llm_load_print_meta: f_norm_rms_eps   = 1.0e-06

llm_load_print_meta: f_clamp_kqv      = 0.0e+00

llm_load_print_meta: f_max_alibi_bias = 0.0e+00

llm_load_print_meta: n_ff             = 11008

llm_load_print_meta: n_expert         = 0

llm_load_print_meta: n_expert_used    = 0

llm_load_print_meta: rope scaling     = linear

llm_load_print_meta: freq_base_train  = 100000.0

llm_load_print_meta: freq_scale_train = 0.25

llm_load_print_meta: n_yarn_orig_ctx  = 16384

llm_load_print_meta: rope_finetuned   = unknown

llm_load_print_meta: model type       = 7B

llm_load_print_meta: model ftype      = Q4_K - Medium

llm_load_print_meta: model params     = 6.74 B

llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW)

llm_load_print_meta: general.name     = deepseek-ai

llm_load_print_meta: BOS token        = 32013 '<｜begin▁of▁sentence｜>'

llm_load_print_meta: EOS token        = 32021 '<|EOT|>'

llm_load_print_meta: PAD token        = 32014 '<｜end▁of▁sentence｜>'

llm_load_print_meta: LF token         = 30 '?'

llm_load_tensors: ggml ctx size =    0.22 MiB

llm_load_tensors: offloading 6 repeating layers to GPU

llm_load_tensors: offloaded 6/33 layers to GPU

llm_load_tensors:        CPU buffer size =  3892.62 MiB

llm_load_tensors:      CUDA0 buffer size =   727.62 MiB

..................................................................................................

llama_new_context_with_model: n_ctx      = 2048

llama_new_context_with_model: freq_base  = 100000.0

llama_new_context_with_model: freq_scale = 0.25

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no

ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes

ggml_init_cublas: found 1 CUDA devices:

  Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes

llama_kv_cache_init:  CUDA_Host KV buffer size =   832.00 MiB

llama_kv_cache_init:      CUDA0 KV buffer size =   192.00 MiB

llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB

llama_new_context_with_model:  CUDA_Host input buffer size   =    13.02 MiB

llama_new_context_with_model:      CUDA0 compute buffer size =   164.01 MiB

llama_new_context_with_model:  CUDA_Host compute buffer size =   168.00 MiB

llama_new_context_with_model: graph splits (measure): 5

time=2024-03-06T12:59:51.675+01:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"

[GIN] 2024/03/06 - 12:59:53 | 200 |    4.2588429s |       127.0.0.1 | POST     "/api/embeddings"

time=2024-03-06T12:59:55.566+01:00 level=INFO source=routes.go:78 msg="changing loaded model"

time=2024-03-06T12:59:56.304+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

time=2024-03-06T12:59:56.308+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"

time=2024-03-06T12:59:56.322+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

time=2024-03-06T12:59:56.325+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"

time=2024-03-06T12:59:56.325+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

loading library C:\Users\pawerner\AppData\Local\Temp\ollama664810684\cuda_v11.3\ext_server.dll

time=2024-03-06T12:59:56.333+01:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\\Users\\pawerner\\AppData\\Local\\Temp\\ollama664810684\\cuda_v11.3\\ext_server.dll"

time=2024-03-06T12:59:56.346+01:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"

llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from C:\Users\pawerner\.ollama\models\blobs\sha256-8de39949f334605a7b8d7167723c9ccc926e506f38405f6d00bdf3df12e8dcf9 (version GGUF V3 (latest))

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.

llama_model_loader: - kv   0:                       general.architecture str              = llama

llama_model_loader: - kv   1:                               general.name str              = deepseek-ai

llama_model_loader: - kv   2:                       llama.context_length u32              = 16384

llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096

llama_model_loader: - kv   4:                          llama.block_count u32              = 32

llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008

llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128

llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32

llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32

llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001

llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 100000.000000

llama_model_loader: - kv  11:                    llama.rope.scaling.type str              = linear

llama_model_loader: - kv  12:                  llama.rope.scaling.factor f32              = 4.000000

llama_model_loader: - kv  13:                          general.file_type u32              = 15

llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2

llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32256]   = ["!", "\"", "#", "$", "%", "&", "'", ...

llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32256]   = [0.000000, 0.000000, 0.000000, 0.0000...

llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32256]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,31757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...

llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 32013

llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32021

llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32014

llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true

llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false

llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...

llama_model_loader: - kv  25:               general.quantization_version u32              = 2

llama_model_loader: - type  f32:   65 tensors

llama_model_loader: - type q4_K:  193 tensors

llama_model_loader: - type q6_K:   33 tensors

llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 256/32256 ).

llm_load_print_meta: format           = GGUF V3 (latest)

llm_load_print_meta: arch             = llama

llm_load_print_meta: vocab type       = BPE

llm_load_print_meta: n_vocab          = 32256

llm_load_print_meta: n_merges         = 31757

llm_load_print_meta: n_ctx_train      = 16384

llm_load_print_meta: n_embd           = 4096

llm_load_print_meta: n_head           = 32

llm_load_print_meta: n_head_kv        = 32

llm_load_print_meta: n_layer          = 32

llm_load_print_meta: n_rot            = 128

llm_load_print_meta: n_embd_head_k    = 128

llm_load_print_meta: n_embd_head_v    = 128

llm_load_print_meta: n_gqa            = 1

llm_load_print_meta: n_embd_k_gqa     = 4096

llm_load_print_meta: n_embd_v_gqa     = 4096

llm_load_print_meta: f_norm_eps       = 0.0e+00

llm_load_print_meta: f_norm_rms_eps   = 1.0e-06

llm_load_print_meta: f_clamp_kqv      = 0.0e+00

llm_load_print_meta: f_max_alibi_bias = 0.0e+00

llm_load_print_meta: n_ff             = 11008

llm_load_print_meta: n_expert         = 0

llm_load_print_meta: n_expert_used    = 0

llm_load_print_meta: rope scaling     = linear

llm_load_print_meta: freq_base_train  = 100000.0

llm_load_print_meta: freq_scale_train = 0.25

llm_load_print_meta: n_yarn_orig_ctx  = 16384

llm_load_print_meta: rope_finetuned   = unknown

llm_load_print_meta: model type       = 7B

llm_load_print_meta: model ftype      = Q4_K - Medium

llm_load_print_meta: model params     = 6.74 B

llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW)

llm_load_print_meta: general.name     = deepseek-ai

llm_load_print_meta: BOS token        = 32013 '<｜begin▁of▁sentence｜>''

llm_load_print_meta: EOS token        = 32021 '<|EOT|>'

llm_load_print_meta: PAD token        = 32014 '<｜end▁of▁sentence｜>'

llm_load_print_meta: LF token         = 30 '?'

llm_load_tensors: ggml ctx size =    0.22 MiB

llm_load_tensors: offloading 4 repeating layers to GPU

llm_load_tensors: offloaded 4/33 layers to GPU

llm_load_tensors:        CPU buffer size =  3892.62 MiB

llm_load_tensors:      CUDA0 buffer size =   495.22 MiB

..................................................................................................

llama_new_context_with_model: n_ctx      = 4096

llama_new_context_with_model: freq_base  = 100000.0

llama_new_context_with_model: freq_scale = 0.25

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no

ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes

ggml_init_cublas: found 1 CUDA devices:

  Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes

llama_kv_cache_init:  CUDA_Host KV buffer size =  1792.00 MiB

llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB

llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB

llama_new_context_with_model:  CUDA_Host input buffer size   =    17.04 MiB

llama_new_context_with_model:      CUDA0 compute buffer size =   296.02 MiB

llama_new_context_with_model:  CUDA_Host compute buffer size =   296.00 MiB

llama_new_context_with_model: graph splits (measure): 5

time=2024-03-06T12:59:58.240+01:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"

[GIN] 2024/03/06 - 13:02:12 | 200 |         2m17s |       127.0.0.1 | POST     "/api/chat"
Pawe98 commented 8 months ago

Other differences found, but maybe they are intentional (it would be cool if I could change them):

llama_new_context_with_model: n_ctx = 2048

vs

llama_new_context_with_model: n_ctx = 4096

and some cache and buffer sizes.
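
At the Ollama API level the two loads could be made identical by sending a matching num_ctx with the embeddings request as well; as far as I can tell the embeddingsProvider config does not expose such options today, which is part of the ask here. Sketch only:

import requests

# Hypothetical workaround at the API level: pin the embeddings context size to
# the same value as the chat load so the model is not reconfigured in between.
resp = requests.post("http://localhost:11434/api/embeddings", json={
    "model": "deepseek-coder:6.7b-instruct-q4_K_M",
    "prompt": "text to embed",
    "options": {"num_ctx": 4096},
})
print(len(resp.json()["embedding"]))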

The general purpose of this issue is (1) the typo bug and (2) to ask for a fix/feature that makes it possible to use the same model for embeddings and chat via Ollama without it being reloaded with a different configuration.