continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

Codestral response is suddenly 4 times slower in Continue when compared to others #1872

Open abishekmuthian opened 1 month ago

abishekmuthian commented 1 month ago


Relevant environment info

- OS: Linux 6.9
- GPU: Nvidia 4090 Mobile (16GB VRAM)
- Provider: Ollama
- Continue: 0.8.43
- IDE: VSCode
- Model: Codestral
- config.json:

{
  "models": [
    {
      "title": "Codestral",
      "provider": "ollama",
      "model": "codestral:latest"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
      "title": "Codestral",
      "provider": "ollama",
      "model": "codestral:latest"
  },
  "allowAnonymousTelemetry": false,
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}

Description

Codestral responses via Ollama have suddenly become very slow with the latest updates: at least 4 times slower in Continue, while response times for the same prompts from other apps like open-webui or curl are very fast. Other models such as deepseek-coder-v2 work fine in Continue.

To reproduce

  1. Set up Continue to use Codestral for chat and tab autocomplete.
  2. Watch the Ollama logs, e.g. in Docker: docker logs --follow ollama
  3. Open the Continue chat and give it any prompt.
  4. Note the response time in the Ollama logs and notice the latency in the chat.
  5. Give the same prompt to Codestral via curl or open-webui and note the response time (see the curl sketch below).
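
For step 5, a minimal sketch of an equivalent request sent directly to Ollama's /api/chat endpoint (assuming Ollama is listening on the default localhost:11434; the prompt is just a placeholder):

curl http://localhost:11434/api/chat -d '{
  "model": "codestral:latest",
  "messages": [
    { "role": "user", "content": "Write a binary search in Python." }
  ],
  "stream": false
}'

The total_duration reported in the JSON response (and the request time printed in the Ollama logs) can then be compared against the latency seen in Continue.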

Log output

Note: Model is already loaded in VRAM before testing.

Ollama logs for Codestral via Continue

[GIN] 2024/07/30 - 10:45:28 | 200 |         1m33s |      172.17.0.1 | POST     "/api/chat"

Ollama logs for Codestral via open-webui

[GIN] 2024/07/30 - 10:46:32 | 200 |  5.142411188s |      172.17.0.1 | POST     "/api/chat"
abishekmuthian commented 1 month ago

The issue seems to be related to context size (https://github.com/continuedev/continue/issues/1776); starting a new session in chat makes Codestral usable, but it is still not as fast as open-webui.
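
For anyone hitting the same thing, a minimal sketch of how the context size could be capped in config.json, assuming Continue's contextLength model option; the 4096 value is only illustrative, not a confirmed fix:

{
  "models": [
    {
      "title": "Codestral",
      "provider": "ollama",
      "model": "codestral:latest",
      "contextLength": 4096
    }
  ]
}

A very large context window can push parts of the model out of VRAM or force re-processing of a long prompt on every request, which would explain timings like the 1m33s above.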

plashenkov commented 4 weeks ago

Can confirm this. Codestral is extremely slow (one word per ~20 seconds) in Continue, while it is blazing fast when used directly from Ollama's console. A striking difference.

P.S. Yep, looks like adjusting context size fixes this.