continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

Codestral response is suddenly 4 times slower in Continue when compared to others #1872

Open abishekmuthian opened 1 month ago

abishekmuthian commented 1 month ago


Relevant environment info

- OS: Linux 6.9
- GPU: Nvidia 4090 Mobile (16GB VRAM)
- Provider: Ollama
- Continue: 0.8.43
- IDE: VSCode
- Model: Codestral
- config.json:

{
  "models": [
    {
      "title": "Codestral",
      "provider": "ollama",
      "model": "codestral:latest"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
      "title": "Codestral",
      "provider": "ollama",
      "model": "codestral:latest"
  },
  "allowAnonymousTelemetry": false,
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}

Description

Codestral responses via Ollama have suddenly become very slow with the latest updates: at least 4 times slower in Continue, while response times for the same prompts from other apps like open-webui or curl are very fast. Other models such as deepseek-coder-v2 work fine in Continue.

To reproduce

  1. Set up Continue to use Codestral for chat and tab autocomplete.
  2. Watch the Ollama logs, e.g. in Docker: docker logs --follow ollama
  3. Open the Continue chat and give it any prompt.
  4. Note the response time in the Ollama logs and notice the latency in the chat.
  5. Give the same prompt to Codestral via curl or open-webui and note the response time (see the curl sketch below).
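
For step 5, a minimal sketch of an equivalent request sent directly to Ollama's /api/chat endpoint (assuming Ollama is listening on the default localhost:11434; the prompt is just a placeholder):

curl http://localhost:11434/api/chat -d '{
  "model": "codestral:latest",
  "messages": [
    { "role": "user", "content": "Write a binary search in Python." }
  ],
  "stream": false
}'

The total_duration reported in the JSON response (and the request time printed in the Ollama logs) can then be compared against the latency seen in Continue.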

Log output

Note: Model is already loaded in VRAM before testing.

Ollama logs for Codestral via Continue

[GIN] 2024/07/30 - 10:45:28 | 200 |         1m33s |      172.17.0.1 | POST     "/api/chat"

Ollama logs for Codestral via open-webui

[GIN] 2024/07/30 - 10:46:32 | 200 |  5.142411188s |      172.17.0.1 | POST     "/api/chat"
abishekmuthian commented 1 month ago

The issue seems to be related to context size (https://github.com/continuedev/continue/issues/1776); starting a new session in chat makes Codestral usable, but it is still not as fast as open-webui.
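
For anyone hitting the same thing, a minimal sketch of how the context size could be capped in config.json, assuming Continue's contextLength model option; the 4096 value is only illustrative, not a confirmed fix:

{
  "models": [
    {
      "title": "Codestral",
      "provider": "ollama",
      "model": "codestral:latest",
      "contextLength": 4096
    }
  ]
}

A very large context window can push parts of the model out of VRAM or force re-processing of a long prompt on every request, which would explain timings like the 1m33s above.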

plashenkov commented 4 weeks ago

Can confirm this. Codestral is extremely slow (one word per ~20 seconds) in Continue, while it is blazing fast when used directly from Ollama's console. A striking difference.

P.S. Yep, looks like adjusting context size fixes this.