continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains.
https://docs.continue.dev/
Apache License 2.0

Canceled prompt continues streaming in the background #1449

Open PlayerLegend opened 3 months ago

PlayerLegend commented 3 months ago


Relevant environment info

- OS: Debian GNU/Linux 12 (bookworm) x86_64
- Continue: v0.8.40
- IDE: vscode
- LLM: llama.cpp server

Description

When I hit cancel while the extension is streaming tokens, the extension stops relaying tokens to the screen, but the generation keeps running on the server. If I queue up another prompt, I then have to wait until the previous request finishes before the new one starts.

I believe this is an issue with how the extension handles the connection to the llama.cpp server: restarting vscode immediately stops the generation on the server, and other frontends I have used do not exhibit this behavior when a generation is stopped. I would expect that cancelling a generation in the UI also cancels it on the server, which would let a subsequent generation start right away and save resources.

I am unsure if it is relevant, but here is my config:

{
  "models": [
    {
      "title": "Llama CPP",
      "provider": "llama.cpp",
      "model": "MODEL_NAME",
      "apiBase": "http://my.domain:8080",
      "completionOptions": {
        "stop": ["<|im_end|>"]
      }
    }
  ]
}
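
For reference, below is a minimal sketch of the behavior I would expect. This is not Continue's code, just an illustration in Node 18+ TypeScript using the built-in fetch, pointed at the apiBase and stop token from my config and llama.cpp's /completion endpoint: aborting the request closes the connection, which is the same thing that makes the generation stop when vscode is restarted.

// Sketch only (not the extension's code): stream a completion from the
// llama.cpp server in my config and cancel it by aborting the request.
async function streamCompletion(prompt: string, signal: AbortSignal) {
  const res = await fetch("http://my.domain:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, stream: true, stop: ["<|im_end|>"] }),
    signal, // aborting this signal tears down the HTTP connection
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    process.stdout.write(decoder.decode(value)); // print the raw streamed chunks
  }
}

const controller = new AbortController();
streamCompletion("Write a very long story.", controller.signal)
  .catch(() => { /* an AbortError is expected after cancelling */ });
// Cancelling from the UI should be equivalent to this abort; once the
// connection goes away the server stops generating, just like it does
// when vscode is restarted.
setTimeout(() => controller.abort(), 2000);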

To reproduce

No response

Log output

No response

goodov commented 3 months ago

This is a pretty annoying issue. If a response is very long and you decide to cancel the generation, you are left with an unresponsive extension for some time. This can happen when you experiment with models or misconfigure a template, but most of the time even a valid but long reply leads to the same problem.

I've looked into the extension code, and it looks like the cancel signal is not passed to the LLM fetchers at all, so it seems this is currently by design.
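
For illustration, the fix would roughly amount to threading an AbortSignal from the UI cancel action down to the fetch call. This is just a sketch with made-up names, not the actual extension code:

// Hypothetical streaming helper: the important part is that the signal
// created when the user hits cancel is forwarded to fetch, so aborting
// closes the HTTP connection to llama.cpp/ollama.
async function* streamChat(
  url: string,
  body: unknown,
  signal: AbortSignal,
): AsyncGenerator<string> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
    signal, // <- this forwarding is what appears to be missing today
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) return;
      yield decoder.decode(value);
    }
  } finally {
    // Release the connection even if the consumer stops iterating early.
    await reader.cancel().catch(() => {});
  }
}

Once the connection is actually closed, the server can stop generating; per the report above, that is what already happens when vscode is restarted.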

The extension is great and does a good job in general, but sometimes I have had to force-restart ollama/llama.cpp to un-freeze it (or wait painfully). It would be great to prioritize this, because I can't easily recommend the extension without also explaining this weird issue and how to work around it.

cc @sestinj

robertpiosik commented 3 months ago

Hello, is this currently being worked on?

I think this bug is severe, as it's not possible to interact with the model in any way until the previous generation finishes.