Open PlayerLegend opened 3 months ago
This is a pretty annoying issue. If a response is very long and you decide to cancel the generation, you are left with an unresponsive extension for some time. This can happen if you experiment with models and/or misconfigure a template, but most of the time a valid but long reply leads to the same issue.
I've looked into the extension code, and it looks like the cancel signal is not passed to the LLM fetchers at all, so this currently seems to be by design.
The extension is great and does a good job in general, but sometimes I have had to force-restart ollama/llama.cpp to un-freeze it (or wait painfully). It would be great to prioritize this, because I can't easily recommend the extension without explaining this weird issue and how to work around it.
cc @sestinj
Hello, is this currently being worked on?
I think this bug is severe, as it's not possible to interact with the model in any way until it finishes.
Before submitting your bug report
Relevant environment info
Description
When I hit cancel while the extension is streaming tokens, the extension stops relaying tokens to the screen, but it does not cancel the generation on the server. If I queue up another prompt, I then have to wait until the previous request finishes before the new one starts. I believe this is an issue with how the extension handles the connection to the llama.cpp server: restarting VS Code immediately stops the generation on the server, and other frontends I have used do not exhibit this behavior when stopping a generation. I would expect that cancelling the generation in the UI would also cancel it on the server. This would allow a subsequent generation to start right away and save resources. I am unsure if it is relevant, but here is my config:
To reproduce
No response
Log output
No response