Open PlayerLegend opened 3 months ago
This is a pretty annoying issue. If a response is very long and you decide to cancel the generation, you are left with an unresponsive extension for some time. This can happen if you experiment with models and/or misconfigure a template, but most of the time a valid but long reply leads to the same issue.
I've looked into the extension code, and it looks like the cancel signal is not passed to the LLM fetchers at all, so this currently seems to be by design.
The extension is great and does a good job in general, but sometimes I have had to force-restart ollama/llama.cpp to un-freeze it (or wait painfully). It would be great to prioritize this, because I can't easily recommend the extension without explaining this weird issue and how to work around it.
cc @sestinj
Hello, is this currently being worked on?
I think this bug is severe, as it's not possible to interact with the model in any way until it finishes.
Before submitting your bug report
Relevant environment info
Description
When I hit cancel while the extension is streaming tokens, the extension stops relaying tokens to the screen, but it does not cancel the generation on the server. If I queue up another prompt, I then have to wait until the previous request finishes before the new one starts. I believe this is an issue with how the extension handles the connection to the llama.cpp server: restarting VS Code immediately stops the generation on the server, and other frontends I have used do not exhibit this behavior when stopping a generation. I would expect that cancelling the generation in the UI would also cancel it on the server. This would allow a subsequent generation to start right away and save resources. I am unsure if it is relevant, but here is my config:
To reproduce
No response
Log output
No response