sbouchet opened this issue 1 week ago
+1
@sbouchet @msivasubramaniaan can you run `ollama list` and see which models are running?
@msivasubramaniaan which model are you using?
My intuition is that it's trying to load multiple models at once, or possibly some massive model was renamed. It's probably not a Continue bug since there's no way to specify memory or multiple models, but I'd be curious to get to the bottom of this.
Third here. Could Continue be sending some parameter to ollama that causes this failure?
@sbouchet I'd call out that granite isn't trained for fill-in-the-middle code completion and is unlikely to work well for that; it would be a solid chat model, though!
@RomneyDa the current granite-code (gen 2) models are trained with FIM. The new gen3 are not, so that'll be a future problem.
FTR, granite-code:3b and :8b run fine in Continue on Mac (>= M2), as ollama can use the GPU there. On Lenovos, ollama falls back to CPU, which requires more memory; that might explain the tab completion issue. However, there's no good reason why Chat would work from the CLI but not from the Continue UI.
Hey @gabe-l-hart, sorry to drag you in, but do you have any insights on the ollama/granite failure on tab completion when running on CPU?
> @sbouchet @msivasubramaniaan can you run `ollama list` and see which models are running?

Here is the output:

```
$ ollama list
NAME                       ID              SIZE      MODIFIED
granite-code:8b            36c3c3b9683b    4.6 GB    41 hours ago
granite-code:3b            becc94fe1876    2.0 GB    42 hours ago
nomic-embed-text:latest    0a109f422b47    274 MB    42 hours ago
```
Hi @fbricon and team! I don't have any direct insight on this since I've done my work on an M3, but I have a few thoughts/ideas. There are really only a few things that can be different between running through Continue and running through the `ollama` CLI: `ollama` has its own API under `/api` (`/api/chat`, `/api/completions`), which is slightly different from the OpenAI compatibility layer behind `/v1` (`/v1/chat/completions`, `/v1/completions`). The Ollama CLI uses the Ollama API directly; I'd have to dig to see what Continue is using.

One thing I see after a small bit of digging: it looks like we don't yet have `granite-code` in the supported FIM templates. Following the logic in `getTemplateForModel`, this means it's falling back to the `starcoder` template (here). This appears to be close to the FIM template used by Granite Code, but not quite identical (e.g. `<file_sep>`, `<fim_pad>`).
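For anyone trying to narrow this down, here's a minimal probe (not Continue code; the host, port, and model name are assumptions based on the setup above) that sends the same prompt to both surfaces, so endpoint-level differences can be ruled in or out:

```typescript
// Hypothetical probe, not part of Continue: compare Ollama's native API with its
// OpenAI-compatible layer using the same prompt. Assumes Ollama on localhost:11434
// and granite-code:8b pulled locally.
async function compareEndpoints(prompt: string): Promise<void> {
  // Native Ollama API (what the `ollama` CLI uses under the hood).
  const native = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "granite-code:8b", prompt, stream: false }),
  });
  console.log("/api/generate:", (await native.json()).response);

  // OpenAI compatibility layer behind /v1.
  const openai = await fetch("http://localhost:11434/v1/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "granite-code:8b", prompt, max_tokens: 64 }),
  });
  console.log("/v1/completions:", (await openai.json()).choices[0].text);
}

compareEndpoints("def fibonacci(n):").catch(console.error);
```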
@gabe-l-hart while granite v2 models can probably benefit from some FIM template tuning, v3 won't since they're not FIM-trained, as I was told.
Anyways the immediate issue with tab completion here is ollama erroring with "model requires more system memory" when running on CPU.
Ah, I missed the error in the screenshot. This points to prompt expansion being the problem. I haven't plumbed the depths of the autocomplete module yet, but given some of the function naming in there, I suspect the context is getting too big, causing the memory needed to grow significantly (i.e. additional context injection is the issue).
You could try to simulate this with `ollama run` by giving it a much larger blob of context and seeing if you get the same error.
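If it helps, here's a rough sketch of that experiment against the Ollama API rather than the CLI (the model name, port, and prompt size are assumptions; the point is just to inflate the context until the memory estimate blows up on a CPU-only host):

```typescript
// Hypothetical reproduction attempt: send an artificially large prompt and see
// whether the "model requires more system memory" error comes back from Ollama.
async function tryLargePrompt(): Promise<void> {
  const filler = "// filler line to inflate the autocomplete context\n".repeat(4000);
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "granite-code:8b",
      prompt: filler + "def add(a, b):",
      stream: false,
    }),
  });
  // If the theory holds, this should come back as an error mentioning system memory.
  console.log(res.status, await res.text());
}

tryLargePrompt().catch(console.error);
```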
fwiw, it is not only granite that is not working on latest Continue for me. It fails with starcoder2 and llama too; it's as if the result never gets back to the client.
I'm also having problems with some models on my Intel Arc, but I'm starting to believe it's an Intel/Ollama problem, not a Continue problem.
When I run `deepseek-coder:6.7b-base` via Docker I get no reply, as indicated here, while the CLI seems to work at least somewhat. However, when I run the ollama Arch package, using the CPU, I do get successful completions.
Most chat models work fine for me, but I've seen several other models that behave erratically. For example, with `deepseek-coder-v2` on GPU I just get `"""""""""""""""""""""""""""` as a response.
Same problem! I tried starcoder2 and deepseek-coder.
@qqshow are you using an Intel GPU?
Reported upstream: https://github.com/intel-analytics/ipex-llm/issues/12374
I'm also experiencing issues with Ollama and local models; they suddenly stop working, specifically with the autocomplete feature. After some investigation, I realized this problem started after updating from Ollama v0.3.14 to v0.4.1.
To test my theory, I downgraded Ollama to v0.3.14, and the autocomplete functionality returned to normal. I then tested with v0.4.0, and the issue reappeared. Based on this, I suspect that an incompatibility with autocomplete was introduced in version v0.4.0.
For anyone else experiencing this, you can try downgrading via the releases page: Ollama GitHub Releases. For now, I’ll stick with v0.3.14 as a temporary workaround.
That's an interesting find @alejandroqh. It may be worth posting an issue in their tracker since I know that `0.4.*` includes a major architectural overhaul to a new Go-based server for running the individual model processes. It's quite likely that this is related somehow.
There are likely multiple unrelated issues here at play:
Personally, 0.4 works for me when running on CPU, but running a 0.3 Docker image that supports my Intel GPU does not.
So it'd be useful if people who are having trouble would mention their setup (ollama version, CPU vs. GPU).
And for debugging, it could be useful to run an older version or force CPU to rule out GPU bugs.
I just set `"contextLength": 12768` in `config.json` as a hard fix and tried to run the Continue chat on my Intel-based laptop, and it is working. Earlier it was stuck.
ollama version is 0.3.14, model granite-code:8b.
With @msivasubramaniaan's fix, my setup started to work as expected. I can get chat and completion working, with no other changes needed.
I tested "contextLength": 12768
in config.json
suggested by @msivasubramaniaan in ollama v0.4.1 but did not work for me. Only is working on ollama v0.3.14
> I'm also experiencing issues with Ollama and local models; they suddenly stop working, specifically with the autocomplete feature.
I can confirm that completions suddenly stop working with ollama v0.4.1 and Continue.dev v0.9.228 (pre-release) in VSCode v1.95.2, using the model `deepseek-coder-v2:16b`.
I haven't yet found the exact steps to reproduce the issue, but in my testing continue.dev and/or ollama stop responding after a few completion requests (around 2 to 10) have been sent to ollama in short succession (for example, when autocomplete sends repeated completion requests to the ollama API as you are typing).
Once the issue occurs, continue.dev blocks the ollama API. Other applications can no longer receive completions from the API; a separate curl request to ollama receives no response until the user closes VSCode. Once VSCode has been exited, ollama resumes responding to requests (including those requests from other applications that were on hold while the issue occurred).
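To make concrete what I mean by "no response" (a sketch only; the port is Ollama's default and the model name matches my setup), a minimal generate request with a hard timeout behaves like this while the issue is active:

```typescript
// Rough liveness check during the hang: a tiny generate request with a 5s timeout.
// If even this never returns, the block is on the Ollama side, not just in the IDE.
async function pingOllama(): Promise<void> {
  try {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "deepseek-coder-v2:16b", prompt: "hi", stream: false }),
      signal: AbortSignal.timeout(5000), // give up after 5 seconds
    });
    console.log("ollama answered:", res.status);
  } catch (err) {
    console.log("no answer within 5s:", err);
  }
}

pingOllama().catch(console.error);
```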
I checked the console logs in VSCode, but the logs remain blank while the issue occurs. The ollama logs (`journalctl -u ollama --no-pager`) shown below do not indicate a readily apparent problem either.
In the logs below, you can see how continue.dev sends three requests to the ollama API. The first two requests work correctly and receive a response within a few seconds, but the third request never receives a response. Notice the timestamp delay of 1 minute when I close VSCode. During that non-responsive minute, continue.dev just spins its wheel in the bottom right corner of the IDE.
FIRST REQUEST SENT - WORKS AS INTENDED

```
Nov 13 06:11:07 taxtux ollama[7358]: time=2024-11-13T06:11:07.941-08:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046
Nov 13 06:11:07 taxtux ollama[7358]: time=2024-11-13T06:11:07.941-08:00 level=DEBUG source=routes.go:270 msg="generate request" images=0 prompt="some long prompt"
Nov 13 06:11:07 taxtux ollama[7358]: time=2024-11-13T06:11:07.948-08:00 level=DEBUG source=cache.go:99 msg="loading cache slot" id=0 cache=885 prompt=878 used=874 remaining=4
```

SECOND REQUEST SENT - I THINK THIS REQUEST RECEIVED A RESPONSE

```
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.865-08:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.865-08:00 level=DEBUG source=routes.go:270 msg="generate request" images=0 prompt="another long prompt"
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.896-08:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.896-08:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 refCount=1
Nov 13 06:11:12 taxtux ollama[7358]: [GIN] 2024/11/13 - 06:11:12 | 200 | 4.96354538s | 127.0.0.1 | POST "/api/generate"
```

THIRD REQUEST SENT - REMAINS UNANSWERED

```
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.921-08:00 level=DEBUG source=cache.go:99 msg="loading cache slot" id=0 cache=1075 prompt=878 used=874 remaining=4
Nov 13 06:11:14 taxtux ollama[7358]: time=2024-11-13T06:11:14.052-08:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046
Nov 13 06:11:14 taxtux ollama[7358]: time=2024-11-13T06:11:14.052-08:00 level=DEBUG source=routes.go:270 msg="generate request" images=0 prompt="a third long prompt"
Nov 13 06:11:14 taxtux ollama[7358]: time=2024-11-13T06:11:14.091-08:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 13 06:11:14 taxtux ollama[7358]: time=2024-11-13T06:11:14.091-08:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 refCount=1
Nov 13 06:11:14 taxtux ollama[7358]: [GIN] 2024/11/13 - 06:11:14 | 200 | 1.237498363s | 127.0.0.1 | POST "/api/generate"
```

THE NEXT LOG ENTRY OCCURS 1 MINUTE LATER BECAUSE I CLOSE VSCODE HERE

```
Nov 13 06:12:14 taxtux ollama[7358]: time=2024-11-13T06:12:14.979-08:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 13 06:12:14 taxtux ollama[7358]: time=2024-11-13T06:12:14.979-08:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 duration=30m0s
Nov 13 06:12:14 taxtux ollama[7358]: time=2024-11-13T06:12:14.979-08:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 refCount=0
Nov 13 06:12:14 taxtux ollama[7358]: [GIN] 2024/11/13 - 06:12:14 | 200 | 1m0s | 127.0.0.1 | POST "/api/generate"
```
I am not sure if this is an issue with continue.dev or with ollama, but I am posting this issue here for now because I was unable to reproduce the issue outside of using VSCode. Sending repeated curl requests to the ollama API did not result in any issues.
This issue might be separate from the issue described by the OP. I might file a separate issue ticket if that is helpful.
Had this issue as well. I've investigated further and it seems that ollama makes a bad decision about a parameter named "numParallel", which has 4 as its default value:
https://github.com/ollama/ollama/blob/d7eb05b9361febead29a74e71ddffc2ebeff5302/server/sched.go#L59
It's not clear in which scenarios the default gets used, but it seems to depend on the combination of CPU and GPU available in the host. This causes ollama to multiply the needed memory (RAM, in the case of CPU) by 4, which triggers this issue.
The workaround that I've found is to set OLLAMA_NUM_PARALLEL to 1 when running ollama. For example:
`OLLAMA_NUM_PARALLEL=1 ollama serve`
This solves the issue and makes Continue work again with ollama (in my case).
@RomneyDa I'd suggest adding a temporary error message enhancer that can help Continue's users in this case, since it's not a Continue bug. Until ollama fixes this, if Continue detects the "model requires more system memory..." error message, it could append something like "Try adding OLLAMA_NUM_PARALLEL=1 as an env var to your ollama instance".
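Something along these lines is what I have in mind (purely illustrative; the function name and the place it would hook in are made up, not Continue's actual API):

```typescript
// Hypothetical sketch of the suggested error "enhancer": when Ollama reports the
// memory error, append a hint about the parallel-request workaround.
function enhanceOllamaError(message: string): string {
  if (message.includes("model requires more system memory")) {
    return (
      message +
      "\n\nHint: this can happen when Ollama reserves memory for several parallel " +
      "requests. Try starting Ollama with OLLAMA_NUM_PARALLEL=1."
    );
  }
  return message; // leave all other errors untouched
}
```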
EDIT: this doesn't completely work around the issue. It seems that ollama still estimates twice the needed memory, even after setting OLLAMA_NUM_PARALLEL=1. The workaround improves the situation but doesn't fix it. I've tried to follow ollama's implementation, but it's way too confusing for me to care right now. They seem to count the number of "gpus" in my system, but my CPU with AVX2 gets counted as a GPU, with a lot of corner cases and if-then-elses. It's messy stuff.
I've looked into this and found a potential fix, although it's not ready for a PR yet: https://github.com/continuedev/continue/compare/main...fbricon:cancel-completions
But basically, my approach is to use VS Code's cancellation token to abort the pending ollama requests that are issued as you type. AFAICT, this is working way better now; it makes the completion feel even snappier, I would say.
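To illustrate the idea (a simplified sketch using a plain AbortController rather than the actual patch; the endpoint and model name are placeholders):

```typescript
// Simplified illustration of the approach: keep at most one in-flight completion
// request and abort the previous one whenever a new keystroke triggers another.
let inflight: AbortController | undefined;

async function requestCompletion(prompt: string): Promise<string | undefined> {
  inflight?.abort(); // cancel the stale request so the server can drop it
  const controller = new AbortController();
  inflight = controller;
  try {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "granite-code:8b", prompt, stream: false }),
      signal: controller.signal,
    });
    return (await res.json()).response;
  } catch (err: any) {
    if (err.name === "AbortError") return undefined; // superseded by a newer request
    throw err;
  }
}
```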
I've uploaded a build with my fix @ https://github.com/fbricon/continue/releases/tag/cancel-completions-build. If you could check/confirm it improves ollama 0.4.x support, I'd appreciate it. Thanks!
Okay, I can confirm that on Intel the context length workaround worked! How did you figure that out? Looking at ollama, granite-code has a context length of 128000.
My config, FTR:

```json
"tabAutocompleteModel": {
  "title": "granite code 8b",
  "provider": "ollama",
  "model": "granite-code:8b",
  "contextLength": 12768
},
```
I'm also seeing that when a model doesn't complete, the whole system gets clogged, so the 0.4 fix seems very exciting too, except that we Intel people seem to be stuck on 0.3.
Before submitting your bug report
Relevant environment info
Description
Trying to use IBM Granite on a Lenovo ThinkPad with 64 GB RAM, plenty of disk space, and an Intel GPU.
The model runs fine with the ollama CLI, hence this report.
To reproduce
Log output