continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

Chat/Completion not working with ollama and local models #2838

Open sbouchet opened 1 week ago

sbouchet commented 1 week ago

Relevant environment info

- OS: Fedora 40
- Continue version: v0.9.225 (pre-release)
- IDE version: VSCode 1.95.1
- Model: granite-8b
- config.json:

{
    "models": [
        {
            "model": "granite-code:8b",
            "provider": "ollama",
            "contextLength": 128000,
            "completionOptions": {
                "maxTokens": 4000,
                "temperature": 0.1,
                "topP": 0.9,
                "topK": 40,
                "presencePenalty": 0,
                "frequencyPenalty": 0.1
            },
            "systemMessage": "You are Granite Code, an AI language model developed by IBM. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. You always respond to greetings (for example, hi, hello, g'day, morning, afternoon, evening, night, what's up, nice to meet you, sup, etc) with \"Hello! I am Granite Code, created by IBM. How can I help you today?\". Please do not say anything else and do not start a conversation.",
            "title": "granite-code:8b"
        }
    ],
    "tabAutocompleteModel": {
        "model": "granite-code:8b",
        "provider": "ollama",
        "contextLength": 128000,
        "completionOptions": {
            "maxTokens": 4000,
            "temperature": 0.1,
            "topP": 0.9,
            "topK": 40,
            "presencePenalty": 0,
            "frequencyPenalty": 0.1
        },
        "systemMessage": "You are Granite Code, an AI language model developed by IBM. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. You always respond to greetings (for example, hi, hello, g'day, morning, afternoon, evening, night, what's up, nice to meet you, sup, etc) with \"Hello! I am Granite Code, created by IBM. How can I help you today?\". Please do not say anything else and do not start a conversation.",
        "title": "granite-code:8b"
    },
    "customCommands": [
        {
            "name": "test",
            "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
            "description": "Write unit tests for highlighted code"
        }
    ],
    "contextProviders": [
        {
            "name": "diff",
            "params": {}
        },
        {
            "name": "folder",
            "params": {}
        },
        {
            "name": "codebase",
            "params": {}
        }
    ],
    "slashCommands": [
        {
            "name": "edit",
            "description": "Edit selected code"
        },
        {
            "name": "comment",
            "description": "Write comments for the selected code"
        },
        {
            "name": "share",
            "description": "Export the current chat session to markdown"
        },
        {
            "name": "commit",
            "description": "Generate a git commit message"
        }
    ],
    "embeddingsProvider": {
        "provider": "ollama",
        "model": "nomic-embed-text:latest",
        "title": "nomic-embed-text:latest"
    }
}

Description

Trying to use IBM Granite on a Lenovo ThinkPad with 64 GB RAM, plenty of disk space, and an Intel GPU.

The model runs fine with the ollama CLI, hence this report.

To reproduce

Screenshot from 2024-11-07 11-25-41

Log output

Debug LLM prompt logs:

#### Prompt #####

            self.csv_df = pd.concat([self.csv_df, df], ignore_index=True)
            print('[combine]: category [%s] done' % category)

        self.csv_df.sort_values(by = ['ST'], inplace = True, ignore_index = True)
        print('[sort]: done')

        # print(self.csv_df.head())
        # print(self.csv_df.info())

    def tagging(self):
        event_list = list(self.csv_df['Event'])
        tag_list = [[] for i in range(len(event_list))]
        # print(len(event_list), event_list[:5])

        for i in range(len(event_list)):
            for keywords, tag in self.tag_dict.items():
                if isinstance(keywords, str):
                    if event_list[i].find(keywords) != -1:
                        tag_list[i].append(tag)
                elif sum([(event_list[i].find(k) != -1) for k in keywords]):
                    tag_list[i].append(tag)

        tags = [' '.join(tag_list[i]) for i in range(len(event_list))]
        self.csv_df['Tag'] = tags
        print('[tagging]: done')

    def output(self, output_file = 'output.csv', start_date = '2000-01-01', end_date = '2100-01-01'):
        print('[output]: creating csv from [%s]' % start_date + 'to [%s].' % end_date)
        <FIM>
        bound = pd.to_datetime(pd.Series([start_date, end_date])).dt.tz_localize(self.tz)
        output_df = self.csv_df[(self.csv_df['ST'] >= bound[0]) & (self.csv_df['ET'] < bound[1])]
==========================================================================
==========================================================================
##### Completion options #####
{
  "contextLength": 128000,
  "maxTokens": 4000,
  "temperature": 0.1,
  "topP": 0.9,
  "topK": 40,
  "presencePenalty": 0,
  "frequencyPenalty": 0.1,
  "model": "granite-code:8b",
  "stop": [
    "System:",
    "Question:",
    "Answer:"
  ]
}

##### Request options #####
{}

##### Prompt #####
<system>
You are Granite Code, an AI language model developed by IBM. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. You always respond to greetings (for example, hi, hello, g'day, morning, afternoon, evening, night, what's up, nice to meet you, sup, etc) with "Hello! I am Granite Code, created by IBM. How can I help you today?". Please do not say anything else and do not start a conversation.

<user>
"""main.py (107-107)
        print('[output]: done')
"""
please explain

msivasubramaniaan commented 1 week ago

+1

RomneyDa commented 1 week ago

@sbouchet @msivasubramaniaan can you run ollama list and see which models are running?

@msivasubramaniaan which model are you using?

My intuition is it's trying to load multiple models at once. Or possibly some massive model was renamed or something. It's probably not a Continue bug, since there's no way to specify memory or multiple models, but I'd be curious to get to the bottom of this.

RomneyDa commented 1 week ago

Third thought: could Continue be sending some param to ollama that causes this failure?

RomneyDa commented 1 week ago

@sbouchet I'd call out that granite isn't trained for fill-in-the-middle code completion and is unlikely to work well for that; it would be a solid chat model, though!

fbricon commented 1 week ago

@RomneyDa the current granite-code (gen 2) models are trained with FIM. The new gen 3 models are not, so that'll be a future problem.

fbricon commented 1 week ago

FTR, granite-code:3b and :8b run fine in Continue on Mac (>= M2), as ollama can use the GPUs there. On Lenovos, ollama falls back to CPU, which requires more memory, which might explain the tab completion issue. However, there's no good reason why Chat would work from the CLI but not from the Continue UI.

fbricon commented 1 week ago

Hey @gabe-l-hart, sorry to drag you in, but do you have any insights on the ollama/granite failure on tab completion when running on CPU?

sbouchet commented 1 week ago

@sbouchet @msivasubramaniaan can you run ollama list and see which models are running?

here is the output:

$ ollama list
NAME                       ID              SIZE      MODIFIED     
granite-code:8b            36c3c3b9683b    4.6 GB    41 hours ago    
granite-code:3b            becc94fe1876    2.0 GB    42 hours ago    
nomic-embed-text:latest    0a109f422b47    274 MB    42 hours ago    

gabe-l-hart commented 1 week ago

Hi @fbricon and team! I don't have any direct insight on this since I've done my work on an M3, but I have a few thoughts/ideas. There are really only a few things that can be different between running through Continue and running through the ollama CLI:

  1. The eventual template-expanded prompt that the model has to process is different
  2. The client-side connection to the server is different
    • This is actually my strongest hunch: There's probably some kind of timeout logic on the client side in Continue that's causing it to fail on CPU for long context completions
  3. The endpoint that the client is using is different and the codepath in ollama has some difference
    • Ollama implements its own API behind /api (/api/chat, /api/generate), which is slightly different from its OpenAI compatibility layer behind /v1 (/v1/chat/completions, /v1/completions). The Ollama CLI uses the Ollama API directly. I'd have to dig to see what Continue is using; the sketch below probes both paths with the same prompt.
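
For reference, a minimal sketch (not Continue's code; it assumes ollama on the default localhost:11434 with granite-code:8b already pulled) that sends the same prompt through both code paths, to see whether only one of them misbehaves:

// Probe both Ollama code paths with the same prompt. Assumes ollama is
// listening on localhost:11434 and granite-code:8b is already pulled;
// adjust the model name as needed.
const OLLAMA = "http://localhost:11434";
const MODEL = "granite-code:8b";

async function nativeGenerate(prompt: string): Promise<string> {
  // Ollama's own API, which the CLI also goes through.
  const res = await fetch(`${OLLAMA}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: MODEL, prompt, stream: false }),
  });
  const data = (await res.json()) as { response: string };
  return data.response;
}

async function openaiChat(prompt: string): Promise<string> {
  // OpenAI compatibility layer.
  const res = await fetch(`${OLLAMA}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: MODEL,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0].message.content;
}

(async () => {
  console.log("native:", await nativeGenerate("Say hello in one word."));
  console.log("openai:", await openaiChat("Say hello in one word."));
})();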

gabe-l-hart commented 1 week ago

One thing I see after a small bit of digging: It looks like we don't yet have granite-code in the supported FIM templates. Following the logic in getTemplateForModel, this means it's falling back to the starcoder template (here). This appears to be close to the FIM template used by Granite Code, but not quite identical (e.g. <file_sep>, <fim_pad>).
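
For illustration only (this is not Continue's template code, and the exact sentinel tokens Granite Code expects should be checked against its model card), the starcoder-style fallback assembles a prompt shaped like this:

// Shape of a starcoder-style FIM prompt; the sentinel token names are the
// point here, not the exact string Continue builds. A model trained on
// different sentinels would see these as ordinary text rather than FIM markers.
function starcoderFimPrompt(prefix: string, suffix: string): string {
  return `<fim_prefix>${prefix}<fim_suffix>${suffix}<fim_middle>`;
}

console.log(
  starcoderFimPrompt(
    "    def output(self, output_file='output.csv'):\n        ",
    "\n        output_df.to_csv(output_file)",
  ),
);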

fbricon commented 1 week ago

@gabe-l-hart while granite v2 models can probably benefit from some FIM template tuning, v3 won't since they're not FIM-trained, as I was told.

Anyway, the immediate issue with tab completion here is ollama erroring with "model requires more system memory" when running on CPU.

gabe-l-hart commented 1 week ago

Ah, I missed the error in the screenshot. This points to the prompt expansion being the problem. I haven't plumbed the depths of the autocomplete module yet, but given some of the function naming in there, I suspect the context is getting too big, causing the memory needed to grow significantly (i.e. additional context injection is the issue).

You could try to simulate this with ollama run by giving it a much larger blob of context and seeing if you get the same error.
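
Alternatively, a rough sketch of doing the same thing over HTTP, bypassing both Continue and the CLI (the padding size, model name, and prompt are arbitrary illustrations, and whether it actually fails depends on your machine):

// Send an artificially large prompt straight to ollama and check whether the
// same "model requires more system memory" error comes back.
const OLLAMA = "http://localhost:11434";

async function sendBigPrompt(approxChars: number): Promise<void> {
  const padding = "# padding line\n".repeat(Math.ceil(approxChars / 15));
  const res = await fetch(`${OLLAMA}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "granite-code:8b",
      prompt: padding + "\n# What does this file do?",
      stream: false,
    }),
  });
  // On failure ollama responds with a non-200 status and an error body.
  console.log(res.status, await res.text());
}

sendBigPrompt(200_000);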

maxandersen commented 1 week ago

fwiw, it is not only granite that is not working on latest Continue for me. It fails with starcoder2 and llama too; it's as if the result never gets back to the client.

pepijndevos commented 6 days ago

I'm also having problems with some models on my Intel Arc, but I'm starting to believe it's an Intel/Ollama problem, not a Continue problem.

When I run deepseek-coder:6.7b-base via Docker I get no reply as indicated here, while the CLI seems to work at least somewhat. However, when I run the ollama arch package, using the CPU, I do get successful completions.

Most chat models work fine for me, but I've seen several other models that behave erratically. For example, with deepseek-coder-v2 on GPU I just get """"""""""""""""""""""""""" as a response.

qqshow commented 5 days ago

Same problem! I tried starcoder2 and deepseek-coder.

pepijndevos commented 5 days ago

@qqshow are you using an Intel GPU?

Reported upstream: https://github.com/intel-analytics/ipex-llm/issues/12374

alejandroqh commented 5 days ago

I'm also experiencing issues with Ollama and local models; they suddenly stop working, specifically with the autocomplete feature. After some investigation, I realized this problem started after updating from Ollama v0.3.14 to v0.4.1.

To test my theory, I downgraded Ollama to v0.3.14, and the autocomplete functionality returned to normal. I then tested with v0.4.0, and the issue reappeared. Based on this, I suspect that an incompatibility with autocomplete was introduced in version v0.4.0.

For anyone else experiencing this, you can try downgrading via the releases page: Ollama GitHub Releases. For now, I’ll stick with v0.3.14 as a temporary workaround.

gabe-l-hart commented 5 days ago

That's an interesting find @alejandroqh. It may be worth posting an issue in their tracker since I know that 0.4.* includes a major architectural overhaul to a new go-based server for running the individual model processes. It's quite likely that this is related somehow.

pepijndevos commented 5 days ago

There are likely multiple unrelated issues at play here.

Personally, 0.4 works for me when running on CPU, but running a 0.3 Docker image that supports my Intel GPU does not work.

So it'd be useful if people who are having trouble would mention their setup (ollama version, and whether they're running on CPU or GPU).

And for debugging it could be useful to run an older version or force it to use the CPU to eliminate GPU bugs.

msivasubramaniaan commented 5 days ago

I just set "contextLength": 12768 in config.json as a hard fix and tried to run Continue chat on my Intel-based laptop, and it is working. Earlier it was stuck.

ollama version is 0.3.14, model granite-code:8b.

sbouchet commented 4 days ago

Screenshot from 2024-11-12 11-22-00

With @msivasubramaniaan's fix, my setup started to work as expected. I can get chat and completion working, with no other changes needed.

alejandroqh commented 4 days ago

I tested "contextLength": 12768 in config.json as suggested by @msivasubramaniaan on ollama v0.4.1, but it did not work for me. It only works on ollama v0.3.14.

kevin-pw commented 3 days ago

I'm also experiencing issues with Ollama and local models; they suddenly stop working, specifically with the autocomplete feature.

I can confirm that completions suddenly stop working with ollama v0.4.1 and Continue.dev v0.9.228 (pre-release) in VSCode v1.95.2 using the model deepseek-coder-v2:16b.

I haven't yet found the exact steps to reproduce the issue, but in my testing continue.dev and/or ollama stop responding after a few completion requests (around 2 to 10) have been sent to ollama in short succession (for example, when autocomplete sends repeated completion requests to the ollama API as you are typing).

Once the issue occurs, continue.dev blocks the ollama API. Other applications can no longer receive completions from the API; that means a separate curl request to ollama will receive no response until the user closes VSCode. Once VSCode has been exited, ollama resumes responding to requests (including those requests from other applications that were on hold while the issue occurred).

I checked the console logs in VSCode, but the logs remain blank while the issue occurs. The ollama logs (journalctl -u ollama --no-pager) shown below do not indicate a readily apparent problem either.

In the logs below, you can see how continue.dev sends three requests to the ollama API. The first two requests work correctly and receive a response within a few seconds, but the third request never receives a response. Notice the timestamp delay of 1 minute when I close VSCode. During that non-responsive minute, continue.dev just spins its wheel in the bottom right corner of the IDE.

FIRST REQUEST SENT - WORKS AS INTENDED
Nov 13 06:11:07 taxtux ollama[7358]: time=2024-11-13T06:11:07.941-08:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046
Nov 13 06:11:07 taxtux ollama[7358]: time=2024-11-13T06:11:07.941-08:00 level=DEBUG source=routes.go:270 msg="generate request" images=0 prompt="some long prompt"
Nov 13 06:11:07 taxtux ollama[7358]: time=2024-11-13T06:11:07.948-08:00 level=DEBUG source=cache.go:99 msg="loading cache slot" id=0 cache=885 prompt=878 used=874 remaining=4
SECOND REQUEST SENT - I THINK THIS REQUEST RECEIVED A RESPONSE
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.865-08:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.865-08:00 level=DEBUG source=routes.go:270 msg="generate request" images=0 prompt="another long prompt"
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.896-08:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.896-08:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 refCount=1
Nov 13 06:11:12 taxtux ollama[7358]: [GIN] 2024/11/13 - 06:11:12 | 200 |   4.96354538s |       127.0.0.1 | POST     "/api/generate"
THIRD REQUEST SENT - REMAINS UNANSWERED 
Nov 13 06:11:12 taxtux ollama[7358]: time=2024-11-13T06:11:12.921-08:00 level=DEBUG source=cache.go:99 msg="loading cache slot" id=0 cache=1075 prompt=878 used=874 remaining=4
Nov 13 06:11:14 taxtux ollama[7358]: time=2024-11-13T06:11:14.052-08:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046
Nov 13 06:11:14 taxtux ollama[7358]: time=2024-11-13T06:11:14.052-08:00 level=DEBUG source=routes.go:270 msg="generate request" images=0 prompt="a third long prompt"
Nov 13 06:11:14 taxtux ollama[7358]: time=2024-11-13T06:11:14.091-08:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 13 06:11:14 taxtux ollama[7358]: time=2024-11-13T06:11:14.091-08:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 refCount=1
Nov 13 06:11:14 taxtux ollama[7358]: [GIN] 2024/11/13 - 06:11:14 | 200 |  1.237498363s |       127.0.0.1 | POST     "/api/generate"
THE NEXT LOG ENTRY OCCURS 1 MINUTE LATER BECAUSE HERE I CLOSE VSCODE
Nov 13 06:12:14 taxtux ollama[7358]: time=2024-11-13T06:12:14.979-08:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 13 06:12:14 taxtux ollama[7358]: time=2024-11-13T06:12:14.979-08:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 duration=30m0s
Nov 13 06:12:14 taxtux ollama[7358]: time=2024-11-13T06:12:14.979-08:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 refCount=0
Nov 13 06:12:14 taxtux ollama[7358]: [GIN] 2024/11/13 - 06:12:14 | 200 |          1m0s |       127.0.0.1 | POST     "/api/generate"

I am not sure if this is an issue with continue.dev or with ollama, but I am posting this issue here for now because I was unable to reproduce the issue outside of using VSCode. Sending repeated curl requests to the ollama API did not result in any issues.

This issue might be separate from the issue described by the OP. I might file a separate issue ticket if that is helpful.

ivarec commented 3 days ago

Had this issue as well. I've investigated further and it seems that ollama makes a bad decision about a parameter named "numParallel", which has 4 as its default value:

https://github.com/ollama/ollama/blob/d7eb05b9361febead29a74e71ddffc2ebeff5302/server/sched.go#L59

It's not clear in which scenarios the default gets used, but it seems to depend on the combination of CPU and GPU available in the host. This appears to cause ollama to multiply the needed memory (RAM, in the CPU case) by 4, which triggers this issue.
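
As a back-of-envelope illustration only (the per-token KV-cache cost below is a made-up placeholder, not a measured number for granite-code:8b), a 128000-token context multiplied by 4 parallel slots balloons the requested cache:

// Illustrative arithmetic only: the scheduler sizes the KV cache for roughly
// numCtx * numParallel tokens, so a large configured context gets multiplied
// again by the parallel slot count. bytesPerToken is a placeholder; the real
// value depends on the model's layers, heads, and quantization.
const numCtx = 128_000;            // contextLength from config.json
const numParallel = 4;             // ollama's default in this code path
const bytesPerToken = 128 * 1024;  // placeholder KV-cache cost per token

const kvCacheGiB = (numCtx * numParallel * bytesPerToken) / 1024 ** 3;
console.log(`~${kvCacheGiB.toFixed(0)} GiB of KV cache requested`);
// Dropping contextLength to 12768 shrinks this estimate by roughly 10x,
// which is consistent with the workaround reported above.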

The workaround that I've found is to set OLLAMA_NUM_PARALLEL to 1 when running ollama. For example:

OLLAMA_NUM_PARALLEL=1 ollama serve

This solves the issue and makes Continue work again with ollama (in my case).

@RomneyDa I'd suggest adding a temporary error message enhancer that can help Continue's users in this case, since it's not a Continue bug. Until ollama fixes this, if Continue detects the "model requires more system memory..." error message, it could append something like "Try adding OLLAMA_NUM_PARALLEL=1 as an env var to your ollama instance".
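
For illustration only, the enhancer could be as simple as a string check; the function name and hint text below are made up, not taken from Continue's codebase:

// Hypothetical sketch of the suggested error-message enhancer; Continue's
// real error-handling path may look quite different.
function enhanceOllamaError(message: string): string {
  if (message.includes("model requires more system memory")) {
    return (
      message +
      "\nHint: try starting ollama with OLLAMA_NUM_PARALLEL=1 to reduce its memory estimate."
    );
  }
  return message;
}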

EDIT: this doesn't completely work around the issue. It seems that ollama still estimates twice the needed memory, even after setting OLLAMA_NUM_PARALLEL=1. The workaround improves the situation, but doesn't fix it. I've tried to follow ollama's implementation, but it's way too confusing for me to care right now. They seem to count the number of "gpus" in my system, but my CPU with AVX2 gets counted as a GPU, with a lot of corner cases and if-then-elses. It's messy stuff.

fbricon commented 3 days ago

I've looked into this and found a potential fix, although it's not ready for a PR yet: https://github.com/continuedev/continue/compare/main...fbricon:cancel-completions

But basically, my approach is to use VS Code's cancellation token to abort the pending ollama requests that are issued as you type. AFAICT, this is working way better now; it makes completion feel even snappier, I would say.
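
Not the actual patch, just the general shape of the idea (in the real extension VS Code's CancellationToken would drive the abort; this sketch uses a bare AbortController and Node's fetch):

// Keep one AbortController per in-flight completion request and abort it as
// soon as a newer keystroke supersedes it, so stale requests never pile up
// on ollama.
let inflight: AbortController | undefined;

async function requestCompletion(prompt: string): Promise<string | undefined> {
  inflight?.abort();             // cancel the previous pending request
  const controller = new AbortController();
  inflight = controller;
  try {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "granite-code:8b", prompt, stream: false }),
      signal: controller.signal, // tie the request to the controller
    });
    const data = (await res.json()) as { response: string };
    return data.response;
  } catch (err) {
    if ((err as Error).name === "AbortError") return undefined; // superseded
    throw err;
  }
}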

fbricon commented 2 days ago

I've uploaded a build with my fix @ https://github.com/fbricon/continue/releases/tag/cancel-completions-build. If you guys could check/confirm it improves ollama 0.4.x support, I'd appreciate it. Thanks!

pepijndevos commented 2 days ago

Okay, I can confirm that on Intel the context length workaround worked! How did you figure that out? Looking at ollama, granite-code has a context length of 128000.

My config ftr:

  "tabAutocompleteModel": {
    "title": "granite code 8b",
    "provider": "ollama",
    "model": "granite-code:8b",
    "contextLength": 12768
  },

I am also having the problem where, if a model doesn't complete, the whole system gets clogged, so the 0.4 fix also seems very exciting; except it seems we Intel people are stuck on 0.3.