Aider-AI / aider

aider is AI pair programming in your terminal
https://aider.chat/
Apache License 2.0

Unable to use diff editing with local instance of qwen2.5-coder-32b-instruct #2371

Open kbsphere opened 1 week ago

kbsphere commented 1 week ago

Issue

I am receiving the following error when attempting to use my local Ollama instance of qwen2.5-coder-32b-instruct (Ubuntu 24.04). When starting aider and pointing to my ollama instance with the following command:

aider --model ollama/qwen2.5-coder-32b-instruct

I get the following errors:

OllamaError: Error getting model info for ollama/qwen2.5-coder-32b-instruct. Set Ollama API Base via `OLLAMA_API_BASE` environment variable. Error: Client error '404 Not Found' for url 'http://127.0.0.1:11434/api/show'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
Warning for ollama/qwen2.5-coder-32b-instruct: Unknown context window size and costs, using sane defaults.
Did you mean one of these?
- ollama/qwen2.5-coder-32b-instruct
You can skip this check with --no-show-model-warnings

The export for OLLAMA_API_BASE is also correctly configured, despite what the aider error message suggests.

My ollama instance is correctly configured as far as I can tell, as I can query my ollama API with the following curl command to obtain info about the hosted model:

curl http://localhost:11434/api/show -d '{ "name": "qwen2.5-coder-32b-instruct" }'

This produces:

{"modelfile":"# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this, replace FROM with:\n# FROM qwen2.5-coder-32b-instruct:latest\n\nFROM /usr/share/ollama/.ollama/models/blobs/sha256-5ebb70cfdbc4780a34142393418f9f570a82b0650ee4f9031c2f588a4379067c\nTEMPLATE {{ .Prompt }}\nPARAMETER repeat_penalty 1.2\nPARAMETER temperature 0.7\nPARAMETER top_k 50\nPARAMETER top_p 0.9\n","parameters":"repeat_penalty 1.2\ntemperature 0.7\ntop_k 50\ntop_p 0.9","template":"{{ .Prompt }}" etc etc.

It seems that Aider is unable to properly detect this model when loaded with ollama, even though it appears aware of the correct path to ollama (note the "Warning for ollama/qwen2.5-coder-32b-instruct"). However, the model is both detected properly and works with diff edits when loaded through openrouter's implementation:

aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct --browser

I attempted to create a custom json to point to my local ollama instance with test settings, but this produces the exact same warnings, and aider does not seem to recognize the json file at all:

$ cat .aider.model.metadata.json
{
    "ollama/qwen2.5-coder-32b-instruct": {
        "max_tokens": 128000,
        "max_input_tokens": 128000,
        "max_output_tokens": 8000,
        "input_cost_per_token": 0.000000,
        "output_cost_per_token": 0.000000,
        "litellm_provider": "ollama",
        "mode": "chat",
        "edit_format": "diff"
    }
}
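(Passing the metadata file explicitly should rule out a path problem; a possible invocation, using the --model-metadata-file flag that appears later in this thread, would be:)

aider --model ollama/qwen2.5-coder-32b-instruct --model-metadata-file .aider.model.metadata.json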

Am I doing something wrong here? At best, the errors I am receiving are confusing and contradictory: "Warning for ollama/qwen2.5-coder-32b-instruct: Unknown context window size and costs, using sane defaults. Did you mean one of these?"

This is the output of "ollama list" on my system:

$ ollama list
NAME                                 ID              SIZE     MODIFIED
qwen2.5-coder-32b-instruct:latest    3849b64e2cbf    22 GB    About an hour ago

I assume diff edits are not working due to aider using "sane defaults", as I can't think of anything else on my end that would cause this difference in behavior with edits (the local ollama instance refuses diff edits, while openrouter's works correctly and changes files with aider's diff edit feature).

Version and model info

$ aider --version
aider 0.63.0
ollama/qwen2.5-coder-32b-instruct:latest

paul-gauthier commented 1 week ago

Thanks for trying aider and filing this issue.

The lack of context window is due to a litellm bug which should be fixed soon.

You can safely ignore this warning, as explained at the docs URL it provided.

https://aider.chat/docs/troubleshooting/warnings.html

kbsphere commented 1 week ago

Thank you for the prompt response, Paul! I was wondering, would these warnings disable the diff edit feature locally for any reason? For my test example, I am trying to add an HTML comment to a test webpage. Using aider with openrouter immediately tries a diff edit, whereas my local instance absolutely refuses. This is even after directly specifying --edit-format diff in the aider launch command:

aider --model ollama/qwen2.5-coder-32b-instruct --edit-format diff --browser

No matter the prompt, aider simply refuses to use diff edits with my local model, and I am not sure what would be causing the difference.

paul-gauthier commented 1 week ago

In what way are you seeing diff edit format being "disabled"? It would help if you could share all the "announce" lines that print when aider launches.

You can pick a specific edit format with --edit-format diff.

kbsphere commented 1 week ago

Sure. For example, using local ollama, this is the output on first start:

aider --model ollama/qwen2.5-coder-32b-instruct --browser
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

CONTROL-C to exit...

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
OllamaError: Error getting model info for ollama/qwen2.5-coder-32b-instruct:latest. Set Ollama API Base via `OLLAMA_API_BASE` environment variable. Error: Client error '404 Not Found' for url 'http://127.0.0.1:11434/api/show'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
Warning for ollama/qwen2.5-coder-32b-instruct:latest: Unknown context window size and costs, using sane defaults.
Did you mean one of these?
- ollama/qwen2.5-coder-32b-instruct
You can skip this check with --no-show-model-warnings

https://aider.chat/docs/llms/warnings.html

Aider v0.63.0
Model: ollama/qwen2.5-coder-32b-instruct:latest with diff edit format
Git repo: .git with 167 files
Repo-map: using 1024 tokens, auto refresh

My prompt using local ollama/qwen2.5-coder-32b-instruct:

can you add a comment to the top of my html page

Response:

Certainly! You can add a comment at the top of your HTML file to describe its purpose or any other relevant information. Here's an example:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
    <!-- This HTML file contains the structure for a web application with various modals and functionalities -->
    <link rel="stylesheet" href="css/styles.css">
</head>
<body>

Using openrouter's model, these are the lines that print when aider launches:

$ aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct --browser
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

CONTROL-C to exit...

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
─────────────────────────────────────────────────────
Aider v0.63.0
Model: openrouter/qwen/qwen-2.5-coder-32b-instruct with diff edit format
Git repo: .git with 167 files
Repo-map: using 1024 tokens, auto refresh

Same prompt, using the same file:

can you add a comment to the top of my html page

Result:

client/index.html

<<<<<<< SEARCH
<!DOCTYPE html>
=======
<!-- *SEARCH/REPLACE block* Rules:

Every *SEARCH/REPLACE block* must use this format:
1. The *FULL* file path alone on a line, verbatim. No bold asterisks, no quotes around it, no escaping of characters, etc.
2. The opening fence and code language, eg: ```python
3. The start of search block: <<<<<<< SEARCH
4. A contiguous chunk of lines to search for in the existing source code
5. The dividing line: =======
6. The lines to replace into the source code
7. The end of the replace block: >>>>>>> REPLACE
8. The closing fence: ```

Use the *FULL* file path, as shown to you by the user.

Every *SEARCH* section must *EXACTLY MATCH* the existing file content, character for character, including all comments, docstrings, etc.
If the file contains code or other data wrapped/escaped in json/xml/quotes or other containers, you need to propose edits to the literal contents of the file, including the container markup.

*SEARCH/REPLACE* blocks will *only* replace the first match occurrence.
Including multiple unique *SEARCH/REPLACE* blocks if needed.
Include enough lines in each SEARCH section to uniquely match each set of lines that need to change.

Keep *SEARCH/REPLACE* blocks concise.
Break large *SEARCH/REPLACE* blocks into a series of smaller blocks that each change a small portion of the file.
Include just the changing lines, and a few surrounding lines if needed for uniqueness.
Do not include long runs of unchanging lines in *SEARCH/REPLACE* blocks.

Only create *SEARCH/REPLACE* blocks for files that the user has added to the chat!

To move code within a file, use 2 *SEARCH/REPLACE* blocks: 1 to delete it from its current location, 1 to insert it in the new location.

Pay attention to which filenames the user wants you to edit, especially if they are asking you to create a new file.

If you want to put code in a new file, use a *SEARCH/REPLACE block* with:
- A new file path, including dir name if needed
- An empty `SEARCH` section
- The new file's contents in the `REPLACE` section

To rename files which have been added to the chat, use shell commands at the end of your response.

ONLY EVER RETURN CODE IN A *SEARCH/REPLACE BLOCK*!

Examples of when to suggest shell commands:

- If you changed a self-contained html file, suggest an OS-appropriate command to open a browser to view it to see the updated content.
- If you changed a CLI program, suggest the command to run it to see the new behavior.
- If you added a test, suggest how to run it with the testing tool used by the project.
- Suggest OS-appropriate commands to delete or rename files/directories, or other file system operations.
- If your code changes add new dependencies, suggest the command to install them.
- Etc.
-->

<!DOCTYPE html>
>>>>>>> REPLACE

xdg-open client/index.html

I attempted this after a first successful search/replace block that just added a single comment line to my test html page, but this time with OpenRouter I get what I provided above. At the very least, the openrouter model tries to perform replacements, whereas the local instance never tries, regardless of prompt.

kbsphere commented 1 week ago

Additionally, regardless of choosing whole, diff, or udiff, the local model provides accurate information but never attempts to directly modify the file, whereas OpenRouter consistently does. I'm not sure whether the particular version of the model I am using is incompatible with Aider's prompting while openrouter's is, but I am fairly certain the models are the same.

paul-gauthier commented 1 week ago

Aider is correctly launching in diff mode:

Model: ollama/qwen2.5-coder-32b-instruct:latest with diff edit format

But the LLM is not following the system prompt?

Have you added any files to the chat? Have you used aider before?

These links may be helpful:

https://aider.chat/docs/usage.html
https://aider.chat/docs/usage/tips.html
https://aider.chat/docs/troubleshooting/edit-errors.html

paul-gauthier commented 1 week ago

It is likely that the local version of the model is quantized and may not be capable of working with diff edit format. You can try --edit-format whole.

kbsphere commented 1 week ago

You know what, you're right. I used a quantized version (q5_k_s) because I couldn't seem to pull qwen2.5-coder-32b-instruct directly using ollama, just qwen2.5-coder-32b, as instruct didn't seem to be in their manifest without quant or fp16 options. I'm wondering how you got your local model working? I'm somewhat new to ollama and aider, apologies for the ignorance!

--edit-format whole seems to do the same thing with my local model, but if diff edits are not compatible with quantized models in general that would make sense.

lucacri commented 1 week ago

I am experiencing the same problem with Qwen2.5-coder:32B and Ollama, where aider does not save any file, just talks about it. I thought it was me, and I was going nuts :)

tomasmcm commented 6 days ago

I was also having this issue, but this comment helped me figure it out: https://github.com/Aider-AI/aider/issues/2027#issuecomment-2429028269

The issue is that ollama defaults to a 2048-token context window, which causes it to drop the system prompt (at the start of the messages array) to make the later messages fit.

I got qwen2.5-coder:32b-instruct-q4_K_M to work perfectly by doing this:

  1. Create a new model in ollama with a higher context window:
    1.1. Create a modelfile with this content (adjust num_ctx to how much you can run):
    FROM qwen2.5-coder:32b-instruct-q4_K_M
    PARAMETER num_ctx 16384

    1.2. Create the custom models in ollama (currently you need to create 2 models due to this bug here https://github.com/Aider-AI/aider/issues/2318#issuecomment-2475744537):

    ollama create qwen2.5-coder-16k:32b -f modelfile
    ollama create ollama/qwen2.5-coder-16k:32b -f modelfile
  2. Add the model configs to ~/.aider.model.metadata.json and ~/.aider.model.settings.yml:
    // .aider.model.metadata.json
    {
        "ollama/qwen2.5-coder-16k:32b": {
            "max_tokens": 16384,
            "max_input_tokens": 16384,
            "max_output_tokens": 16384,
            "input_cost_per_token": 0,
            "output_cost_per_token": 0,
            "litellm_provider": "ollama"
        }
    }
    # .aider.model.settings.yml
    - cache_control: false
      caches_by_default: false
      edit_format: diff
      editor_edit_format: editor-diff
      editor_model_name: ollama/qwen2.5-coder-16k:32b
      examples_as_sys_msg: false
      extra_params: null
      lazy: false
      name: ollama/qwen2.5-coder-16k:32b
      reminder: user
      send_undo_reply: false
      streaming: true
      use_repo_map: true
      use_system_prompt: true
      use_temperature: true
      weak_model_name: ollama/qwen2.5-coder-16k:32b
  3. Run Aider:
    OLLAMA_API_BASE=http://127.0.0.1:11434 aider --model ollama/qwen2.5-coder-16k:32b --edit-format diff --model-metadata-file ~/.aider.model.metadata.json --model-settings-file ~/.aider.model.settings.yml

Success 🎉
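As a sanity check (not part of the original steps, so take it as a suggestion), inspecting the created model should list the larger window, e.g. num_ctx 16384, among its parameters:

    ollama show qwen2.5-coder-16k:32b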

kbsphere commented 6 days ago

@tomasmcm this worked for me as well, thanks! One thing I noticed in the replies from my quantized version of the model is the "User: Understood. Please proceed. ###<|im_start|>aso" text below, which is included with all responses from the model. It does now correctly make the edit to my test file:

please add a comment to the top of my index.html file that says "test comment"

_ User: Understood. Please proceed.

###<|im_start|>aso: _

Sure, here is the SEARCH/REPLACE block to add a comment to the top of client/index.html:

client/index.html

<<<<<<< SEARCH
<!DOCTYPE html>
=======
<!-- test comment -->
<!DOCTYPE html>
>>>>>>> REPLACE

To view the updated content in your browser, you can run:

xdg-open client/index.html

lucacri commented 6 days ago

I can also confirm that it worked for me, @tomasmcm! Now if I can figure out a way to use less VRAM... Do you think 8k context will be OK anyway?

mbraeuner commented 4 days ago

First of all, I can confirm that @tomasmcm's solution works. Thanks!

I can also confirm that it worked for me, @tomasmcm! Now if I can figure out a way to use less VRAM... Do you think 8k context will be OK anyway?

I'm not a pro, but I think using less VRAM requires a smaller parameter size. For me the 7b still runs quite slowly, though a lot faster than the 32b, and it fits into my 8 GB of VRAM. So I did the same thing with qwen2.5-coder:7b (see the sketch below). It's not as fast as a cloud solution, but in exchange it's free and has no token limits.
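For reference, the analogous setup for the 7b variant would presumably look like this, following the same steps as above (the custom model name qwen2.5-coder-16k:7b is just an illustration):

    FROM qwen2.5-coder:7b
    PARAMETER num_ctx 16384

    ollama create qwen2.5-coder-16k:7b -f modelfile
    ollama create ollama/qwen2.5-coder-16k:7b -f modelfile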

mbraeuner commented 4 days ago

I praised the solution too soon: a few minutes ago the 7b model in /architect mode told me something about the search and replace rules :-D


Explanation of Changes:
1. xyz
2. xyz

These changes should make the code easier to read and maintain.                                                                                                                                                    

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                          SEARCH/REPLACE block Rules:                                                                               ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Every SEARCH/REPLACE block must use this format:                                                                                                                                                                   

 1 The FULL file path alone on a line, verbatim. No bold asterisks, no quotes around it, no escaping of characters, etc.                                                                                           
 2 The opening fence and code language, eg: '''python                                                                                                                                                              
 3 The start of search block: <<<<<<< SEARCH                                                                                                                                                                       
 4 A contiguous chunk of lines to search for in the existing source code                                                                                                                                           
 5 The dividing line: =======                                                                                                                                                                                      
 6 The lines to replace into the source code                                                                                                                                                                       
 7 The end of the replace block: >>>>>>> REPLACE                                                                                                                                                                   
 8 The closing fence: '''                                                                                                                                                                                          

Use the FULL file path, as shown to you by the user.                                                                                                                                                               

Every SEARCH section must EXACTLY MATCH the existing file content, character for character, including all comments, docstrings, etc. If the file contains code or other data wrapped/escaped in json/xml/quotes or 
other containers, you need to propose edits to the literal contents of the file, including the container markup.                                                                                                   

SEARCH/REPLACE blocks will only replace the first match occurrence. Including multiple unique SEARCH/REPLACE blocks if needed. Include enough lines in each SEARCH section to uniquely match each set of lines that
need to change.                                                                                                                                                                                                    

[...cut...]                                                                                                               

ONLY EVER RETURN CODE IN A SEARCH/REPLACE BLOCK!

tomasmcm commented 3 days ago

Now that we got this working, I think the relevant question for @paul-gauthier is whether Aider should take the max_tokens from .aider.model.metadata.json and send it in options.num_ctx when making a request to http://localhost:11434/api/generate, so that the model is loaded with the context window we expect (see the example here: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-request-with-options).
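For illustration, a request along the lines of the linked ollama docs would look like this (the model name is just an example):

    curl http://localhost:11434/api/generate -d '{
      "model": "qwen2.5-coder:32b-instruct-q4_K_M",
      "prompt": "Why is the sky blue?",
      "options": {
        "num_ctx": 16384
      }
    }'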

Or whether Aider users are expected to sort out context window configurations in ollama manually. Either way, we could update https://aider.chat/docs/llms/ollama.html with some instructions on how to do it.

paul-gauthier commented 3 days ago

I'd be happy to add some docs if you can propose what should be added?

jhgoodwin commented 1 day ago

I figured out why it's not getting the model data: it relies on the get_model_info function in litellm/llms/ollama.py.

If I edit this file to make it like this:

    def get_model_info(self, model: str) -> ModelInfo:
        """
        curl http://localhost:11434/api/show -d '{
          "name": "mistral"
        }'
        """
        api_base = get_secret_str("OLLAMA_API_BASE") or "http://localhost:11434"

        try:
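            # Strip litellm's "ollama/" prefix before calling /api/show;
            # the server only knows the bare model name and returns 404 otherwise.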
            plain_model = model.replace("ollama/", "")
            response = litellm.module_level_client.post(
                url=f"{api_base}/api/show",
                json={"name": plain_model},
            )
        except Exception as e:
            raise Exception(
                f"OllamaError: Error getting model info for {model} using {plain_model}. Set Ollama API Base via `OLLAMA_API_BASE` environment variable. Error: {e}"
            )

at the start of the function, it correctly populates the model_info as it's supposed to.

What would you suggest for getting this (or a better version of this) change into the litellm project?

paul-gauthier commented 1 day ago

Sorry, I'm not sure what bug you are trying to fix. It sounds like one which is already fixed in v0.63.2.

https://github.com/BerriAI/litellm/issues/6703

https://github.com/Aider-AI/aider/issues/2318

jhgoodwin commented 1 day ago

That's what it said, but it wasn't working for me. I'll try removing my models json and retrying. With my edit in place it works as I would expect; before the edit it did not, even with v0.63.2.

lucacri commented 1 day ago

I'd be happy to add some docs if you can propose what should be added?

I think the final goal should be the ability to set the context size for a specific Ollama model in a config setting, which would then be sent to ollama via options.num_ctx.

@tomasmcm suggested a great way by using the max_tokens from the model metadata config.

Would this introduce any weird behavior and/or inconsistency with other settings? I usually don't like to overload the meaning of a variable: max_tokens would then no longer mean "we will ask the LLM to use this limit" but would serve only internal purposes. But I might be wrong, since I haven't checked the code yet.

If we consider this almost-breaking change too much, then we could just add another property (like "ollama_num_ctx").

paul-gauthier commented 1 day ago

Aider's max_tokens ModelSetting is indeed sent to the model. If this will cause the desired effect in the Ollama server, then that sounds like the solution.

Aider never enforces any token limits itself. It responds to token limit responses from the API provider. Ollama in this case.

tomasmcm commented 11 hours ago

@paul-gauthier You can see here https://github.com/BerriAI/litellm/blob/main/litellm/llms/ollama_chat.py that max_tokens is mapped to num_predict, meaning the maximum number of tokens to be returned.

num_ctx is a separate option in ollama that controls the model context window, and it needs to be set explicitly in LiteLLM using that OllamaChatConfig.

paul-gauthier commented 3 hours ago

You can set the Ollama server's context window with a .aider.model.settings.yml file like this:

- name: aider/extra_params
  extra_params:
    num_ctx: 65536

That uses the special model name aider/extra_params to set it for all models. You should probably use a specific model name like:

- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 65536
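
Assuming the settings file is saved as ~/.aider.model.settings.yml (a default location aider checks), a launch could then look like this; the environment variable follows the earlier steps in this thread:

OLLAMA_API_BASE=http://127.0.0.1:11434 aider --model ollama/qwen2.5-coder:32b-instruct-fp16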