Aider-AI / aider

aider is AI pair programming in your terminal
https://aider.chat/
Apache License 2.0

Prompting strategies #110

Closed: tmm1 closed this issue 1 year ago

tmm1 commented 1 year ago

Thanks for all your work investigating and benchmarking various prompting strategies.

https://github.com/paul-gauthier/aider/issues/20#issuecomment-1620257701

I was curious if there are other strategies you tried and how well they worked. Ideally an LLM could generate unified diffs directly, but that seems quite challenging, even for GPT4.

I plan to experiment with a simple line-based strategy, but wanted to run it by you in case you had thoughts. If all file listings had <line no>: prefixes on each line, I think a model could be asked to return only the lines it wanted to change. This would reduce a lot of the repetition still seen with the block-diff strategy and GPT-4.
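
Roughly, I'm imagining something like this minimal sketch (the helper names are made up for illustration, not aider's actual format):

def number_lines(text):
    # Prefix every line with a 1-based "<line no>: " marker before showing it to the model.
    return "\n".join(f"{i}: {line}" for i, line in enumerate(text.splitlines(), 1))

def apply_numbered_edits(text, edits):
    # edits maps a line number to the replacement text the model sent back for that line.
    lines = text.splitlines()
    for lineno, new_line in edits.items():
        lines[lineno - 1] = new_line
    return "\n".join(lines)

original = "def add(a, b):\n    return a - b"
print(number_lines(original))
# The model would ideally reply with just: 2:     return a + b
print(apply_numbered_edits(original, {2: "    return a + b"}))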

tmm1 commented 1 year ago

I played a bit with this today and had some success with GPT-4. I discovered that if I ask for the range of line numbers at the top of an edit block, they are usually off by one. But when I ask for the range at the end of the block, after the changed lines, it works more reliably.

(screenshot attachment: IMG_0321)

EDIT: my prompt changes are here

paul-gauthier commented 1 year ago

Nice, that looks promising. I am exploring something similar on a branch myself right now.

I have previously experimented with line numbers (and unified diff, and a bunch of other things). Most of that work predates the benchmarking suite, so I was working off anecdotal evidence and best intuitions. I can't recall the specific observations that moved me away from line numbers and towards diffs at that time.

I'll share some results/learnings once I find the time to make progress on my line number branch.

tmm1 commented 1 year ago

To run the benchmarks, do I need to clone https://github.com/exercism/python into tmp.benchmarks/exercism-python?

I built the Docker image and then ran rungrid inside it, but I see this error:

FileNotFoundError: [Errno 2] No such file or directory: '/benchmarks/2023-07-17-19-59-35--rungrid-gpt-3.5-turbo-0613-whole-repeat-1/exercises/.docs/instructions.md'

EDIT: Figured out I need to copy the practice folders in: cp -a ~/code/exercism-python/exercises/practice tmp.benchmarks/exercism-python

tmm1 commented 1 year ago

I experimented a bit more with line-based editing today. Asking GPT-4 to emit simple code blocks, each followed immediately by an instruction to replace/insert-above/insert-below/delete, seems to work well and allows the model to make changes in multiple places very efficiently:

(screenshot attachment: IMG_0323)

However, it's challenging to apply these changes, because as soon as you apply the first one, the line numbers referenced by the remaining edits shift.

paul-gauthier commented 1 year ago

However, it's challenging to apply these changes, because as soon as you apply the first one, the line numbers referenced by the remaining edits shift.

Interesting progress! Can you apply them highest->lowest line numbers?
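
Something along these lines (a minimal sketch of the idea, not aider's implementation):

def apply_edits(lines, edits):
    # edits: list of (start, end, replacement_lines) with 1-based, inclusive ranges.
    # Applying from the highest start line downward means an edit never shifts the
    # line numbers of the edits that are still pending.
    for start, end, replacement in sorted(edits, key=lambda e: e[0], reverse=True):
        lines[start - 1:end] = replacement
    return lines

source = ["a", "b", "c", "d"]
print(apply_edits(source, [(1, 1, ["a", "a2"]), (3, 4, ["cd"])]))
# -> ['a', 'a2', 'b', 'cd']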

paul-gauthier commented 1 year ago

EDIT: Figured out I need to copy the practice folders in: cp -a ~/code/exercism-python/exercises/practice tmp.benchmarks/exercism-python

Yes, that's correct.

Sorry the benchmarking tools are not packaged up very well. I wasn't expecting anyone other than me to use them. I will try and put some effort into polishing them up, adding a README, etc.

Be warned that it costs about $15 to process all 133 exercises with aider when using the gpt-4 model.

tmm1 commented 1 year ago

Sorry the benchmarking tools are not packaged up very well. I wasn't expecting anyone other than me to use them. I will try and put some effort into polishing them up, adding a README, etc.

Be warned that it costs about $15 to process all 133 exercises with aider when using the gpt-4 model.

No worries, it wasn't hard to figure out how to run the various scripts, but some docs would be great.

I was able to run the "whole" tests for one iteration using gpt-3.5 for $1.72. I got a similar result to you, ~54% IIRC.

I then tried to run the "whole" benchmarks against LocalAI + WizardCoder and StarCoderPlus, but neither was able to follow aider's instructions well enough to score above 0%. I will try again with llama2 chat models, and I also want to try fine-tuning some open-source models on aider's prompt syntax (hence the investigation into prompting strategies before I invest in building a training dataset).

Interesting progress! Can you apply them highest->lowest line numbers?

That's clever!

Unfortunately I'm souring on the line-number approach. It seems very finicky and hard to make reliable. I will experiment with prompts more; maybe asking GPT-4 to reason step-by-step further will help. Instead of just asking for a list of files first, maybe having the model decide on a list of line ranges and operations upfront would also lead to better results.
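
For illustration, such an upfront plan might look something like this (the structure and field names are hypothetical, not an aider format):

from dataclasses import dataclass

@dataclass
class PlannedEdit:
    path: str    # file to modify
    start: int   # first affected line, 1-based
    end: int     # last affected line, inclusive
    op: str      # "replace", "insert-above", "insert-below", or "delete"

# The model would be asked to commit to a plan like this first,
# then emit the replacement code for each entry in a second step.
plan = [
    PlannedEdit("calculator.py", 12, 14, "replace"),
    PlannedEdit("calculator.py", 40, 40, "insert-below"),
]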

I'm also considering using something like gpt-prompt-engineer

paul-gauthier commented 1 year ago

That's really interesting that you were able to benchmark against LocalAI. Thanks for doing that and sharing the (disappointing) outcome.

I can't say I am super surprised. The Claude models are the only ones I've seen discussed as potentially rivaling OpenAI's with respect to coding tasks.

I've had similar results from experiments with line number based edit formats. It's surprisingly difficult to get reliable edits from even GPT-4.

tmm1 commented 1 year ago

My hunch is that once GPT-4 allows fine-tuning, reliability could be improved drastically with a training set of examples using line-number edits.

I played a bit with Llama2 7B chat (llama-2-13b-chat.ggmlv3.q4_K_M.bin) and the results are very encouraging. It manages to output edit blocks sometimes, and can handle whole-file listings with relative ease. Fine-tuning may push it over the edge enough to be useful. I will try to run the benchmarks and report the results.

tmm1 commented 1 year ago

I have been experimenting with llama2 and managed to wire up aider to it.

If you have an NVIDIA GPU and a CUDA environment, you can do this very simply:

pip3 install git+https://github.com/lm-sys/FastChat@main
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf

then use aider --openai-api-base http://localhost:8000/v1

funkytaco commented 1 year ago

@tmm1 if you use llama2, does that mean your API calls would be free? I see the mention of openai.api_server, which is what has me wondering (that's probably just a misleadingly named setting?)

tmm1 commented 1 year ago

Yes, all local and free. The vllm project has an OpenAI-compatible API module, which is what I'm using here.

I've also been able to use text-generation-webui's openai extension and will document those commands later today.

I think GGML will be the most portable way to integrate, so I'm working through that now. I'm able to simulate an aider session prompt using the llama.cpp chat CLI and have been using that to test responses.

tmm1 commented 1 year ago

I've also been able to use text-generation-webui's openai extension and will document those commands later today.

text-generation-webui

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
pip install -r extensions/openai/requirements.txt

then, if using CUDA:

python3 download-model.py TheBloke/Llama-2-13B-chat-GPTQ
python3 server.py --listen --extensions openai --loader exllama --model TheBloke_Llama-2-13B-chat-GPTQ

otherwise to run on CPU:

python3 download-model.py TheBloke/Llama-2-13B-chat-GGML
python3 server.py --listen --extensions openai --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML

The OpenAI-compatible API will be listening on http://0.0.0.0:5001/v1.

Some small changes (de1616eb314f9fbb16c3b5d1bebcef1ecc298e42) are required to get aider working. I will try to clean them up and send some PRs.

funkytaco commented 1 year ago
python3 server.py --listen --trust-remote-code --extensions openai --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML

I thought it doesn't need openai? This doesn't seem to be working on a Mac M1 either.

funkytaco commented 1 year ago

I think maybe the model was too much for my 2020 M1. I moved out a lot of the .bin files and used TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q4_K_S.bin.

It's really slow, but it works, so I guess I need to get a GPU.

Ichigo3766 commented 1 year ago

@tmm1 Would you try doing this with the Hugging Face Text Generation Inference (TGI) server? There's a LangChain wrapper that can be used to connect to the LLM. Unfortunately there's no OpenAI-style API for it, but rather something like this:

from langchain.callbacks import streaming_stdout
from langchain.llms.huggingface_text_gen_inference import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    inference_server_url="",  # URL of the running TGI server
    top_k=10,
    top_p=0.7,
    temperature=0.01,
    stream=True,
    repetition_penalty=1.1,
    max_new_tokens=6000,
    callbacks=[streaming_stdout.StreamingStdOutCallbackHandler()],
)
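
Presumably it can then be called like any other LangChain LLM, e.g. (the prompt is just an example):

print(llm("Write a Python function that reverses a string."))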

This would require some work, but it would be sick, as TGI is a production-level API and is very fast! I have been running it, and being able to use a coding model alongside aider would be a huge deal!

tmm1 commented 1 year ago

How fast is TGI for you vs other methods? In my testing, exllama and vllm were the fastest, but I didn't try TGI. Unfortunately, without an OpenAI-compatible API it will not be straightforward to integrate.

I had some issues with text-generation-webui when used with GGML (with GPTQ + exllama it works great). I would recommend using LocalAI instead if you need GGML:

git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build
cd models
wget --continue https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin
cat > llama2.yaml <<EOF
name: llama2
backend: llama
context_size: 2048
max_tokens: 512
parameters:
  model: llama-2-7b-chat.ggmlv3.q4_K_M.bin
  temperature: 0.6
  top_k: 80
  top_p: 0.7
template:
  chat_message: llama2-chat
EOF
cat > llama2-chat.tmpl <<EOF
{{if eq .RoleName "assistant"}}{{.Content}}{{else}}
[INST]
{{if .SystemPrompt}}{{.SystemPrompt}}{{else if eq .RoleName "system"}}<<SYS>>{{.Content}}<</SYS>>

{{else if .Content}}{{.Content}}{{end}}
[/INST] 
{{end}}
EOF
cd ..
./local-ai --debug
aider --model llama2 --openai-api-base http://localhost:8080/v1 --edit-format whole

Ichigo3766 commented 1 year ago

TGI is crazy fast. Now they support GPTQ as well, using the exllama kernel. vllm is not as fast, but yeah, it has an OpenAI API.

tmm1 commented 1 year ago

I tested TGI and it was much slower than using exllama directly. Not sure why. Maybe it's more tuned for the A100 than the 3090 I'm using.

That said, the API for TGI is very simple (https://huggingface.github.io/text-generation-inference/) and there's a Python wrapper too (pip install text_generation), so perhaps it would be worth adding support for it as a backend in aider.
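
For what it's worth, a rough sketch of what calls through that wrapper might look like (the URL and prompts are placeholders, and this is not an existing aider backend):

from text_generation import Client

client = Client("http://127.0.0.1:8080")  # wherever the TGI server is listening

# one-shot generation
response = client.generate("def fib(n):", max_new_tokens=64)
print(response.generated_text)

# streaming, token by token
for chunk in client.generate_stream("def fib(n):", max_new_tokens=64):
    if not chunk.token.special:
        print(chunk.token.text, end="")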

Ichigo3766 commented 1 year ago

Interesting. But exllama is not a production-ready server; it may be fast for a single user, but in a multi-user scenario TGI is way ahead due to its sharding mechanism. Overall, it would be nice to have this, and since many people are using TGI, it would be cool to use WizardCoder with aider through it. @paul-gauthier

paul-gauthier commented 1 year ago

FYI, we just added an #llm-integrations channel on the discord, as a place to discuss using aider with alternative or local LLMs.

https://discord.gg/X9Sq56tsaR

paul-gauthier commented 1 year ago

Closing this for now. See https://github.com/paul-gauthier/aider/issues/172 for more info.