Teskh opened this issue 2 months ago
LLM-generated TL;DR:
Original rant: As a local-first user this looks very interesting. If I understood correctly, running multiple requests in parallel is very efficient. I wonder how well this would work with smaller models like Codestral or even 8B coding models.
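To make the parallel-requests idea concrete, here's a minimal sketch. `complete()` is a hypothetical stub standing in for a call to a batching-capable serving engine (an OpenAI-compatible endpoint, say); the point is just that firing n identical requests concurrently lets a batching server fold them into shared forward passes.

```python
from concurrent.futures import ThreadPoolExecutor

def complete(prompt: str) -> str:
    # Hypothetical stand-in for a request to a batching serving engine;
    # a real client call (HTTP, openai SDK, etc.) would go here.
    return f"completion for: {prompt}"

def parallel_attempts(prompt: str, n: int) -> list[str]:
    # Fire n identical requests at once; a batching server can process
    # them together, so n attempts cost far less than n sequential calls.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(complete, [prompt] * n))

attempts = parallel_attempts("refactor this function", 4)
```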
It shouldn't be too difficult to get a POC of this plugged into Aider. And it's an excuse to finally play with a batching-capable serving engine. Now to find some time...
Edit: just remembered Wilmer, which might be a good option to avoid making (too many) changes to Aider. Individual tools should be good at their one specific task and all that. https://github.com/SomeOddCodeGuy/WilmerAI/
Rabbit hole edit: I was thinking about using grammars to force the model to stick to the requested format better. I'm not sure how GGUF handles it, but with exllamav2 it looks very feasible to implement this with schemas and output limiting. Aider already evaluates the output pretty well. https://github.com/turboderp/exllamav2/blob/master/examples/inference_json.py
Note how we could, for example, build a Literal list of possible file paths.
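A sketch of the Literal idea, using only the standard library (the file paths are hypothetical). The `Literal` is turned into a JSON-schema `enum`; a schema-constrained sampler (like the exllamav2 example linked above) could then only ever emit one of these exact strings for the `path` field.

```python
from typing import Literal, get_args

# Hypothetical set of files the repo map says are editable.
EditablePath = Literal["src/main.py", "src/utils.py", "README.md"]

# JSON-schema enum built straight from the Literal; a grammar/schema
# filter would restrict sampling of "path" to these exact values.
path_schema = {
    "type": "object",
    "properties": {
        "path": {"type": "string", "enum": list(get_args(EditablePath))},
    },
    "required": ["path"],
}
```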
Then during the coder phase we could ban phrases like "rest of code" (there's also an example for that). It might even be possible to loop whenever a banned phrase slips through, adding each new variation to the ban list until all variants of the unwanted phrase are banned. Unfortunately some (maybe most?) models are so stubborn that they'll just output gibberish when there is no token sequence available for what they intend to say. I can only think of two ways around this:
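The ban-and-resample loop above can be sketched as follows. `generate()` is a hypothetical stand-in for a constrained sampler that honors a ban list of exact strings; the candidate outputs are made up to show how each resample can surface a new variation of the phrase until a clean output appears.

```python
import re

# Detector for the unwanted phrase and its common variations.
BAD = re.compile(r"rest of (the )?code", re.IGNORECASE)

def generate(banned: set[str]) -> str:
    # Hypothetical sampler respecting a ban list of exact strings.
    candidates = ["# rest of code", "# ...rest of the code", "def foo(): pass"]
    for c in candidates:
        if c not in banned:
            return c
    return ""  # nothing left to say -> this is where gibberish risk starts

def generate_clean(max_rounds: int = 5) -> str:
    banned: set[str] = set()
    out = generate(banned)
    # Each time the evaluator spots the phrase, ban that exact variation
    # and resample, until the output is clean or we give up.
    for _ in range(max_rounds):
        if not BAD.search(out):
            return out
        banned.add(out)
        out = generate(banned)
    return out
```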
Exllamav2 also has some batching examples. I'm excited.
Issue
Cheap models (like 4o-mini) seem to get performance comparable to SOTA at a lower price when n simultaneous attempts are run in parallel (each attempt is fed back into itself with the prompt "make it better" m times). An evaluator then picks the output it likes best, which is then applied.
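The n-attempts / m-refinements / evaluator scheme might look roughly like this. `llm()` and `score()` are hypothetical stubs (a real version would call the cheap model and a judge prompt); the control flow is the part being illustrated.

```python
import random

def llm(prompt: str, seed: int) -> str:
    # Stub model: deterministic per seed, tags output with a fake quality.
    rng = random.Random(seed)
    return f"{prompt} [quality={rng.random():.3f}]"

def score(output: str) -> float:
    # Stub evaluator: reads the fake quality tag back out.
    return float(output.rsplit("=", 1)[1].rstrip("]"))

def best_of(task: str, n: int = 4, m: int = 2) -> str:
    candidates = []
    for i in range(n):
        out = llm(task, seed=i)
        for j in range(m):
            # Feed each attempt back into itself m times.
            out = llm(out + " make it better", seed=i * 100 + j)
        candidates.append(out)
    # Evaluator picks the output it likes best; that one gets applied.
    return max(candidates, key=score)

winner = best_of("fix the bug")
```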
Reference: https://www.youtube.com/watch?v=0Z2BQPuUY50&t=979s
PS: loving Aider. Thanks a lot.
Version and model info
Aider v0.45.1