
Create a competitive agent with open LLMs #1085

Open neubig opened 7 months ago

neubig commented 7 months ago

What problem or use case are you trying to solve?

Currently OpenDevin somewhat works with the strongest closed LLMs such as GPT-4 or Claude Opus, but we have not confirmed good results with open LLMs that can be run locally. We would like to create a formula to achieve competitive results with local LMs.

Do you have thoughts on the technical implementation?

This will require a strong (perhaps fine-tuned) coding agent LLM. It will probably have to be tuned on top of strong code LLMs such as CodeLlama, StarCoder, DeepSeek Coder, or some other yet-to-be-released model.

rezzie-rich commented 7 months ago

The user should be able to choose a single LLM or multiple LLMs to power the agents. For example, Mixtral could power the generalized agents, DeepSeek Coder could power the code-generating agents, and WhiteRabbitNeo could power the testing/cybersecurity agents. This way, only one LLM is active at a time (the one backing the currently active agent), and multiple niche-specific open LLMs could collaborate to outperform private LLMs like GPT-4 while running locally on consumer-grade hardware.
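
A minimal sketch of what such per-agent model routing could look like (the agent roles, model identifiers, and helper function below are purely illustrative, not actual OpenDevin configuration):

```python
# Hypothetical per-agent model routing: each agent role is served by a
# different locally hosted open LLM, and only the model bound to the
# currently active agent needs to be loaded at any given time.
AGENT_MODEL_MAP = {
    "planner": "mistralai/Mixtral-8x7B-Instruct-v0.1",      # generalist reasoning
    "coder": "deepseek-ai/deepseek-coder-33b-instruct",      # code generation
    "security_tester": "WhiteRabbitNeo/WhiteRabbitNeo-13B",  # testing / security
}

DEFAULT_MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"


def get_model_for_agent(agent_role: str) -> str:
    """Return the model identifier configured for the given agent role."""
    return AGENT_MODEL_MAP.get(agent_role, DEFAULT_MODEL)


if __name__ == "__main__":
    # Only the coder model would need to be loaded for this step.
    print(get_model_for_agent("coder"))
```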

JayQuimby commented 7 months ago

I think the models need to be "self-prompting"

In my experience with OpenDevin, it often gets close to doing what I want but falls short of the goal, and then either starts repeating the same command or just does something random.

It would be interesting to use two distinct prompting strategies so that the model effectively has a conversation with itself. The first prompt would ask the model to look at its previous actions and the goal and come up with a plan for the next action it could take. The second prompt would then ask the agent to perform an action based on the thoughts produced by the first response.

I think this would offer the agent more flexibility and give it more ability to guide itself towards a better in-context solution than any static prompt template can. The downside is that you need two model queries per action instead of one.
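
A rough sketch of such a two-query plan-then-act loop, assuming a generic `query_llm(prompt)` helper and a plain action history (none of these names come from the OpenDevin codebase):

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to whichever local or hosted LLM is configured."""
    raise NotImplementedError


def step(goal: str, history: list[str]) -> str:
    """One agent step using two prompts: first plan, then act."""
    # Prompt 1: reflect on the goal and previous actions, propose the next step.
    plan_prompt = (
        f"Goal: {goal}\n"
        "Previous actions:\n" + "\n".join(history) +
        "\nThink step by step and describe the single best next action."
    )
    plan = query_llm(plan_prompt)

    # Prompt 2: turn the plan into a concrete action (e.g., a shell command).
    act_prompt = (
        f"Goal: {goal}\n"
        f"Plan for the next step: {plan}\n"
        "Output only the exact command or code edit to execute."
    )
    action = query_llm(act_prompt)
    history.append(action)
    return action
```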

Also, Microsoft just released WizardLM 2, and it is way better than anything else I have tried locally so far.

chrisbraddock commented 7 months ago

gpt-pilot is quite good at this. Try it out to get an idea. I think there are planner and reviewer agents for each step.

I kind of wish OpenDevin incorporated gpt-pilot for the engine.

Jiayi-Pan commented 6 months ago

A nice way to improve open-source LLMs is to fine-tune them with trajectories from stronger models like GPT-4. Bonus points if we can filter out the bad ones.

One way to achieve this at scale, similar to WildChat, is to provide officially hosted OpenDevin interfaces that come with a free GPT-4-based backend. In exchange for using these agents for free, users would sign up to allow free distribution of the data and to rank the quality of the agents' performance for us.

I imagine this could be used to:

  1. Obtain diverse, high-quality trajectories to fine-tune open agents (see the sketch below).
  2. As an easy-to-start demo, attract more users.
  3. Potentially use human preference data to create a Chatbot Arena equivalent for coding agents.
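
One hedged sketch of what the filtering/export step for (1) could look like, assuming trajectories are stored as JSON records with a user-assigned rating (the schema, field names, and threshold below are made up for illustration, not an actual OpenDevin format):

```python
import json


def export_finetuning_data(trajectories: list[dict], out_path: str,
                           min_rating: int = 4) -> int:
    """Keep only highly rated trajectories and write them out as JSONL in a
    simple messages format suitable for supervised fine-tuning.

    Assumes each trajectory dict looks like:
    {"rating": int, "task": str, "steps": [{"role": str, "content": str}, ...]}
    (an illustrative schema, not an actual OpenDevin format).
    """
    kept = 0
    with open(out_path, "w") as f:
        for traj in trajectories:
            if traj.get("rating", 0) < min_rating:
                continue  # filter out trajectories users judged to be bad
            record = {
                "messages": [{"role": "user", "content": traj["task"]}]
                + traj["steps"],
            }
            f.write(json.dumps(record) + "\n")
            kept += 1
    return kept
```
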
xingyaoww commented 6 months ago

Thanks @Jiayi-Pan!! All of the bullet points mentioned are actually on our roadmap :))

Jiayi-Pan commented 6 months ago

> Thanks @Jiayi-Pan!! All of the bullet points mentioned are actually on our roadmap :))

Amazing, and thanks for the pointer! I will have a look and see what I can contribute.

xingyaoww commented 6 months ago

@Jiayi-Pan We are currently thinking about re-purposing existing agent-tuning datasets (e.g., code and agent-tuning data) for (1), so we can have a preliminary v0.1 OSS model :)

BradKML commented 5 months ago

Also, does this feel like a technical foundation for building fine-tuning toolkits by generating quasi-synthetic data?

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

neubig commented 2 months ago

We're still working on this!

dorbanianas commented 2 months ago

Hey @neubig, sorry for being late; I was a bit busy these days. I was working on a small version, but I had some resource limitations, so I didn't make much progress.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

BradKML commented 2 weeks ago

@Jiayi-Pan here are a few somewhat leading questions:

  1. What would the architecture of a competitive coding LLM Arena look like (see the sketch after this list)? Would agents be allowed to run their code multiple times to debug (and without limit for paywalled models)? Which judging criteria should we prioritize (code runtime vs. code generation and debugging time)?
  2. What would the architecture of a fine-tuning dataset generator look like? Should we include every single coding problem alongside codebase debugging problems? Should we include diverse programming languages (including ones that would have memory issues)? Should we mix pure implementations with library use?
  3. (on a meta-level) Will the LLM be allowed to self-document programming methodologies (e.g. DS&A, design patterns, ML knowledge) between different mock benchmarks? If so, where would the mock benchmarks be sourced from, so that they stay distinct from the core dataset used to compare against other SWE architectures?
  4. (bonus question) How would Chain-of-Thought and other adjacent architectures be handled? This is different from just picking LLM architectures where tokens are being predicted; it is instead about how to turn token outputs back into inputs. https://arxiv.org/html/2401.14295v3
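
For question 1, a rough sketch of how a per-submission debug budget and timing-based judging criteria could be wired together (the `generate_fix` callback, result fields, and retry limit are all assumptions for illustration, not part of any existing OpenHands benchmark):

```python
import subprocess
import time


def evaluate_submission(generate_fix, test_cmd: list[str],
                        max_debug_attempts: int = 3) -> dict:
    """Give an agent a bounded number of debug attempts against a test suite,
    tracking both generation/debugging time and test runtime so either one
    can be used as a judging criterion."""
    gen_seconds = 0.0
    test_seconds = 0.0
    last_output = ""

    for attempt in range(1, max_debug_attempts + 1):
        # The agent edits the workspace, optionally using feedback from the
        # previous failing test run.
        t0 = time.time()
        generate_fix(last_output)
        gen_seconds += time.time() - t0

        # Run the task's test suite and record how long it takes.
        t1 = time.time()
        proc = subprocess.run(test_cmd, capture_output=True, text=True)
        test_seconds += time.time() - t1
        last_output = proc.stdout + proc.stderr

        if proc.returncode == 0:
            return {"passed": True, "attempts": attempt,
                    "generation_seconds": gen_seconds,
                    "test_runtime_seconds": test_seconds}

    return {"passed": False, "attempts": max_debug_attempts,
            "generation_seconds": gen_seconds,
            "test_runtime_seconds": test_seconds}
```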