OpenInterpreter / open-interpreter

A natural language interface for computers
http://openinterpreter.com/
GNU Affero General Public License v3.0

How to Extract Complete, Non-redundant, and Correct Code from Messages When Testing on Benchmarks like HumanEval? #1216

Open · huoliangyu opened 5 months ago

huoliangyu commented 5 months ago

Is your feature request related to a problem? Please describe.

No response

Describe the solution you'd like

Hello,

I am exploring the effectiveness of open-interpreter on benchmarks like HumanEval and have encountered some challenges with the code generation process. Specifically, I've noticed that sometimes the interpreter only plans but does not generate actual code, and sometimes the generated code contains errors and requires multiple rounds of modification.

Could you please advise on how best to extract complete, non-redundant, and correct code from messages to automatically test on HumanEval?

Thank you!

Describe alternatives you've considered

No response

Additional context

No response

Steve235lab commented 5 months ago

Except for the "non-redundant" part, your requirements can be met with some well-designed custom instructions: you need to tell the LLM the expected way to respond. However, "non-redundant" conflicts with "correct" in most cases because of the limited ability of current LLMs. They need to debug their code several times to reach a final correct version, just like human programmers, which means there is always some redundant code in the conversation history. You may need to find your own way to filter the history to get the final correct code.
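
For the filtering step, a minimal sketch along these lines might work, assuming the Python API of recent OI versions (where `interpreter.chat(..., display=False)` returns a list of LMC-style dicts with `role`, `type`, `format`, and `content` keys; check the message schema in your installed version):

```python
from interpreter import interpreter

def last_code_block(messages):
    """Walk the history backwards and return the content of the last
    assistant code message; earlier code blocks are treated as
    superseded debugging attempts."""
    for msg in reversed(messages):
        if msg.get("role") == "assistant" and msg.get("type") == "code":
            return msg.get("content")
    return None

# display=False suppresses terminal rendering; chat() returns the full
# message history for the conversation.
messages = interpreter.chat(
    "Write a Python function that reverses a string.", display=False
)
final_code = last_code_block(messages)
print(final_code)
```

This only recovers the last code block; if the model splits its final answer across several blocks, you would need to re-assemble them yourself.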

huoliangyu commented 5 months ago

> Except for the "non-redundant" part, your requirements can be met with some well-designed custom instructions: you need to tell the LLM the expected way to respond. However, "non-redundant" conflicts with "correct" in most cases because of the limited ability of current LLMs. They need to debug their code several times to reach a final correct version, just like human programmers, which means there is always some redundant code in the conversation history. You may need to find your own way to filter the history to get the final correct code.

Thank you for your quick response! Could you suggest any suitable prompt templates or methods for extracting code to test open-interpreter's performance on HumanEval? In my tests (where I've designed prompts to ensure the agent always outputs code), the performance of GPT-3.5 with open-interpreter seems somewhat inferior to using GPT-3.5 directly. Any good advice would be greatly appreciated!
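
For illustration, a custom instruction in that spirit (a hypothetical sketch, not the exact prompt used in these tests) could be set via `custom_instructions`:

```python
from interpreter import interpreter

# Hypothetical instruction to force code-only answers; not the exact
# prompt used in the tests described above. Tune for your own runs.
interpreter.custom_instructions = (
    "You are completing Python functions for an automated benchmark. "
    "Always reply with exactly one Python code block containing the "
    "complete function definition. Do not plan, explain, or ask questions."
)
```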

Steve235lab commented 5 months ago

GPT-3.5 is provided as a RESTful API by OpenAI, so I'm not sure what "using GPT-3.5 directly" means. Curling the API directly? If you mean OpenAI's ChatGPT, then since the system prompts of ChatGPT are the property of OpenAI, it's hard to compose something better than them. There are some tricks on the Internet for extracting ChatGPT's system prompts; maybe you can try those.

Steve235lab commented 5 months ago

By the way, the default embedded system prompt of OI may not be suitable for your task; it focuses heavily on telling the LLM how to handle OI's special message types. If custom instructions can't solve your problem, you can try modifying the embedded system prompt in the OI source code.
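
If you go that route, recent OI versions also expose the embedded prompt as `interpreter.system_message`, so a sketch like the following (assuming that attribute exists in your version) avoids editing the source directly:

```python
from interpreter import interpreter

# Replacing system_message wholesale drops OI's default guidance about
# its special message types, which may be unnecessary for a pure
# code-generation benchmark.
interpreter.system_message = (
    "You are a careful Python programmer. For each task, write one "
    "complete, self-contained solution in a single Python code block."
)
```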

huoliangyu commented 4 months ago

Thank you for your reply. I will try these methods, and I look forward to OI's continued updates.