Open huoliangyu opened 6 months ago
Except the "non-redundant", your requirements can be done with some well designed custom instructions. You need to tell the LLM the expected way to response. However, the "non-redundant" is conflict with correct in most cases because of the limited ability of current LLMs, they need to debug their code for several times to give a final correct version, just like normal human programmers, which means there are always some redundant code in the conversations history. You may need to find a way by yourself to filter the history to get the final correct code.
Except the "non-redundant", your requirements can be done with some well designed custom instructions. You need to tell the LLM the expected way to response. However, the "non-redundant" is conflict with correct in most cases because of the limited ability of current LLMs, they need to debug their code for several times to give a final correct version, just like normal human programmers, which means there are always some redundant code in the conversations history. You may need to find a way by yourself to filter the history to get the final correct code.
Thank you for your quick response! Could you suggest any suitable prompt templates or methods for extracting code to test open-interpreter's performance on HumanEval? In my tests (where I designed prompts so that the agent always outputs code), the performance of GPT-3.5 with open-interpreter seems somewhat inferior to using GPT-3.5 directly. Any advice would be greatly appreciated!
GPT-3.5 is provided as a RESTful API by OpenAI, so I'm not sure what "using GPT-3.5 directly" means. Do you mean calling the API directly with curl? If you mean ChatGPT from OpenAI, then since ChatGPT's system prompts are proprietary to OpenAI, it's hard to compose something better than that. There are some tricks on the Internet for getting ChatGPT's system prompts leaked; maybe you can try those.
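For reference, a minimal sketch of what calling GPT-3.5 "directly" usually means, using the v1+ `openai` Python client (the model name and prompts here are just placeholders):

```python
# Minimal sketch: hitting the chat completions API with no agent loop,
# so the only system prompt is the one you supply yourself.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a Python coding assistant."},
        {"role": "user", "content": "Write a function that reverses a string."},
    ],
)
print(response.choices[0].message.content)
```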
By the way, the default embedded system prompt of OI may not be suitable for your task; it focuses heavily on telling the LLM how to handle OI's special message types. If custom instructions can't solve your problem, you can try modifying the embedded system prompt in the OI source code.
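For example, here is a rough sketch of steering the prompt from the Python API instead of editing the source; `custom_instructions`, `system_message`, and `chat(..., display=False)` match recent open-interpreter releases, so double-check the names against the version you have installed:

```python
# Rough sketch: override OI's prompting from Python rather than patching
# the source. Attribute names are from recent open-interpreter releases;
# verify them against your installed version.
from interpreter import interpreter  # older releases use `import interpreter`

# Append task-specific guidance to the default system prompt...
interpreter.custom_instructions = (
    "Always answer with a single runnable Python code block that fully "
    "solves the task. Do not stop after planning; write the code."
)

# ...or replace the embedded system prompt entirely.
# interpreter.system_message = "You are a Python code generator. ..."

messages = interpreter.chat(
    "Write a function that checks a list for duplicates.", display=False
)
```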
Thank you for your reply. I will try these methods and look forward to OI's continued updates.
Is your feature request related to a problem? Please describe.
No response
Describe the solution you'd like
Hello,
I am exploring the effectiveness of open-interpreter on benchmarks like HumanEval and have encountered some challenges with the code generation process. Specifically, I've noticed that the interpreter sometimes only plans but does not generate actual code, and that the generated code sometimes contains errors and requires multiple modifications.
Could you please advise on how best to extract complete, non-redundant, and correct code from messages to automatically test on HumanEval?
Thank you!
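For concreteness, the kind of harness I have in mind looks roughly like this; the `extract_final_code` helper, the open-interpreter attribute names, and the message format are assumptions about recent versions of open-interpreter and OpenAI's human-eval package, not something I have verified end to end:

```python
# Rough sketch: feed each HumanEval prompt to open-interpreter, keep the
# last code block, and write samples.jsonl for human-eval's
# evaluate_functional_correctness tool. Names and message keys are
# assumptions; adjust to your installed versions.
from human_eval.data import read_problems, write_jsonl
from interpreter import interpreter

def extract_final_code(messages):
    """Last assistant code block in an open-interpreter conversation."""
    for msg in reversed(messages):
        if msg.get("role") == "assistant" and msg.get("type") == "code":
            return msg["content"]
    return ""

samples = []
for task_id, problem in read_problems().items():
    interpreter.messages = []  # start each task from a clean history
    history = interpreter.chat(problem["prompt"], display=False)
    # The extracted block may need post-processing: HumanEval expects code
    # that continues the prompt, not necessarily a full re-definition.
    samples.append({"task_id": task_id, "completion": extract_final_code(history)})

write_jsonl("samples.jsonl", samples)
# Then run: evaluate_functional_correctness samples.jsonl
```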
Describe alternatives you've considered
No response
Additional context
No response