SparksofAGI / MHPP

https://sparksofagi.github.io/MHPP/

No ground-truth code or unit test cases #1

Open sinsauzero opened 6 months ago

sinsauzero commented 6 months ago

I see that your data contains only prompts, without unit test cases or ground-truth (GT) code. Do you have plans to update it?

1e0ndavid commented 6 months ago

Hello, thank you for your interest in our work! To prevent data leakage, we will temporarily release only the prompt part of the test data. If you would like to evaluate your model, please click "File a request" on the leaderboard page, organize your results in a jsonl file (please make sure it includes the question ID, function name, model response, and other necessary fields), and then upload the results by filing an issue. We will compute the detailed results and return them to you.
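
For illustration only, one line of such a jsonl file might look like the sketch below; the field names here are placeholders chosen for the example, not the repository's official schema, so please follow whatever the leaderboard page and issue template actually require:

```jsonl
{"task_id": "MHPP/0", "function_name": "solve", "response": "def solve(nums):\n    return sorted(nums)"}
```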


YerbaPage commented 5 months ago


Thanks for your hard work on this excellent project. I'm wondering whether there are any plans to release the test cases (the GT code is not necessary, which avoids it being collected into training sets)?

I'm working on methods to improve models' ability to solve problems more difficult than those in HumanEval. Including these test cases would not only enable a more straightforward and efficient evaluation process for us, but would also likely increase the impact and recognition of your valuable work.

1e0ndavid commented 5 months ago

Thank you for your interest in our work. We have decided to release 3 test cases for each problem. We are not sure if this will suffice, but please check the MHPP.jsonl file in the data directory. Should you have any further questions, feel free to ask.
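
As a rough illustration of how the released cases might be used locally, here is a minimal sketch; the field names `name`, `prompt`, and `test_list` are assumptions about the MHPP.jsonl schema rather than anything confirmed in this thread, and the test entries are assumed to be assert statements:

```python
import json

def passes_released_tests(problem: dict, completion: str) -> bool:
    """Run the released test cases for one problem against a model completion.

    NOTE: field names are assumed, and exec() runs the code without sandboxing,
    so only use this with completions you trust.
    """
    program = problem["prompt"] + completion + "\n" + "\n".join(problem["test_list"])
    try:
        exec(program, {"__name__": "__main__"})
        return True
    except Exception:
        return False

with open("data/MHPP.jsonl") as f:
    problems = [json.loads(line) for line in f]

# completions: a dict mapping each problem's name to the model's generated code.
# passed = sum(passes_released_tests(p, completions[p["name"]]) for p in problems)
```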

YerbaPage commented 5 months ago

Thanks for your quick reply and help! I think the three released test cases will be sufficient for us.

NTDXYG commented 5 months ago

Could we get access to the complete test cases? In my testing, the pass@1 computed from the three released test cases is much higher than the pass@1 reported in the paper. Having to submit results to the official site for every local experiment seems rather inconvenient.

1e0ndavid commented 5 months ago


Hello, thank you for your interest!

Firstly, if the values are higher than expected, may I ask whether you used the same settings as in the main table, such as a temperature of 0.7? To reduce variance, we raised the temperature and sampled extensively. On code tasks we often observe that greedy decoding does perform better; see Table 5 in the Appendix, which reports the greedy results: there GPT-4's total pass@1 is 53.6%, versus 47.8% in Table 2, so the gap you describe is indeed plausible.
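
For reference, sampled pass@1 is usually computed with the unbiased pass@k estimator from the HumanEval paper rather than by averaging a single run; a minimal sketch (not the authors' exact evaluation script) is:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them passing."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

With greedy decoding there is only one sample per problem (n = 1), so pass@1 reduces to the plain fraction of problems solved, which is one reason greedy and high-temperature sampled scores can differ noticeably.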

Secondly, post-processing of the generated code can also introduce differences: we cannot guarantee that our current scripts extract the code with complete accuracy, so there may be some deviation there as well.

Lastly, would you find it convenient if we set up an automated pipeline where you only need to submit an issue, or make some other HTTP request, and receive the results immediately? If this approach is acceptable, we can give it a try; if it is not, I will need to discuss with the other authors the possibility of releasing more test cases.

NTDXYG commented 5 months ago

I used greedy search during inference. I'd suggest setting up the automated pipeline; that would indeed be convenient. Thanks to the authors!

1e0ndavid commented 5 months ago

Hi, I have built a simple pipeline to automatically evaluate a JSONL file containing the required elements, and I've just updated the README.md. Now, you can follow the instructions in the Quick Start section to evaluate your results. Give it a try!

Hey, I've set up a simple pipeline; give it a try by following the instructions in the README! Feel free to report any issues.
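
Before submitting, a quick sanity check that the results file contains the required fields can save a round trip; a minimal sketch, using the same placeholder field names as above (the Quick Start section defines the actual schema):

```python
import json

# Placeholder field names; replace with the ones listed in the Quick Start section.
REQUIRED_FIELDS = {"task_id", "function_name", "response"}

with open("results.jsonl") as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            print(f"line {i}: missing fields {sorted(missing)}")
```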

NTDXYG commented 5 months ago


Awesome!