FudanSELab / ClassEval

Benchmark ClassEval for class-level code generation.
MIT License

Lower test passing rates compared to original results. #7

Open holen-zhang opened 3 months ago

holen-zhang commented 3 months ago

Hi, thank you very much for sharing this benchmark and all the hard work! I have a question regarding the test passing rates of the generated code. I followed the steps indicated for the evaluation and ran the tests on the predicted code from the dataset, but the passing rates I get are much lower than the reported ones. For example, the class-level test passing rate on the code generated in GPT-4-Turbo_class_H_greedy is only 7% on my side, while the original one is 38%. I wonder whether I need to configure something (e.g., the environment) for the test cases to execute successfully. I would appreciate it if you could shed some light on this :-)
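
For reference, what I am doing is essentially the following (a minimal sketch, assuming the predictions file is a list of records with `task_id`, `predict`, and `test` fields and that each task's tests form a standard `unittest` class named `ClassEvalTest`; these names are illustrative, not necessarily ClassEval's actual schema):

```python
import json
import unittest

# Load the model predictions (file name as provided in the repository).
with open("GPT-4-Turbo_class_H_greedy.json") as f:
    predictions = json.load(f)

passed = 0
for item in predictions:
    namespace = {}
    try:
        # Execute the generated class together with its test code in a
        # fresh namespace, then run the unittest suite for that task.
        exec(item["predict"] + "\n" + item["test"], namespace)  # fields assumed
        suite = unittest.TestLoader().loadTestsFromTestCase(
            namespace["ClassEvalTest"]  # test-class name assumed
        )
        result = unittest.TextTestRunner(verbosity=0).run(suite)
        if result.wasSuccessful():
            passed += 1
    except Exception:
        # Syntax errors or missing third-party imports count as failures;
        # an unconfigured environment would drag the pass rate down here.
        pass

print(f"class-level pass rate: {passed / len(predictions):.2%}")
```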

wkx228 commented 2 months ago

Could you please provide more details (e.g., the model's predicted code or the command-line output)? Also, I'm not sure whether you are evaluating your own generated GPT-4 prediction results or the results we provided. If it's the former, you can try using the GPT-4-Turbo_class_H_greedy.json file in our repository to evaluate and compare the results.
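
Such a comparison could look like the following (a minimal sketch; the field names `task_id` and `predict` and the path `my_gpt4_predictions.json` are illustrative assumptions, not the repository's actual schema):

```python
import json

# The repository's provided predictions, used as the reference.
with open("GPT-4-Turbo_class_H_greedy.json") as f:
    reference = {item["task_id"]: item["predict"] for item in json.load(f)}

# Your own generation results (hypothetical path and schema).
with open("my_gpt4_predictions.json") as f:
    mine = {item["task_id"]: item["predict"] for item in json.load(f)}

# If the predictions match but the pass rates differ, the gap is likely
# in the test environment rather than in the generated code itself.
for task_id in sorted(reference):
    if mine.get(task_id) != reference[task_id]:
        print(f"{task_id}: prediction differs from the provided file")
```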