FudanSELab / ClassEval

Benchmark ClassEval for class-level code generation.

Pass@1 greedy results change whenever I re-evaluate #9

Open sfc-gh-hhan opened 1 month ago

sfc-gh-hhan commented 1 month ago

I'm using this command to evaluate Pass@1:

$ python evaluation.py --source_file_name GPT-4-Turbo_class_H_greedy --eval_data ClassEval_data --greedy 1
{
'class_partial_success': 0.58,
'class_success': 0.37,
'fun_partial_success': 0.8047808764940239,
'fun_success': 0.6613545816733067
}

After a rerun:

{
'class_partial_success': 0.58,
'class_success': 0.36,
'fun_partial_success': 0.8047808764940239,
'fun_success': 0.6593625498007968
}
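
Note: both runs re-score the same generated file (GPT-4-Turbo_class_H_greedy), so within this pair of runs only the test-execution side can vary. With ClassEval's 100 classes, the drop from 0.37 to 0.36 means exactly one class flipped from pass to fail; likely candidates are flaky or order-dependent tests, timeouts, or test data that changed between runs. One way to localize the flip is to have each run dump per-task pass/fail and diff the two dumps. A minimal sketch, assuming each run was adapted to write a {task_id: passed} JSON file (run1.json and run2.json are hypothetical names; evaluation.py does not emit them as-is):

# Diff per-task pass/fail between two evaluation runs.
# run1.json / run2.json are hypothetical dumps of {"task_id": true/false};
# evaluation.py would need a small patch to write them.
import json

with open("run1.json") as f1, open("run2.json") as f2:
    run1, run2 = json.load(f1), json.load(f2)

# Tasks whose status differs between the runs are the source of the
# drifting Pass@1 numbers (flaky tests, timeouts, changed data, ...).
for task_id in sorted(set(run1) | set(run2)):
    a, b = run1.get(task_id), run2.get(task_id)
    if a != b:
        print(f"{task_id}: run1={a} run2={b}")
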
DXY-lemon commented 2 days ago

The issue you observed might be due to two main factors: the recent update to GPT-4 and the ongoing updates to our benchmark, which could result in discrepancies between the current test cases and those used in previous evaluations.
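
If benchmark updates are the cause, it helps to record a fingerprint of the evaluation data alongside each result, so that scores are only compared across runs that used identical test cases. A minimal sketch, assuming the data behind --eval_data ClassEval_data lives in a file named ClassEval_data.json (an assumption; adjust the path to your checkout):

# Fingerprint the benchmark data before a run so results are only compared
# when the underlying test cases match.
# "ClassEval_data.json" is an assumed file name for --eval_data ClassEval_data.
import hashlib

def data_fingerprint(path="ClassEval_data.json"):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]

print("eval data fingerprint:", data_fingerprint())

Pinning the repository to a fixed commit before evaluating serves the same purpose.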