Closed: wasiahmad closed this issue 4 months ago
The paper categorizes the code generation challenges in HumanEval into 7 categories. How was the categorization done? Can you share the category label for each HumanEval problem?

Sure, I have uploaded the category labels for all HumanEval problems; please check the annotations directory.

Regarding the categorization procedure: as mentioned in our paper, we evaluated several models (GPT-4, GPT-3.5, DeepSeekCoder, and WizardCoder) on HumanEval and derived an initial set of categories by analyzing the errors these LLMs made. However, those categories reflect the models' weaknesses rather than the dataset's inherent challenges. Therefore, our team of four annotators revisited the entire HumanEval benchmark to make the categorization as model-agnostic as possible. Each annotator worked independently, and we later discussed and consolidated the results for each problem. While the categorization may not be perfect, we have made every effort to ensure consistency.
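In case it helps others, here is a minimal sketch of how the labels could be loaded and summarized once downloaded. The file name `annotations/humaneval_categories.json` and its structure (a JSON object mapping HumanEval task IDs to category labels) are assumptions for illustration; adjust them to whatever files actually live in the annotations directory.

```python
import json
from collections import Counter

# Hypothetical path and format: a JSON object mapping HumanEval task IDs
# (e.g., "HumanEval/0") to category labels. Adapt to the real layout of
# the annotations directory.
ANNOTATIONS_PATH = "annotations/humaneval_categories.json"

with open(ANNOTATIONS_PATH) as f:
    categories = json.load(f)  # e.g., {"HumanEval/0": "some-category", ...}

# Look up the label for a single problem.
print(categories.get("HumanEval/0"))

# Count how many problems fall into each of the 7 categories.
for label, count in Counter(categories.values()).most_common():
    print(f"{label}: {count}")
```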