Hi Team,
I was using the bigcode-evaluation-harness to evaluate Go generations on the MultiPL-E dataset and found that every evaluation produced the output `? command-line-arguments [no test files]`, even though `status_code = 0` (`go test` exits 0 when it finds no test files, so the failure is silent).
On debugging further, it looks like we set `self.language` here instead of `prompt_name['language']` in the problem dict that is passed to execution downstream. When the language is then checked in the evaluators here, the generated program is written to a file without the `_test.go` suffix, so `go test` does not detect any test files.
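For context, here is a minimal sketch of the suspected failure mode, assuming the evaluators use the problem's `language` field as the file suffix when writing the program to disk. The `write_program` helper and the `"go_test.go"` value are illustrative assumptions, not the harness's actual code:

```python
# Minimal sketch of the suspected failure mode. The write_program helper and
# the "go_test.go" language value are assumptions for illustration; they are
# not the harness's actual implementation.

def write_program(name: str, language: str, program: str) -> str:
    # Assumption: the evaluator uses the problem's `language` field as the
    # file suffix when writing the candidate program to disk.
    path = f"{name}.{language}"
    with open(path, "w") as f:
        f.write(program)
    return path

program = "package strlen_test\n\nfunc strlen(s string) int { return len(s) }\n"

# Language taken from the problem dict (prompt_name['language']): the value
# already ends in `_test.go`, so `go test` discovers the file.
print(write_program("HumanEval_23_strlen", "go_test.go", program))
# -> HumanEval_23_strlen.go_test.go

# Suspected bug: the task stamps the problem with self.language ("go")
# instead, the `_test.go` suffix is lost, and `go test` reports
# "? command-line-arguments [no test files]" while still exiting 0.
print(write_program("HumanEval_23_strlen", "go", program))
# -> HumanEval_23_strlen.go
```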
To make this easy to reproduce, I have added a video below that evaluates a single Go generation test case (the generation was produced with DeepSeek Coder):
generations_go_example.json
```json
[
  [
    "package strlen_test\n\nimport (\n \"testing\"\n \"fmt\"\n)\n\n// Return length of given string\n// >>> strlen(\"\")\n// 0\n// >>> strlen(\"abc\")\n// 3\nfunc strlen(myString string) int {\n return len(myString)\n}\n"
  ]
]
```
https://github.com/bigcode-project/bigcode-evaluation-harness/assets/20701220/c57dd498-b7f8-488a-a842-f9eb405f1f0d