bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Execution-based FIM evaluation #33

Closed — arjunguha closed this issue 1 year ago

arjunguha commented 1 year ago

The SantaCoder FIM evaluation with MultiPL-E uses exact match. We should also execute the generated code. The dataset is here:

https://huggingface.co/datasets/bigcode/santacoder-fim-task

All that is needed is to execute item['prefix'] + generated_solution + item['suffix'] + item['tests'].

I recommend supporting n samples per item.
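A minimal sketch of what this could look like, assuming the dataset rows expose 'prefix', 'suffix', and 'tests' as plain strings and that the assembled program is Python (the field names come from the issue; the function names and the subprocess-based execution strategy are illustrative, not the harness's actual API):

```python
import os
import subprocess
import sys
import tempfile

def run_fim_sample(item: dict, generated_solution: str, timeout: float = 10.0) -> bool:
    """Assemble prefix + infill + suffix + tests into one program and run it.

    Returns True iff the program exits with status 0, i.e. all asserts in
    item['tests'] passed. Hypothetical helper, not part of the harness.
    """
    program = item["prefix"] + generated_solution + item["suffix"] + item["tests"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def pass_rate(item: dict, samples: list) -> float:
    """Execution-based score over n generated samples for a single item."""
    if not samples:
        return 0.0
    return sum(run_fim_sample(item, s) for s in samples) / len(samples)
```

With n samples per item, `pass_rate` gives the fraction that execute successfully; in practice the generated code would come from an untrusted model, so a production version should sandbox execution rather than run it directly as above.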