bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Execution-based FIM evaluation #33

Closed — arjunguha closed this issue 1 year ago

arjunguha commented 1 year ago

The SantaCoder FIM evaluation with MultiPL-E uses exact match. We should also execute the generated code. The dataset is here:

https://huggingface.co/datasets/bigcode/santacoder-fim-task

All that is needed is to execute item['prefix'] + generated_solution + item['suffix'] + item['tests'].

I recommend supporting n samples per item.
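A minimal sketch of what this could look like, assuming the dataset rows expose 'prefix', 'suffix', and 'tests' as plain strings and that the assembled program is Python (the field names come from the issue; the function names and the subprocess-based execution strategy are illustrative, not the harness's actual API):

```python
import os
import subprocess
import sys
import tempfile

def run_fim_sample(item: dict, generated_solution: str, timeout: float = 10.0) -> bool:
    """Assemble prefix + infill + suffix + tests into one program and run it.

    Returns True iff the program exits with status 0, i.e. all asserts in
    item['tests'] passed. Hypothetical helper, not part of the harness.
    """
    program = item["prefix"] + generated_solution + item["suffix"] + item["tests"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def pass_rate(item: dict, samples: list) -> float:
    """Execution-based score over n generated samples for a single item."""
    if not samples:
        return 0.0
    return sum(run_fim_sample(item, s) for s in samples) / len(samples)
```

With n samples per item, `pass_rate` gives the fraction that execute successfully; in practice the generated code would come from an untrusted model, so a production version should sandbox execution rather than run it directly as above.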