bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Program repair #64

Open · keyboardAnt opened this issue 1 year ago

andre15silva commented 1 year ago

Nice to see program repair here! :)

Quick question: Any reason why you consider PyPiBugs only? The dataset contains only 7 types of rather simple bugs.

I would argue that benchmarks like https://github.com/rjust/defects4j (the "gold standard" in academia until now) or https://github.com/giganticode/run_bug_run (recent multi-language executable benchmark) are more interesting and challenging. WDYT?

keyboardAnt commented 1 year ago

> Nice to see program repair here! :)
>
> Quick question: Any reason why you consider PyPiBugs only? The dataset contains only 7 types of rather simple bugs.
>
> I would argue that benchmarks like https://github.com/rjust/defects4j (the "gold standard" in academia until now) or https://github.com/giganticode/run_bug_run (recent multi-language executable benchmark) are more interesting and challenging. WDYT?

@andre15silva, thanks for the question. The suggested task is not tied to PyPiBugs; it could be used with any similar dataset.
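
For illustration, here is a minimal sketch of how such a task could plug into the harness. It assumes the harness's `Task` base class (as in `bigcode_eval/base.py`, with its `stop_words`/`requires_execution` constructor); the dataset path and the `buggy_code`/`fixed_code` field names below are placeholders, not the actual integration.

```python
# Sketch of a dataset-agnostic program repair task (not the actual PR code).
from bigcode_eval.base import Task


class ProgramRepair(Task):
    # Any HF dataset with (buggy, fixed) code pairs could be dropped in here.
    DATASET_PATH = "placeholder-org/placeholder-repair-dataset"  # placeholder id

    def __init__(self):
        super().__init__(
            stop_words=["# Buggy code:", "\nclass ", "\ndef "],
            requires_execution=False,  # True for executable benchmarks such as RunBugRun
        )

    def get_dataset(self):
        return self.dataset["test"]

    def get_prompt(self, doc):
        # Ask the model to rewrite the buggy snippet.
        return f"# Buggy code:\n{doc['buggy_code']}\n# Fixed code:\n"

    def get_reference(self, doc):
        return doc["fixed_code"]

    def postprocess_generation(self, generation, idx):
        # Keep only the text generated after the prompt.
        prompt = self.get_prompt(self.get_dataset()[idx])
        return generation[len(prompt):].strip()

    def process_results(self, generations, references):
        # Exact match is the simplest metric; an executable benchmark would
        # run the dataset's tests here instead.
        hits = sum(
            any(gen == ref.strip() for gen in gens)
            for gens, ref in zip(generations, references)
        )
        return {"exact_match": hits / len(references)}
```

Only `get_prompt`, `get_reference`, and the metric depend on the dataset's schema, which is why swapping in PyPiBugs, Defects4J, or RunBugRun is mostly a matter of configuration.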

keyboardAnt commented 1 year ago

> Thanks for these contributions Nadav! I left some high-level comments; once they are addressed, we can go through the benchmark details. Could you also explain how this benchmark works? You said it's inspired by Carper's and MSFT's benchmarks; does it follow the same implementation as Carper's? If not, what's different, and how did you build the dataset? We usually port benchmarks from existing repositories/papers.

Thank you for your questions, @loubnabnl. This task indeed mirrors Carper's and MSFT's benchmarks. It is integrated into our existing codebase and uses a pre-existing dataset.
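
If it helps the review, the same class could also be parameterized over datasets, mirroring the `create_all_tasks()` factory pattern used by other task modules in the harness. The dataset ids below are illustrative placeholders, and `ProgramRepair` refers to the sketch in the earlier comment.

```python
# Hypothetical: one repair task variant per dataset.
def create_task(dataset_path):
    class _RepairTask(ProgramRepair):  # ProgramRepair from the sketch above
        DATASET_PATH = dataset_path
    return _RepairTask


def create_all_tasks():
    datasets = {
        "pypibugs": "placeholder/pypibugs",       # illustrative HF dataset ids
        "runbugrun": "placeholder/run_bug_run",
    }
    return {
        f"program_repair-{name}": create_task(path)
        for name, path in datasets.items()
    }
```

A run would then select one variant through the harness's usual `--tasks` flag, e.g. `--tasks program_repair-pypibugs`.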