Here are my two cents regarding good code generation testbeds:
I have a code completion dataset of 1.3 million Java methods, each with a missing expression (no documentation; the goal is to generate the missing expression given the rest of the method).
However, the test set will probably be in GPT-Neo's training set. If there is another Java project that we know does not exist in The Pile, we can create a test set from it (the code that creates code completion examples from raw code is publicly available).
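To make the setup concrete, here is a tiny, hypothetical sketch of turning a raw Java method into a (context, target) completion pair by masking one expression. The publicly available example-creation code may well do this differently; `make_example` and the regex heuristic are made up for illustration.

```python
import re

def make_example(java_method: str):
    """Hypothetical: mask the right-hand side of the first assignment
    to create a (context, target) completion pair."""
    match = re.search(r"=\s*(.+?);", java_method)
    if match is None:
        return None
    target = match.group(1)
    context = java_method[:match.start(1)] + "<MISSING>" + java_method[match.end(1):]
    return {"context": context, "target": target}

method = """
public int area(int w, int h) {
    int result = w * h;
    return result;
}
"""
print(make_example(method))
# target is 'w * h'; context is the method with '<MISSING>' in its place
```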
As of now, based on our spreadsheet, we'll measure the following as Extrinsic Metrics.
Please suggest anything else we should add; I'll then add them to utils.
As a first step, I have added the base extrinsic metrics in the eval_metrics branch.
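For concreteness, here is a minimal sketch of what such base extrinsic metrics could look like, assuming exact match and corpus BLEU are among them; the actual code in the eval_metrics branch may differ.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def exact_match(predictions, references):
    """Fraction of predictions that match the reference token-for-token."""
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return hits / len(references)

def bleu(predictions, references):
    """Corpus-level BLEU over whitespace-tokenized code."""
    hyps = [p.split() for p in predictions]
    refs = [[r.split()] for r in references]  # one reference per example
    return corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)

preds = ["return a + b ;", "x = y * 2 ;"]
refs  = ["return a + b ;", "x = y + 2 ;"]
print(exact_match(preds, refs), bleu(preds, refs))
```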
CodeBLEU will be added eventually.
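For reference, CodeBLEU (Ren et al. 2020) is a weighted combination of n-gram BLEU, keyword-weighted n-gram match, AST match, and data-flow match. Below is a sketch of just that final combination, assuming the four component scores in [0, 1] are computed elsewhere (e.g. with the reference CodeXGLUE implementation); the default equal weights are from the paper.

```python
def code_bleu(ngram, weighted_ngram, ast_match, dataflow_match,
              alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    """Weighted combination used by CodeBLEU (Ren et al. 2020).
    Each component score is assumed to already lie in [0, 1]."""
    return (alpha * ngram
            + beta * weighted_ngram
            + gamma * ast_match
            + delta * dataflow_match)

# Example: strong surface overlap but weaker structural match.
print(code_bleu(0.62, 0.65, 0.48, 0.40))  # 0.5375
```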
How frequently do you think we should compute these metrics?
I'm also interested in learning dynamics and in performing intrinsic evaluation, which would help us understand how the model learns semantics and syntax as training progresses. I'm not sure that can be done within the time frame, but I'll try my best. This might also help us address limitations like memorization.
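One intrinsic measurement that is easy to start with is held-out perplexity tracked across checkpoints. A minimal sketch using Hugging Face Transformers, assuming causal-LM checkpoints are saved during training; the checkpoint paths and the tiny held-out snippet are placeholders, and per-sequence losses are averaged for simplicity.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, texts, device="cpu"):
    """Perplexity of `texts` under `model`, averaging per-sequence losses."""
    model.eval()
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Placeholder checkpoint paths; swap in the real ones saved during training.
for ckpt in ["checkpoints/step-1000", "checkpoints/step-5000"]:
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    print(ckpt, perplexity(model, tokenizer, ["int x = y + 1;"]))
```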
OpenAI has published the article behind Copilot (Chen et al. 2021), in which they introduce HumanEval as a benchmark.
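Their pass@k metric would be worth adopting here: for each problem, sample n >= k completions, count the number c that pass the unit tests, and use the unbiased estimator pass@k = 1 - C(n-c, k)/C(n, k). A sketch of that estimator in the numerically stable form, essentially as given in the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total completions sampled for the problem
    c: completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # any k-subset is guaranteed to contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples per problem, 12 of which pass the tests
print(pass_at_k(n=200, c=12, k=1))    # 0.06
print(pass_at_k(n=200, c=12, k=100))  # close to 1
```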