Here are my two cents regarding good code generation testbeds:
I have a code completion dataset of 1.3 million Java methods, each with a missing expression (no documentation; the goal is to generate the missing expression given the rest of the method).
However, the test set will probably be in GPT-Neo's training set. If there is another Java project that we know does not exist in The Pile, we can create a test set from it (the code that creates code completion examples from raw code is publicly available).
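To make the setup concrete, here is a tiny, hypothetical sketch of turning a raw Java method into a (context, target) completion pair by masking one expression. The publicly available example-creation code may well do this differently; `make_example` and the regex heuristic are made up for illustration.

```python
import re

def make_example(java_method: str):
    """Hypothetical: mask the right-hand side of the first assignment
    to create a (context, target) completion pair."""
    match = re.search(r"=\s*(.+?);", java_method)
    if match is None:
        return None
    target = match.group(1)
    context = java_method[:match.start(1)] + "<MISSING>" + java_method[match.end(1):]
    return {"context": context, "target": target}

method = """
public int area(int w, int h) {
    int result = w * h;
    return result;
}
"""
print(make_example(method))
# target is 'w * h'; context is the method with '<MISSING>' in its place
```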
As of now, based on our spreadsheet, we'll measure the following as Extrinsic Metrics.
Please suggest anything else we should add; I'll then add them to utils.
As a first step, I have added the base extrinsic metrics in the eval_metrics branch.
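For concreteness, here is a minimal sketch of what such base extrinsic metrics could look like, assuming exact match and corpus BLEU are among them; the actual code in the eval_metrics branch may differ.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def exact_match(predictions, references):
    """Fraction of predictions that match the reference token-for-token."""
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return hits / len(references)

def bleu(predictions, references):
    """Corpus-level BLEU over whitespace-tokenized code."""
    hyps = [p.split() for p in predictions]
    refs = [[r.split()] for r in references]  # one reference per example
    return corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)

preds = ["return a + b ;", "x = y * 2 ;"]
refs  = ["return a + b ;", "x = y + 2 ;"]
print(exact_match(preds, refs), bleu(preds, refs))
```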
CodeBLEU will be added eventually.
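For reference, CodeBLEU (Ren et al. 2020) is a weighted combination of n-gram BLEU, keyword-weighted n-gram match, AST match, and data-flow match. Below is a sketch of just that final combination, assuming the four component scores in [0, 1] are computed elsewhere (e.g. with the reference CodeXGLUE implementation); the default equal weights are from the paper.

```python
def code_bleu(ngram, weighted_ngram, ast_match, dataflow_match,
              alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    """Weighted combination used by CodeBLEU (Ren et al. 2020).
    Each component score is assumed to already lie in [0, 1]."""
    return (alpha * ngram
            + beta * weighted_ngram
            + gamma * ast_match
            + delta * dataflow_match)

# Example: strong surface overlap but weaker structural match.
print(code_bleu(0.62, 0.65, 0.48, 0.40))  # 0.5375
```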
How frequently do you think we should compute these metrics?
I'm also interested in learning dynamics and in performing intrinsic evaluation, which would help us understand how the model learns semantics and syntax as training progresses. I'm not sure that can be done within the time frame, but I'll try my best. This might also help us address limitations like memorization.
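One intrinsic measurement that is easy to start with is held-out perplexity tracked across checkpoints. A minimal sketch using Hugging Face Transformers, assuming causal-LM checkpoints are saved during training; the checkpoint paths and the tiny held-out snippet are placeholders, and per-sequence losses are averaged for simplicity.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, texts, device="cpu"):
    """Perplexity of `texts` under `model`, averaging per-sequence losses."""
    model.eval()
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Placeholder checkpoint paths; swap in the real ones saved during training.
for ckpt in ["checkpoints/step-1000", "checkpoints/step-5000"]:
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    print(ckpt, perplexity(model, tokenizer, ["int x = y + 1;"]))
```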
OpenAI has published the article behind Copilot (Chen et al. 2021), in which they introduce HumanEval as a benchmark.
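Their pass@k metric would be worth adopting here: for each problem, sample n >= k completions, count the number c that pass the unit tests, and use the unbiased estimator pass@k = 1 - C(n-c, k)/C(n, k). A sketch of that estimator in the numerically stable form, essentially as given in the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total completions sampled for the problem
    c: completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # any k-subset is guaranteed to contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples per problem, 12 of which pass the tests
print(pass_at_k(n=200, c=12, k=1))    # 0.06
print(pass_at_k(n=200, c=12, k=100))  # close to 1
```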