CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
Apache License 2.0

**Code Model Evaluation** #1

Closed: ncoop57 closed this issue 3 years ago

ncoop57 commented 3 years ago
neubig commented 3 years ago

Here are my two cents regarding good code generation testbeds:

urialon commented 3 years ago

I have a code completion dataset of 1.3 million Java methods, each with a missing expression (no documentation; the goal is to generate the missing expression given the rest of the method).

However, the test set will probably be in GPT-Neo's training set. If there is another Java project that we know does not appear in The Pile, we can create a test set from it (the code that creates code completion examples from raw code is publicly available).

https://github.com/tech-srl/slm-code-generation
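
To make the task shape concrete, here is a purely illustrative toy example (my own format, not the one actually produced by the linked repository): the model sees the method with one expression blanked out and must generate that expression.

```python
# Toy illustration only; the real dataset format lives in tech-srl/slm-code-generation.
example = {
    "context": (
        "public int max(int a, int b) {\n"
        "    if (<MISSING_EXPRESSION>) {\n"
        "        return a;\n"
        "    }\n"
        "    return b;\n"
        "}"
    ),
    "target": "a > b",
}
```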

reshinthadithyan commented 3 years ago

As of now, based on our spreadsheet, we'll measure the following as extrinsic metrics:

  1. CodeBLEU
  2. Parsable Nature of the Generated Code
  3. BLEU-4
  4. Exact Match

Please suggest anything else we should add. I'll add these to utils.
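
For the simpler metrics on that list, a minimal sketch of what could go into utils (the helper names and the choice of NLTK are my own assumptions, not the actual branch contents, and the parsability check assumes Python targets):

```python
import ast

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def exact_match(prediction: str, reference: str) -> bool:
    """Whitespace-normalized exact match between generated and reference code."""
    return " ".join(prediction.split()) == " ".join(reference.split())


def bleu4(prediction: str, reference: str) -> float:
    """Token-level BLEU-4 with smoothing, splitting code on whitespace."""
    return sentence_bleu(
        [reference.split()],
        prediction.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )


def is_parsable(prediction: str) -> bool:
    """Whether the generated snippet parses as valid Python source."""
    try:
        ast.parse(prediction)
    except SyntaxError:
        return False
    return True
```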

reshinthadithyan commented 3 years ago

As a first step, I have added the base extrinsic metrics in the eval_metrics branch. CodeBLEU will be added eventually.

How often do you think we should compute these metrics?

I'm also interested in learning dynamics and in performing intrinsic evaluation, which would help us understand how the model learns semantics and syntax as training progresses. I'm not sure that can be done within the time frame, but I'll try my best. It might also help us address limitations such as memorization.
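
On frequency: one straightforward option is to run the extrinsic metrics on a held-out prompt set every fixed number of training steps. A sketch, assuming a hypothetical `generate` callable and metric helpers like the ones above (the interval is a placeholder, since choosing it is exactly the open question):

```python
from typing import Callable, Dict, List, Optional

EVAL_EVERY = 1_000  # placeholder interval; the right value is what's being discussed


def maybe_evaluate(
    step: int,
    prompts: List[str],
    references: List[str],
    generate: Callable[[str], str],
    metrics: Dict[str, Callable[[str, str], float]],
) -> Optional[Dict[str, float]]:
    """Every EVAL_EVERY steps, average each metric over the held-out prompt set."""
    if step % EVAL_EVERY != 0:
        return None
    predictions = [generate(p) for p in prompts]
    return {
        name: sum(fn(pred, ref) for pred, ref in zip(predictions, references)) / len(predictions)
        for name, fn in metrics.items()
    }
```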

shpotes commented 3 years ago

OpenAI has published the paper behind Copilot (Chen et al., 2021), in which they introduce HumanEval as a benchmark.
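
For context, HumanEval scores functional correctness with pass@k rather than text similarity: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples passes. Chen et al. (2021) give a numerically stable unbiased estimator, roughly:

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```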