dpfried / incoder

Generative model for code infilling and synthesis

Running on Human-Eval #6

Open SpyrosMouselinos opened 2 years ago

SpyrosMouselinos commented 2 years ago

Hello, I am trying to reproduce the results of your model on the HumanEval dataset and so far I am getting lower-than-expected performance. To make everything clearer, here is my setup:

- I load the large 6B model from Hugging Face in its fp16 version.
- I load the HumanEval dataset and add a BOS = "<|endoftext|>" at the beginning of each code example.
- I use the text-generation pipeline at p=0.95 and temp=0.8, creating 100 completions.
- The post-processing is relatively simple: I just look for typical \ndef or \n\n tokens to stop generation and get a clean piece of code to pass to the HumanEval evaluation.

Is there a different procedure that the InCoder model uses on the HumanEval dataset? Were the published results obtained with the full 32-bit weights or a different input format?

dpfried commented 2 years ago

Hi, thanks for your interest. I'm working on cleaning up our human-eval code a bit -- I'll check it in soon. But in the meantime:

> I load the large 6B model from Hugging Face in its fp16 version.

This should be fine - I used fp16 in verification experiments.
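For reference, here is a minimal fp16 loading sketch (assuming the Hugging Face model id facebook/incoder-6B and a CUDA device; this isn't the exact code from our harness):

```python
# Minimal sketch: load InCoder-6B in half precision with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/incoder-6B",
    torch_dtype=torch.float16,  # fp16 weights, as discussed above
)
model = model.cuda().eval()
```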

> I load the HumanEval dataset and add a BOS = "<|endoftext|>" at the beginning of each code example.

The HuggingFace tokenizer should prepend this automatically when encoding text. You can verify this by calling tokenizer.encode(doc) and checking to see that the first ID in the sequence is 2.
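For example, a quick sanity check (arbitrary prompt, reusing the tokenizer loaded above):

```python
# Verify that BOS ("<|endoftext|>", ID 2) is prepended automatically by the tokenizer.
ids = tokenizer.encode("def add(a, b):\n")
print(ids[:3])
assert ids[0] == 2, f"expected BOS id 2 at position 0, got {ids[0]}"
```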

> I use the text-generation pipeline at p=0.95 and temp=0.8, creating 100 completions.

This could make a difference -- our pass@1 scores reported in the paper used temp=0.2, while pass@10 and pass@100 used temp=0.8 (following what Chen et al. did in the codex paper).
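As a sketch of that sampling setup (continuing the loading example above; `prompt` is a hypothetical HumanEval prompt string, and `max_new_tokens` is an assumed budget, not a value from the paper):

```python
# Sampling sketch: top-p 0.95 with temperature 0.2 for pass@1 (use 0.8 for pass@10/100).
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.2,          # switch to 0.8 when estimating pass@10 / pass@100
    max_new_tokens=256,       # assumed completion budget
    num_return_sequences=20,  # repeat to collect 200 samples per problem
    pad_token_id=tokenizer.eos_token_id,
)
prompt_len = inputs["input_ids"].shape[1]
completions = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
```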

> The post-processing is relatively simple: I just look for typical \ndef or \n\n tokens to stop generation and get a clean piece of code to pass to the HumanEval evaluation.

Our stop tokens are

```python
HUMAN_EVAL_STOP_WORDS = ["\nclass", "\ndef", "\n#", "\nif"]
```

i.e., class, def, a comment, or if at the beginning of a line (we would have used print as well, following Chen et al., but the Codex API can only handle 4 stop words, so we used the same set across all experiments for compatibility).
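Post-processing is then just truncation at the first occurrence of any stop word; a sketch (not necessarily the exact harness code):

```python
# Truncate a sampled completion at the earliest stop word, if any occurs.
HUMAN_EVAL_STOP_WORDS = ["\nclass", "\ndef", "\n#", "\nif"]

def truncate_at_stop_words(completion, stop_words=HUMAN_EVAL_STOP_WORDS):
    cut = len(completion)
    for stop in stop_words:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```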

I should note too that all the experiments in the paper used a fairseq implementation of our model, but I checked that the scores are similar with the HF-converted version of InCoder 6B on HumanEval left-to-right generation (I haven't integrated infilling with HF into our eval harness yet): 15 pass@1 for both the fairseq and HF versions of the 6B model. Code for this coming soon!

dpfried commented 2 years ago

I've now checked the HumanEval code in at https://github.com/dpfried/incoder/tree/main/evaluation. Please let me know if you try it and run into issues or are still unable to replicate!

SpyrosMouselinos commented 2 years ago

Thanks for the quick response! I was referring to the pass@100 metric in my experiments. I use the text-generation pipeline from the transformers library, which behaves slightly differently from the typical encode --> model.generate() --> decode procedure you use.

It can be broken down to:

Have you seen behavior like this in your experiments? How did you handle random seeds, since you run a batch of 20 generations 10 times for a total of 200 generations per problem?

dpfried commented 2 years ago

Thanks for the info! I'm running replication experiments now with temperature 0.8 to get the pass@100 scores for the HF version of the model. How big of a gap between our reported pass@100 scores and yours are you seeing?

We didn't set the random seed in our experiments, so every sampled generation (of the 200) should be generated independently.
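(For reference, the pass@k numbers discussed here use the unbiased estimator from Chen et al.; a minimal sketch:)

```python
# Unbiased pass@k estimator from Chen et al.: n samples per problem, c of them correct.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. with 200 samples and only 2 passing, the pass@100 estimate is still about 0.75:
# pass_at_k(200, 2, 100) ≈ 0.751
```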

There may be some differences between the inference procedure I used (in the code I checked in) and the text generation pipeline. The one that I'm aware of is that the generation pipeline doesn't prepend BOS (https://github.com/dpfried/incoder/issues/3#issuecomment-1120254506), but it sounds like you're accounting for that already.
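If you stick with the pipeline, prepending BOS yourself should work; a sketch (reusing the fp16 model and tokenizer loaded above, with `prompt` again a hypothetical HumanEval prompt string):

```python
# Sketch: text-generation pipeline with BOS prepended manually, since the
# pipeline does not add it for you (see the linked comment).
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
samples = generator(
    "<|endoftext|>" + prompt,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    max_new_tokens=256,        # assumed completion budget
    num_return_sequences=10,
    return_full_text=False,    # keep only the generated continuation
)
```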

SpyrosMouselinos commented 2 years ago

With a fixed seed (seed = 1 for both the torch and random libs), and using num_generations=10 (repeated 20 times in a loop), I seem to get around 35% pass@100. I think fixing the seed might be limiting the diversity of the generations. Let me know if you find any discrepancies in the HF version, and thanks for taking the time!

dpfried commented 2 years ago

Thanks, yeah that does seem plausible - you may be getting only 10 distinct candidates.
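To illustrate with a toy example (a categorical distribution standing in for the model's sampler, not the actual generation code):

```python
# Re-seeding before every batch makes each batch identical, so 20 batches of
# 10 samples collapse to only 10 distinct candidates overall.
import torch

def sample_batch(seed=1, num_samples=10):
    torch.manual_seed(seed)
    return torch.multinomial(torch.ones(50), num_samples, replacement=True)

batches = [sample_batch() for _ in range(20)]
assert all(torch.equal(b, batches[0]) for b in batches)  # every batch is the same
```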

I'll report back once I have results!
