improve the prompt examples of one-shot setting in APPS evaluation

bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Apache License 2.0

815 stars 217 forks source link

improve the prompt examples of one-shot setting in APPS evaluation #8

Closed loubnabnl closed 1 year ago

loubnabnl commented 2 years ago

Models are usually evaluated on APPS after fine-tuning on the train split, but one can also do few-shot evaluation. It is already implemented in this evaluation harness: the prompt includes two shortened examples from the train split one for each call type (Standard Input and Call based).

We want to improve these examples:

analyse the different question types of APPS and build 2 or 3 examples to cover these types (make sure they aren't in the test set)
see how models behave given different examples (you can play with the model demos/spaces in this org there's codeparrot, incoder and codegen)
the prompt shouldn't end up being too long

hajipour commented 2 years ago

There are different levels of problem in the APPS dataset (e.g., introductory and interview levels). Do we want a particular prompt for each level, or do we want a level-agnostic prompt? Maybe both would be interesting.

loubnabnl commented 2 years ago

Hi, you can try both and see if it makes a difference, but I think the difference might depend on how the question is formulated rather than the difficulty level, question format can be different based on the source of the file (codeforces, hackerank..) even for similar difficulty levels.

Any experiment are welcome to see what generalizes better.

giulio98 commented 1 year ago

Hello, regarding this I have run a couple of experiments on CodeGen-mono and multi after they have been finetuned on the train split of this dataset, I did the evaluation on the first half of test split(400 samples) on different scenarios:

zero-shot
few-shot: where two examples were taken randomly from the second half of the test. What I noticed though is a drop in the performance for pass@1 and pass@10 metric in the few-shot setting could it be due to the fact that I randomly selected the examples to put in the context? or the model I'm using (350M) is too "small" to observe few-shot capability?

loubnabnl commented 1 year ago

I think it could be either of the two possibilities, but few shot examples should be chosen carefully as well as the way you embed them in the prompt so that they help the model understand the context and task.

You can try building some few shot prompts and test how the model behaves on them before considering using them for evaluation. For easy experimentation you can use this demo for codegen-2B for example, or build one for codegen-350M-multi with Gradio.