There are different difficulty levels of problems in the APPS dataset (e.g., introductory and interview). Do we want a particular prompt for each level, or a level-agnostic prompt? Maybe both would be interesting.
Hi, you can try both and see if it makes a difference, but I think the difference might depend more on how the question is formulated than on the difficulty level: the question format can vary with the source of the problem (Codeforces, HackerRank, ...) even for similar difficulty levels.
Any experiments are welcome to see what generalizes better.
Hello, regarding this I have run a couple of experiments on CodeGen-mono and CodeGen-multi after fine-tuning them on the train split of this dataset. I did the evaluation on the first half of the test split (400 samples) under different scenarios:
I think it could be either of the two possibilities, but the few-shot examples should be chosen carefully, as should the way you embed them in the prompt, so that they help the model understand the context and the task.
You can try building some few-shot prompts and test how the model behaves on them before using them for evaluation. For easy experimentation you can use this demo for codegen-2B, for example, or build one for codegen-350M-multi with Gradio (a minimal sketch follows).
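Here is a minimal sketch of such a Gradio demo, assuming the `Salesforce/codegen-350M-multi` checkpoint on the Hugging Face Hub; the generation settings are illustrative placeholders, not the harness defaults:

```python
# Minimal Gradio demo for experimenting with few-shot prompts on
# codegen-350M-multi. Checkpoint name and sampling settings are
# assumptions for illustration, not values used by the harness.
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-350M-multi"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def generate(prompt, max_new_tokens=128, temperature=0.2):
    # Tokenize the (possibly few-shot) prompt and generate a completion.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=int(max_new_tokens),
            do_sample=temperature > 0,
            temperature=float(temperature),
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(lines=8, label="Prompt"),
        gr.Slider(16, 512, value=128, step=16, label="Max new tokens"),
        gr.Slider(0.0, 1.0, value=0.2, step=0.05, label="Temperature"),
    ],
    outputs=gr.Textbox(label="Completion"),
)

demo.launch()
```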
Models are usually evaluated on APPS after fine-tuning on the train split, but one can also do few-shot evaluation. It is already implemented in this evaluation harness: the prompt includes two shortened examples from the train split, one for each call type (Standard Input and Call-Based); a sketch of this construction is below.
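For illustration, a rough sketch of that two-shot prompt construction, assuming the `codeparrot/apps` dataset on the Hub; the example strings and the `few_shot_prompt` helper are hypothetical placeholders, not the harness's actual code:

```python
# Sketch of a two-shot APPS prompt: one shortened solved example per
# call type is prepended before the new problem. The example texts
# below are placeholders for truncated train-split problems/solutions.
from datasets import load_dataset

STD_INPUT_EXAMPLE = (
    "QUESTION:\n<shortened standard-input problem>\n"
    "ANSWER:\n<solution reading from stdin>"
)
CALL_BASED_EXAMPLE = (
    "QUESTION:\n<shortened call-based problem>\n"
    "ANSWER:\n<solution implementing the required function>"
)

def few_shot_prompt(problem_description: str) -> str:
    """Prepend both shortened examples, then the new problem."""
    return (
        f"{STD_INPUT_EXAMPLE}\n\n{CALL_BASED_EXAMPLE}\n\n"
        f"QUESTION:\n{problem_description}\nANSWER:\n"
    )

test_split = load_dataset("codeparrot/apps", split="test")
print(few_shot_prompt(test_split[0]["question"]))
```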
We want to improve these examples: