bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124

Using code metric 'code_eval_octopack' instead of original 'code_eval' #16

JunHyungKang opened this issue 1 year ago

JunHyungKang commented 1 year ago

1. Is this feature solely for multi-language support? When I run the results through 'code_eval' in the original humaneval.py, I only achieve a 'pass@1' score of about 36%.

2. Are there any other considerations? Is it fair to add an import helper?
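
For reference, this is a minimal sketch of the kind of pass@1 computation I mean, using the standard code_eval metric from the HuggingFace evaluate library (the single problem and the candidate completions below are just toy placeholders, not actual HumanEval data):

```python
import os

from evaluate import load

# code_eval executes untrusted model-generated code; it refuses to run
# unless this environment variable is set explicitly.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = load("code_eval")

# Toy example: one HumanEval-style problem with two candidate completions.
test_cases = ["assert truncate_number(3.5) == 0.5"]
candidates = [[
    "def truncate_number(number: float) -> float:\n    return number - int(number)\n",
    "def truncate_number(number: float) -> float:\n    return int(number)\n",
]]

pass_at_k, results = code_eval.compute(
    references=test_cases,
    predictions=candidates,
    k=[1],
)
print(pass_at_k)  # one of the two candidates passes, e.g. {'pass@1': 0.5}
```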

Muennighoff commented 1 year ago

1. Yes, it is solely for multi-language support. The reason you get only about 36% is that the normal HumanEval does not use our prompting format (no Question & Answer like during the instruction tuning), so the model then tries to add them, e.g. "def truncate_number(number: float) -> float:\n \"\"\" Given a positive floating point number, it can be decomposed into\n and integer part (largest integer smaller than given number) and decimals\n (leftover part always smaller than 1).\n\n Return the decimal part of the number.\n >>> truncate_number(3.5)\n 0.5\n \"\"\"\n return number - int(number)\n\n\nAnswer: \"\"\"\nWrite a function that takes a positive floating point number as input and\nreturns the decimal part of the number.\n\nFor example, given the number 3.5, the function should return 0.5.\n\nNote: The input number can be a negative number or zero.\n\nAnswer: import math\n\n". In the normal HumanEval this leads to a syntax error. In HumanEvalSynthesize a) the prompting format is aligned & b) the postprocessing is cleaner, such that in the above example the trailing "Answer: ..." part would be cut off from the generation (it cuts off everything after a function is finished) and there would be no syntax error.

I think these are both fair, as when using the model it is a) trivial to use the correct prompting format & b) simple to remove trailing content that is not needed (see the sketch below).
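
To make a) and b) concrete, here is a rough sketch; the exact Question/Answer template and truncation rules used by HumanEvalSynthesize may differ, so the strings and the dedent heuristic below are illustrative assumptions, not the harness code itself:

```python
# Illustrative sketch only: the real prompt template and postprocessing in the
# evaluation harness may differ from what is shown here.

QUESTION_PREFIX = "Question: "   # assumed instruction-tuning style template
ANSWER_PREFIX = "\n\nAnswer:\n"

def build_prompt(instruction: str, function_start: str) -> str:
    """Wrap the task in the Question/Answer format the model was tuned on."""
    return f"{QUESTION_PREFIX}{instruction}{ANSWER_PREFIX}{function_start}"

def truncate_after_function(generation: str) -> str:
    """Cut the generated continuation once the first function body is finished.

    Keeps indented lines (the function body) and stops at the first non-empty
    line that dedents back to column 0, which is where the model tends to
    append extra "Answer: ..." text.
    """
    kept = []
    for line in generation.splitlines():
        if line.strip() and not line.startswith((" ", "\t")):
            break  # dedented line -> the function is finished
        kept.append(line)
    return "\n".join(kept).rstrip() + "\n"

# Example: trailing "Answer: ..." text after the body is dropped.
raw = "    return number - int(number)\n\n\nAnswer: import math\n"
print(truncate_after_function(raw))  # keeps only "    return number - int(number)"
```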

2. The import helpers do not really make a difference - I think for Python they might not change the score at all. The reason they are added is that the model is not given the chance to modify the imports at the top of the file but is directly prompted with the function start. In Python it could add necessary imports even at the function start, but in Go and other languages that does not work.
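
Roughly what the import helper amounts to (the concrete import list and the helper function here are illustrative assumptions, not the exact code in the evaluation harness):

```python
# Assumed, illustrative import helper: prepend common standard-library imports
# before the program so execution does not fail on a missing module that the
# model had no chance to import at the top of the file.
IMPORT_HELPER = {
    "python": [
        "import math",
        "import re",
        "import itertools",
        "from typing import List, Optional",
    ],
}

def with_import_helper(language: str, program: str) -> str:
    """Prepend the language's helper imports to the program to be executed."""
    helpers = IMPORT_HELPER.get(language, [])
    if not helpers:
        return program
    return "\n".join(helpers) + "\n\n" + program

# In Python the model could also write `import math` inside the function body,
# so the helper rarely changes the score; in Go, imports must appear in the
# file header, which the model never gets to edit, so the helper matters there.
program = "def mean(xs):\n    return math.fsum(xs) / len(xs)\n"
print(with_import_helper("python", program))
```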