bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Add GSM8k - Math Reasoning Dataset to the evaluation tasks #37

Closed infinitylogesh closed 1 year ago

infinitylogesh commented 1 year ago

PR to add the GSM8K dataset, as used in the Program-Aided Language Models (PAL) paper, to the few-shot evaluation tasks (Issue #35).

infinitylogesh commented 1 year ago

Added support for the two variations of the GSM dataset, GSM8K and GSM-HARD, with the two evaluation settings used in the paper: greedy decoding and majority voting (the answer for a problem is chosen by voting across the n_samples generations).
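For clarity, here is a minimal sketch of what majority voting over executed answers means here (illustrative only; the names and details are not the harness's actual code):

```python
from collections import Counter
from typing import List, Optional

def majority_vote(answers: List[Optional[float]]) -> Optional[float]:
    """Pick the most common executed answer across n_samples generations.

    Illustrative sketch only. Failed executions are represented as None
    and are ignored in the vote.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None  # every generation failed to produce a number
    # Counter.most_common(1) returns [(answer, count)] for the top answer
    return Counter(valid).most_common(1)[0][0]

# Example: 4 samples, two agree on 42.0, so 42.0 wins the vote
print(majority_vote([42.0, 41.5, 42.0, None]))  # -> 42.0
```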

In the original PAL paper, the authors benchmarked only the Codex models. To test the scoring and the Python executor in this implementation, I ran the Codex generations and evaluation with the original PAL repo, then reused those generations to run execution and evaluation in our harness (with --generations_path) for comparison. The scores match the scores from the PAL script.

| Dataset | PAL | Bigcode PAL |
| --- | --- | --- |
| gsm8k | 69.7725549658832 | 69.7725549658832 |

Some questions:

Please let me know if you have any comments or suggestions. Thanks!

infinitylogesh commented 1 year ago

Further results comparing the evaluation of the Codex generations with the original PAL repo against our implementation, for the GSM-HARD majority-voting (n_samples=4) setting:

| Dataset | PAL | Bigcode PAL |
| --- | --- | --- |
| gsm-hard | 62.1683 | 62.0167 |

infinitylogesh commented 1 year ago

Thank you very much for the review and comments. Please find the summary of the changes:

  1. Created a test that compares the prompt built by our task against the prompt from the PAL repo (along with a test for scoring); the prompts match. A sketch of such a test follows this list.
  2. As recommended, updated the few-shot count to 8: the 8-shot prompt fits within the 2048-token context and leaves room for generation.
  3. Applied isort import ordering on top of black, as suggested.
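As an illustration of what such a prompt test can look like (a minimal, self-contained sketch; the few-shot strings, prompt format, and function names here are hypothetical and not the actual code in this PR):

```python
# Hypothetical sketch of a prompt-equivalence test: the examples, format,
# and names below are illustrative, not the PR's actual identifiers.
FEWSHOT_EXAMPLES = (
    "Q: example question 1\n\ndef solution():\n    return 1\n",
    "Q: example question 2\n\ndef solution():\n    return 2\n",
)  # the PR uses the 8 PAL few-shot examples; only 2 placeholders are shown here


def build_prompt(question: str) -> str:
    """Concatenate the few-shot examples and append the new question."""
    return "\n".join(FEWSHOT_EXAMPLES) + f"\nQ: {question}\n\ndef solution():\n"


def test_prompt_matches_pal_reference():
    question = "Olivia has $23. She bought five bagels for $3 each. How much is left?"
    # In the real test the reference prompt would come from the PAL repo;
    # here it is an inlined string so the sketch stays self-contained.
    reference = "\n".join(FEWSHOT_EXAMPLES) + f"\nQ: {question}\n\ndef solution():\n"
    assert build_prompt(question) == reference


test_prompt_matches_pal_reference()  # passes when the prompts match exactly
```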

Please let me know if you have any further comments or suggestions. Thanks!

loubnabnl commented 1 year ago

Thanks Logesh! Everything looks good. I tried running the evaluation and got a couple of warnings like this:

`UserWarning: ValueError - could not convert string to float: '' during scoring task_id - 16, answer - , reference - 230 defaulting evaluation score for this answer to 0`

Is this expected?

infinitylogesh commented 1 year ago

Thank you for reviewing, Loubna! Yes, I think this is expected. The warning is raised when the stdout read from executing a generation is in an unexpected format (generations with syntax errors, without a return statement, or returning a non-numeric value). For these generations we default the score to zero, and the warning informs the user of that.
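As a rough illustration of that fallback, here is a minimal sketch assuming the score is computed by parsing the executed program's stdout as a float (illustrative only, not the harness's actual code):

```python
import warnings

def parse_answer(stdout: str, task_id: int, reference: str) -> float:
    """Parse the executed program's stdout as a number.

    Sketch of the behaviour described above: if the output is empty,
    non-numeric, or missing (e.g. syntax error, no return statement),
    warn and treat the answer as unparseable so it scores 0.
    """
    try:
        return float(stdout.strip())
    except ValueError as err:
        warnings.warn(
            f"ValueError - {err} during scoring task_id - {task_id}, "
            f"answer - {stdout!r}, reference - {reference}; "
            "defaulting evaluation score for this answer to 0"
        )
        return float("nan")  # NaN never equals the reference, so the score is 0

def score(stdout: str, task_id: int, reference: str) -> int:
    """Return 1 if the parsed answer equals the reference, else 0."""
    return int(parse_answer(stdout, task_id, reference) == float(reference))

print(score("", 16, "230"))       # -> 0 (warning emitted, unparseable answer)
print(score("230\n", 16, "230"))  # -> 1
```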

Please let me know if the message is not informative or does not communicate this intent; we can rephrase it a bit.

loubnabnl commented 1 year ago

Yes, I think we can hide these warnings when they just come from failed executions of bad completions (we do the same for HumanEval); otherwise we can end up with a lot of warnings.

infinitylogesh commented 1 year ago

Thanks for the suggestion. I have disabled the warning now.
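For reference, one way to suppress this kind of warning, assuming it is raised through Python's warnings module as in the sketch above (the actual change may simply remove the warning call instead):

```python
import warnings

# Silence the "could not convert string to float" warnings coming from
# bad completions; failed executions are still scored as 0.
warnings.filterwarnings(
    "ignore",
    message="ValueError - could not convert string to float.*",
)
```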