Closed infinitylogesh closed 1 year ago
Added support for the two variations of the GSM dataset - GSM8K and GSM-HARD - with the two evaluation settings used in the paper:
- Greedy decoding
- Majority voting (the answer to a problem is picked by a vote over the `n_samples` generated solutions).
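The majority-voting setting described above can be sketched as follows. This is a minimal illustration, not the harness's actual implementation; `majority_vote` is a hypothetical helper name, and invalid executions are assumed to be represented as `None`:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent answer among the n_samples generations.

    Answers from failed executions (assumed to be None) are ignored;
    if nothing executed successfully, no answer is returned.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common(1) returns [(answer, count)] for the top answer
    return Counter(valid).most_common(1)[0][0]

print(majority_vote([18.0, 18.0, 20.0, None]))  # -> 18.0
```

Ties are broken by first occurrence here (a `Counter` property); the real harness may break ties differently.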
In the original PAL paper, the authors only benchmarked the Codex models. To test the scoring and Python executor in this implementation, I ran the Codex generations and evaluation using the original PAL repo, then used those generations to run execution and evaluation in our harness (with `--generations_path`) for comparison. The scores match those from the PAL script.
Dataset | PAL | Bigcode PAL |
---|---|---|
gsm8k | 69.7725549658832 | 69.7725549658832 |
Some questions:
- Task naming convention: tasks are named `pal-{dataset_name}-{evaluation_type}` (e.g. `pal-gsm8k-greedy`, `pal-gsmhard-majority_voting`). I hope this convention is fine?

Please let me know if you have any comments or suggestions. Thanks
Further results from comparing the evaluation of Codex generations between the original PAL repo and our implementation, for the GSM-Hard majority voting (n_samples=4) setting:
Dataset | PAL | Bigcode PAL |
---|---|---|
gsm-hard | 62.1683 | 62.0167 |
Thank you very much for the review and comments. Please find the summary of the changes:
Please let me know if you have any further comments or suggestions. Thanks
Thanks Logesh! Everything looks good, I tried running the evaluation and got a couple of warnings like this:
`UserWarning: ValueError - could not convert string to float: '' during scoring task_id - 16, answer - , reference - 230, defaulting evaluation score for this answer to 0`
Is this expected?
Thank you for reviewing Loubna! Yes, I think this is expected. This warning can happen if the stdout read from executing the generation is in an unexpected format (generations with syntax issues, without return statements, or returning a non-numeric value). For these generations we default the score to zero, and the warning informs the user about this.
Please let me know if this wording is unclear or doesn't communicate that intent; we can rephrase it a bit.
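The default-to-zero behaviour described above can be sketched like this. The function name and exact messages are illustrative, not the harness's real API:

```python
import warnings

def score_answer(stdout_text, reference, tol=1e-3):
    """Score one generation by comparing the executed program's stdout
    to the reference answer.

    If the stdout cannot be parsed as a number (syntax error, missing
    return statement, non-numeric value), warn and default the score
    to 0, as described in the comment above.
    """
    try:
        answer = float(stdout_text.strip())
    except ValueError as e:
        warnings.warn(
            f"ValueError - {e} during scoring, "
            f"defaulting evaluation score for this answer to 0"
        )
        return 0
    # Numeric comparison with a small tolerance
    return int(abs(answer - float(reference)) < tol)

print(score_answer("230", "230"))  # -> 1
print(score_answer("", "230"))     # warns, -> 0
```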
Yes, I think we can probably hide these warnings if it's just a failed execution from bad completions (we do the same for HumanEval); otherwise we can get a lot of warnings.
Thanks for the suggestion. I have disabled the warning now.
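One way to disable the warning, as suggested, is to scope a filter around the scoring step so only warnings from bad completions are silenced. A minimal sketch with illustrative names (the harness may scope the filter differently):

```python
import warnings

def score_quietly(stdout_text):
    """Parse a generation's stdout as a float, silently scoring
    unparsable output (bad completions) as 0.0."""
    # catch_warnings restores the previous filters on exit, so the
    # suppression only applies inside this block.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)
        try:
            return float(stdout_text.strip())
        except ValueError:
            warnings.warn("bad completion, defaulting to 0")  # silenced
            return 0.0

print(score_quietly("230"))  # -> 230.0
print(score_quietly(""))     # -> 0.0, with no warning emitted
```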
PR to add the GSM8K dataset, as used in the Program-Aided Language Models (PAL) paper, to few-shot evaluation (Issue #35)