bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Add GSM8k - Math Reasoning Dataset to the evaluation tasks #37

Closed infinitylogesh closed 1 year ago

infinitylogesh commented 1 year ago

PR to add the GSM8K dataset, as used in the Program-Aided Language Models (PAL) paper, to the few-shot evaluation tasks (Issue #35).

infinitylogesh commented 1 year ago

Added support for the two variations of the GSM dataset, GSM8K and GSM-HARD, with the two evaluation settings used in the paper: greedy decoding and majority voting (the answer for a problem is chosen by voting across the n_samples generations).
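For clarity, here is a minimal sketch of what majority voting over executed answers means here (illustrative only; the names and details are not the harness's actual code):

```python
from collections import Counter
from typing import List, Optional

def majority_vote(answers: List[Optional[float]]) -> Optional[float]:
    """Pick the most common executed answer across n_samples generations.

    Illustrative sketch only. Failed executions are represented as None
    and are ignored in the vote.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None  # every generation failed to produce a number
    # Counter.most_common(1) returns [(answer, count)] for the top answer
    return Counter(valid).most_common(1)[0][0]

# Example: 4 samples, two agree on 42.0, so 42.0 wins the vote
print(majority_vote([42.0, 41.5, 42.0, None]))  # -> 42.0
```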

In the original PAL paper, the authors benchmarked only the Codex models. To test the scoring and the Python executor in this implementation, I ran the Codex generations and evaluation with the original PAL repo, then reused those generations to run execution and evaluation in our harness (with --generations_path) for comparison. The scores match the scores from the PAL script.

| Dataset | PAL | Bigcode PAL |
| --- | --- | --- |
| gsm8k | 69.7725549658832 | 69.7725549658832 |

Some questions:

Please let me know if you have any comments or suggestions. Thanks!

infinitylogesh commented 1 year ago

Further results comparing the evaluation of the Codex generations with the original PAL repo against our implementation, for the GSM-HARD majority-voting (n_samples=4) setting:

| Dataset | PAL | Bigcode PAL |
| --- | --- | --- |
| gsm-hard | 62.1683 | 62.0167 |

infinitylogesh commented 1 year ago

Thank you very much for the review and comments. Please find the summary of the changes:

  1. Created a test that compares the prompt built by our task against the prompt from the PAL repo (along with a test for scoring); the prompts match. A sketch of such a test follows this list.
  2. As recommended, updated the few-shot count to 8: the 8-shot prompt fits within the 2048-token context and leaves room for generation.
  3. Applied isort import ordering on top of black, as suggested.
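As an illustration of what such a prompt test can look like (a minimal, self-contained sketch; the few-shot strings, prompt format, and function names here are hypothetical and not the actual code in this PR):

```python
# Hypothetical sketch of a prompt-equivalence test: the examples, format,
# and names below are illustrative, not the PR's actual identifiers.
FEWSHOT_EXAMPLES = (
    "Q: example question 1\n\ndef solution():\n    return 1\n",
    "Q: example question 2\n\ndef solution():\n    return 2\n",
)  # the PR uses the 8 PAL few-shot examples; only 2 placeholders are shown here


def build_prompt(question: str) -> str:
    """Concatenate the few-shot examples and append the new question."""
    return "\n".join(FEWSHOT_EXAMPLES) + f"\nQ: {question}\n\ndef solution():\n"


def test_prompt_matches_pal_reference():
    question = "Olivia has $23. She bought five bagels for $3 each. How much is left?"
    # In the real test the reference prompt would come from the PAL repo;
    # here it is an inlined string so the sketch stays self-contained.
    reference = "\n".join(FEWSHOT_EXAMPLES) + f"\nQ: {question}\n\ndef solution():\n"
    assert build_prompt(question) == reference


test_prompt_matches_pal_reference()  # passes when the prompts match exactly
```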

Please let me know if you have any further comments or suggestions. Thanks!

loubnabnl commented 1 year ago

Thanks Logesh! Everything looks good. I tried running the evaluation and got a couple of warnings like this:

`UserWarning: ValueError - could not convert string to float: '' during scoring task_id - 16, answer - , reference - 230 defaulting evaluation score for this answer to 0`

Is this expected?

infinitylogesh commented 1 year ago

Thank you for reviewing, Loubna! Yes, I think this is expected. The warning is raised when the stdout read from executing a generation is in an unexpected format (generations with syntax errors, without a return statement, or returning a non-numeric value). For these generations we default the score to zero, and the warning informs the user of that.
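As a rough illustration of that fallback, here is a minimal sketch assuming the score is computed by parsing the executed program's stdout as a float (illustrative only, not the harness's actual code):

```python
import warnings

def parse_answer(stdout: str, task_id: int, reference: str) -> float:
    """Parse the executed program's stdout as a number.

    Sketch of the behaviour described above: if the output is empty,
    non-numeric, or missing (e.g. syntax error, no return statement),
    warn and treat the answer as unparseable so it scores 0.
    """
    try:
        return float(stdout.strip())
    except ValueError as err:
        warnings.warn(
            f"ValueError - {err} during scoring task_id - {task_id}, "
            f"answer - {stdout!r}, reference - {reference}; "
            "defaulting evaluation score for this answer to 0"
        )
        return float("nan")  # NaN never equals the reference, so the score is 0

def score(stdout: str, task_id: int, reference: str) -> int:
    """Return 1 if the parsed answer equals the reference, else 0."""
    return int(parse_answer(stdout, task_id, reference) == float(reference))

print(score("", 16, "230"))       # -> 0 (warning emitted, unparseable answer)
print(score("230\n", 16, "230"))  # -> 1
```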

Please let me know if the message is not informative or does not communicate this intent; we can rephrase it a bit.

loubnabnl commented 1 year ago

Yes, I think we can hide these warnings when they just come from failed executions of bad completions (we do the same for HumanEval); otherwise we can end up with a lot of warnings.

infinitylogesh commented 1 year ago

Thanks for the suggestion. I have disabled the warning now.
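For reference, one way to suppress this kind of warning, assuming it is raised through Python's warnings module as in the sketch above (the actual change may simply remove the warning call instead):

```python
import warnings

# Silence the "could not convert string to float" warnings coming from
# bad completions; failed executions are still scored as 0.
warnings.filterwarnings(
    "ignore",
    message="ValueError - could not convert string to float.*",
)
```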