bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Add Reasoning tasks to the evaluation #35

Closed: infinitylogesh closed this issue 10 months ago

infinitylogesh commented 1 year ago

Recently, code generation models have been shown to perform well on natural language and/or math reasoning tasks (1 and 2), so it would be good to evaluate the BigCode models on these tasks.

As discussed in the evaluation meeting, we could explore adding the PAL datasets and/or the reasoning tasks from HELM.
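For a sense of what PAL-style evaluation involves: the model writes a short Python program for each word problem, the program is executed, and accuracy is measured on the executed result rather than on the generated text. A rough sketch of that scoring step (the generated program is hard-coded here in place of a real model completion, and `run_program` is just an illustrative helper, not something from the harness):

```python
# Rough sketch of PAL-style scoring: execute the model-generated program and
# compare its `answer` variable to the gold answer. In a real harness the
# execution would be sandboxed; `run_program` is an illustrative name.

def run_program(program: str) -> str:
    scope = {}
    exec(program, scope)  # NOTE: unsandboxed, for illustration only
    return str(scope["answer"])

question = "Natalia sold 48 clips in April and half as many in May. How many clips in total?"

# Stand-in for a model completion on the question above.
generated_program = """
clips_april = 48
clips_may = clips_april / 2
answer = int(clips_april + clips_may)
"""

gold_answer = "72"
prediction = run_program(generated_program)
print("correct" if prediction == gold_answer else "incorrect")  # -> correct
```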

PAL Datasets:

teetone commented 1 year ago

Here is the list of scenarios/tasks we used to evaluate code models in HELM: https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs_bigcode.conf#L1. I created a separate conf file for the BigCode project.

The scenarios are in the scenarios folder: https://github.com/stanford-crfm/helm/tree/main/src/helm/benchmark/scenarios

For example, GSM8K is at src/helm/benchmark/scenarios/gsm_scenario.py: https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/scenarios/gsm_scenario.py.
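For GSM8K specifically, each example pairs a word problem with a worked solution whose last line is `#### <number>`, and scoring is exact match on that final number. Roughly (the helper names below are illustrative, not the scenario's actual code):

```python
import re

# Rough sketch of GSM8K-style scoring: the gold answer is the number after the
# "####" marker in the reference solution, and the model's answer is taken to
# be the last number in its completion. Helper names are illustrative only.

def extract_gold(solution: str) -> str:
    return solution.split("####")[-1].strip().replace(",", "")

def extract_prediction(completion: str) -> str:
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else ""

gold = "Natalia sold 48 / 2 = 24 clips in May.\nIn total she sold 48 + 24 = 72 clips.\n#### 72"
completion = "She sold 24 clips in May, so 48 + 24 = 72. The answer is 72."
print(extract_prediction(completion) == extract_gold(gold))  # -> True
```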

infinitylogesh commented 1 year ago

Thank you @teetone, the conf file gives a good list of the tasks that can be covered.
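If we go ahead, adding one of these as a harness task would presumably look like the existing tasks, i.e. a subclass of the harness's `Task` base class. Very rough skeleton below; the import path and hook names follow the existing task files and may differ between versions, and the class/dataset ids are placeholders:

```python
# Very rough skeleton of a new reasoning task for this harness, assuming the
# Task base class used by the existing tasks (get_dataset / get_prompt /
# get_reference / postprocess_generation / process_results hooks). Exact import
# path and hook names may differ by harness version; dataset ids are placeholders.
from datasets import load_dataset

from lm_eval.base import Task  # may be bigcode_eval.base in newer versions


class PalGSM8K(Task):
    DATASET_PATH = "gsm8k"  # placeholder HF dataset id
    DATASET_NAME = "main"

    def __init__(self):
        super().__init__(stop_words=["\n\n"], requires_execution=True)

    def get_dataset(self):
        return load_dataset(self.DATASET_PATH, self.DATASET_NAME, split="test")

    def get_prompt(self, doc):
        # A few-shot PAL prompt would be prepended here; kept minimal for the sketch.
        return f"# Q: {doc['question']}\n# Write a Python program that sets `answer`.\n"

    def get_reference(self, doc):
        return doc["answer"].split("####")[-1].strip()

    def postprocess_generation(self, generation, idx):
        return generation.split("\n\n")[0]

    def process_results(self, generations, references):
        # Execute each generated program and report exact-match accuracy on `answer`.
        correct = 0
        for gens, ref in zip(generations, references):
            scope = {}
            try:
                exec(gens[0], scope)  # unsandboxed, illustration only
            except Exception:
                continue
            if str(scope.get("answer")) == ref:
                correct += 1
        return {"accuracy": correct / max(len(references), 1)}
```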