infinitylogesh closed this issue 10 months ago
Here is the list of scenarios/tasks we used to evaluate code models in HELM: https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs_bigcode.conf#L1. I created a separate conf file for the BigCode project.
The scenarios are in the scenarios folder: https://github.com/stanford-crfm/helm/tree/main/src/helm/benchmark/scenarios. For example, GSM8K is at src/helm/benchmark/scenarios/gsm_scenario.py: https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/scenarios/gsm_scenario.py.
Thank you @teetone, the conf file gives a good list of the tasks that can be covered.
Recently, code generation models have been shown to be good at solving natural language and/or math reasoning tasks (1 and 2). So it would be good to evaluate the BigCode models on these tasks.
As discussed in the evaluation meeting, we could explore the options of adding PAL datasets and/or reasoning tasks from HELM.
PAL Datasets: