Add configs for testing harness

This PR adds scripts/generate_harness.py, which generates a harness command, like the ones in harness.sh scripts, from the contents of config files. This improves on the current practice of copying harness.sh scripts in two main ways: changes can be automated, and evaluation settings stay consistent across models.

harness.sh scripts are hard to read because of their long options, and the order of task names has to be kept aligned by hand with the order of the few-shot values. In the config, each task is instead its own entry, and prompt and few-shot values can be set per task. Additionally, the result output path can be generated automatically, so there's no risk of forgetting to change it.

Task-specific config looks a bit like this:

[tasks.xlsum_ja]
fewshot = 1
# this will inherit the default prompt version

[tasks.xwinograd_ja]
fewshot = 0
# This specifically has no prompt
prompt = ""
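
As a rough sketch of how entries like these could turn into aligned `--tasks` and `--num_fewshot` values: the function below is not the code in generate_harness.py. The name `task_args`, the default prompt version `"0.3"`, the default of 0 shots, and the `name-prompt` naming rule are assumptions read off the example output in this description.

```python
# Hypothetical default; the real default prompt version comes from the config files.
DEFAULT_PROMPT = "0.3"


def task_args(tasks, default_prompt=DEFAULT_PROMPT):
    """Build aligned --tasks and --num_fewshot strings from per-task entries."""
    names, shots = [], []
    for name, opts in tasks.items():
        # A missing "prompt" key inherits the default; an empty string means no prompt.
        prompt = opts.get("prompt", default_prompt)
        names.append(f"{name}-{prompt}" if prompt else name)
        # Assumption: tasks without a "fewshot" key run zero-shot.
        shots.append(str(opts.get("fewshot", 0)))
    return ",".join(names), ",".join(shots)


# The two entries shown above, written as plain dicts.
tasks = {
    "xlsum_ja": {"fewshot": 1},
    "xwinograd_ja": {"fewshot": 0, "prompt": ""},
}
print(task_args(tasks))
```

Because both strings are built in the same loop, the task order and the few-shot order cannot drift apart.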

This also allows for a hierarchy of configs. For example, it's possible to set a global config at models/harness.conf that serves as the base for configs per organization (like models/stablelm) or per model. Configs in higher directories are used as a fallback for any value not set further down, so you can add an eval task to the global config and re-run all evals to get updated results.
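
A minimal sketch of the fallback idea, assuming every directory between models/ and the model directory may hold a harness.conf and that more specific configs win. `config_chain` and `merge` are hypothetical helpers, not the functions in generate_harness.py, and the configs are plain dicts here so no file format is implied.

```python
from pathlib import Path


def config_chain(model_dir, root="models", filename="harness.conf"):
    """List candidate config paths from the global config down to the model's own."""
    model_dir, root = Path(model_dir), Path(root)
    dirs = [model_dir, *model_dir.parents]
    # Keep only the root directory and directories below it.
    dirs = [d for d in dirs if d == root or root in d.parents]
    return [d / filename for d in reversed(dirs)]


def merge(base, override):
    """Recursively merge two config dicts; values in `override` win."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out


paths = config_chain("models/stablelm/stablelm-jp-3b-ja50_rp50-700b/")
print([p.as_posix() for p in paths])

# Global config sets a task; the model config overrides one value and adds another task.
global_conf = {"tasks": {"xlsum_ja": {"fewshot": 1}}}
model_conf = {"tasks": {"xlsum_ja": {"fewshot": 2}, "xwinograd_ja": {"fewshot": 0}}}
print(merge(global_conf, model_conf))
```

Merging from the most general config to the most specific means a task added at the top level shows up for every model unless a lower config overrides it.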

generate_harness.py just generates the command line to actually run the eval, so it doesn't require any changes to existing harness scripts or the eval system. This means it can also be used to generate harness.sh scripts like we currently use.

Example usage:

python scripts/generate_harness.py models/stablelm/stablelm-jp-3b-ja50_rp50-700b/
# output:
python main.py --device cuda --model hf-causal --model_args pretrained=./hf_model/3b-ja50_rp50-700b,tokenizer=./tokenizers/nai-hf-tokenizer/,use_fast=False --tasks jcommonsenseqa-1.1-0.3,jnli-0.3,marc_ja-0.3,jsquad-1.1-0.3,jaqket_v2-0.1-0.3,xlsum_ja-0.3,xwinograd_ja,mgsm-0.3 --num_fewshot 3,3,3,2,1,1,0,5 --output_path /mnt/pool/work/stability/lm-evaluation-harness/models/stablelm/stablelm-jp-3b-ja50_rp50-700b/result.json
# note this uses a placeholder PROJECT_DIR of "."
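
For illustration, the command above could be assembled roughly like this. The flags are the ones in the example output; `build_command` is a hypothetical helper, and placing result.json inside the model's config directory is an assumption based on the `--output_path` shown, not a statement about how the script does it.

```python
import shlex
from pathlib import Path


def build_command(model_dir, model_args, tasks, num_fewshot):
    """Return a single shell-safe command line for main.py."""
    # Assumption: the result file sits next to the model's config,
    # so the path never has to be edited by hand.
    output_path = Path(model_dir).resolve() / "result.json"
    argv = [
        "python", "main.py",
        "--device", "cuda",
        "--model", "hf-causal",
        "--model_args", model_args,
        "--tasks", tasks,
        "--num_fewshot", num_fewshot,
        "--output_path", str(output_path),
    ]
    # shlex.join quotes any argument that needs it.
    return shlex.join(argv)


cmd = build_command(
    "models/stablelm/stablelm-jp-3b-ja50_rp50-700b/",
    "pretrained=./hf_model/3b-ja50_rp50-700b,use_fast=False",
    "xlsum_ja-0.3,xwinograd_ja",
    "1,0",
)
print(cmd)
```

Since the result is just a string, it can be run directly or written into a harness.sh file.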

Further documentation is included at the head of generate_harness.py. This PR also includes some base configs to get started.

Stability-AI / lm-evaluation-harness

Add configs for testing harness #78