Eval: BBH always stops at \n\n; TydiQA reports recall

Test

(open-instruct) jiachengl@allennlp-cirrascale-50:/net/nfs.cirrascale/allennlp/jiachengl/open-instruct$ CUDA_VISIBLE_DEVICES=2 python -m eval.bbh.run_eval --data_dir data/eval/bbh --save_dir tmp --model ../n-tulu-ppo-jax/ckpt/v1.17_ckpt500 --tokenizer_name_or_path ../n-tulu-ppo-jax/ckpt/v1.17_ckpt500 --max_num_examples_per_task 3 --use_chat_format --chat_formatting_function eval.templates.create_prompt_with_tulu_chat_format
[2024-02-24 00:45:36,280] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading tasks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [00:00<00:00, 1315.56it/s]
Loading prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [00:00<00:00, 2438.60it/s]
Loading model and tokenizer with huggingface...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.71s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Using pad_token, but it is not set yet.
Evaluating:   0%|                                                                                                                                                                                                                         | 0/27 [00:00<?, ?it/s]
/net/nfs.cirrascale/allennlp/jiachengl/miniconda3/envs/open-instruct/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:381: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
Generating Completions: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:19<00:00,  6.46s/it]
Task boolean_expressions - EM: 0.3333333333333333
Generating Completions: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:12<00:00,  4.13s/it]
Task causal_judgement - EM: 0.6666666666666666
Generating Completions: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:08<00:00,  2.74s/it]
Task date_understanding - EM: 0.6666666666666666
Evaluating:  11%|███████████████████████▏                                                                                                                                                                                         | 3/27 [00:40<04:49, 12.07s/it]
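
For readers unfamiliar with the stopping behaviour named in the title, here is a minimal sketch of what "stops at \n\n" means for a completion. It is an illustration only: the helper name `truncate_at_stop` is hypothetical, and the actual eval script may instead pass the stop sequence to the generation call rather than trimming afterwards.

```python
def truncate_at_stop(text: str, stop: str = "\n\n") -> str:
    """Return `text` up to, but not including, the first occurrence of `stop`.

    Hypothetical helper for illustration; not taken from the eval code.
    """
    idx = text.find(stop)
    # str.find returns -1 when the stop sequence never appears.
    return text if idx == -1 else text[:idx]


if __name__ == "__main__":
    completion = "So the answer is True.\n\nQ: not ( True ) and ( True ) is"
    print(repr(truncate_at_stop(completion)))
```

Trimming at the first blank line keeps the model's answer to the current question and drops any follow-on question it starts to generate, which matters for exact-match scoring.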

(open-instruct) jiachengl@allennlp-cirrascale-50:/net/nfs.cirrascale/allennlp/jiachengl/open-instruct$ CUDA_VISIBLE_DEVICES=2 python -m eval.tydiqa.run_eval --data_dir data/eval/tydiqa --n_shot 1 --save_dir tmp --openai_engine gpt-3.5-turbo --max_num_examples_per_lang 1 --use_chat_format --chat_formatting_function eval.templates.create_prompt_with_tulu_chat_format
[2024-02-24 01:44:50,027] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading data...
Loaded 9 examples from 9 languages: ['arabic', 'bengali', 'english', 'finnish', 'indonesian', 'korean', 'russian', 'swahili', 'telugu']
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:07<00:00,  1.27it/s]
Calculating F1, EM ...
Calculating recall ...
Scores:
{   
    "arabic": {
        "exact_match": 0.0,
        "f1": 60.60606060606061,
        "recall": 0.0
    },
    "bengali": {
        "exact_match": 0.0,
        "f1": 50.0,
        "recall": 100.0
    },
    "english": {
        "exact_match": 0.0,
        "f1": 91.89189189189189,
        "recall": 100.0
    },
    "finnish": {
        "exact_match": 0.0,
        "f1": 28.57142857142857,
        "recall": 100.0
    },
    "indonesian": {
        "exact_match": 100.0,
        "f1": 100.0,
        "recall": 100.0
    },
    "korean": {
        "exact_match": 0.0,
        "f1": 33.33333333333333,
        "recall": 100.0
    },
    "russian": {
        "exact_match": 0.0,
        "f1": 0.0,
        "recall": 0.0
    },
    "swahili": {
        "exact_match": 0.0,
        "f1": 87.50000000000001,
        "recall": 100.0
    },
    "telugu": {
        "exact_match": 0.0,
        "f1": 0.0,
        "recall": 0.0
    },
    "average": {
        "f1": 50.211412711412706,
        "exact_match": 11.11111111111111,
        "recall": 66.66666666666667
    }
}
Done!
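
The per-language recall values above are all either 0.0 or 100.0 with one example per language, which is consistent with a binary per-example score. The sketch below assumes recall means "some gold answer appears as a substring of the prediction"; that definition, and the names `answer_recall` and `mean_score`, are assumptions for illustration and are not taken from the eval code.

```python
from typing import List


def answer_recall(prediction: str, gold_answers: List[str]) -> float:
    """Return 100.0 if any non-empty gold answer occurs in the prediction, else 0.0.

    Assumed definition (case-insensitive substring match), for illustration only.
    """
    pred = prediction.strip().lower()
    golds = [g.strip().lower() for g in gold_answers if g.strip()]
    return 100.0 if any(g in pred for g in golds) else 0.0


def mean_score(scores: List[float]) -> float:
    """Unweighted mean over per-language scores."""
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Per-language recall values copied from the log above, in order.
    per_language = [0.0, 100.0, 100.0, 100.0, 100.0, 100.0, 0.0, 100.0, 0.0]
    print(round(mean_score(per_language), 2))
```

Under this reading, a long prediction that contains the gold span scores full recall even when its F1 is low, which would explain rows such as finnish (F1 28.57, recall 100.0).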

allenai / open-instruct

Eval: BBH always stops at \n\n; TydiQA reports recall #122