Stability-AI / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.

add llama2 format #100

Closed · mkshing closed this 1 year ago

mkshing commented 1 year ago

Description

Added a new prompt version for llama2-chat, following https://huggingface.co/blog/llama2#how-to-prompt-llama-2. This PR makes it possible to evaluate all Llama 2 variants, including ELYZA-japanese-Llama-2-7b-instruct.
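For reference, the llama2-chat format described in that blog post wraps the system prompt in `<<SYS>>` markers inside the first `[INST]` block:

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```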

Usage

Two key points:

  1. Set the correct system prompt via the `SYSTEM_PROMPT` environment variable.
  2. Use `0.6` as the prompt version.
```bash
# Make sure to set the correct system prompt
export SYSTEM_PROMPT="You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

MODEL_ARGS="pretrained=meta-llama/Llama-2-7b-chat-hf"
TASK="jsquad-1.1-0.6,jcommonsenseqa-1.1-0.6,jnli-1.1-0.6,marc_ja-1.1-0.6,jaqket_v2-0.2-0.6,xlsum_ja-1.0-0.6,xwinograd_ja,mgsm-1.0-0.6"
python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks $TASK \
    --num_fewshot "2,3,0,0,1,1,0,5" \
    --device "cuda" \
    --output_path "models/llama2/llama2-7b-chat/result.json"
```
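As a rough sketch of how a prompt version like `0.6` can consume that environment variable (illustrative only; the helper below is not the harness's actual implementation):

```python
import os

# Illustrative sketch only, not the harness's actual code: how a
# llama2-chat prompt version could read SYSTEM_PROMPT from the
# environment and wrap a task's input in the chat template.
SYSTEM_PROMPT = os.environ.get("SYSTEM_PROMPT", "")

def format_llama2_chat(user_message: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"
        f"{SYSTEM_PROMPT}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )
```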

Comparison between prompt versions 0.3 and 0.6 on JCommonsenseQA for ELYZA-japanese-Llama-2-7b-instruct

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.7087 | ± 0.0136 |
| | | acc_norm | 0.7015 | ± 0.0137 |
| jcommonsenseqa-1.1-0.3 | 1.1 | acc | 0.6506 | ± 0.0143 |
| | | acc_norm | 0.3539 | ± 0.0143 |
```bash
# System prompt for the Japanese model
# ("You are a sincere and excellent Japanese assistant.")
export SYSTEM_PROMPT="あなたは誠実で優秀な日本人のアシスタントです。"
export MODEL_ARGS="pretrained=elyza/ELYZA-japanese-Llama-2-7b-instruct"
export TASK="jcommonsenseqa-1.1-0.6,jcommonsenseqa-1.1-0.3"
python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks $TASK \
    --num_fewshot "3,3" \
    --device "cuda" \
    --output_path models/elyza/ELYZA-japanese-Llama-2-7b-instruct/result.json
```
```json
{
  "results": {
    "jcommonsenseqa-1.1-0.6": {
      "acc": 0.7086684539767649,
      "acc_stderr": 0.013589216112682913,
      "acc_norm": 0.7015192135835567,
      "acc_norm_stderr": 0.013685386698397504
    },
    "jcommonsenseqa-1.1-0.3": {
      "acc": 0.6505808757819481,
      "acc_stderr": 0.014259460025628168,
      "acc_norm": 0.353887399463807,
      "acc_norm_stderr": 0.01430097848599956
    }
  },
  "versions": {
    "jcommonsenseqa-1.1-0.6": 1.1,
    "jcommonsenseqa-1.1-0.3": 1.1
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=elyza/ELYZA-japanese-Llama-2-7b-instruct",
    "num_fewshot": [
      3,
      3
    ],
    "batch_size": null,
    "device": "cuda",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
```
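To compare the two runs programmatically, the `result.json` written above can be read back with the standard library (a minimal sketch; the path matches the `--output_path` used in the command):

```python
import json

# Load the evaluation output written by --output_path above.
with open("models/elyza/ELYZA-japanese-Llama-2-7b-instruct/result.json") as f:
    results = json.load(f)["results"]

# Print acc and acc_norm for each prompt version side by side.
for task, metrics in results.items():
    print(f"{task}: acc={metrics['acc']:.4f}, acc_norm={metrics['acc_norm']:.4f}")
```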

leemengtw commented 1 year ago

@mkshing Quick comment, but maybe we can start using `llama2` instead of `0.6` to improve readability? This was suggested by @polm-stability on Slack as well.

mkshing commented 1 year ago

@leemengtw Yeah, I agree with your point, but we would need to deprecate the integer versions and rename everything, which takes extra work. So, at least in this PR, I'd like to keep the integer format. Does that make sense?