Stability-AI / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.

add llama2 format #100

Closed · mkshing closed this 1 year ago

mkshing commented 1 year ago

Description

Added a new prompt version for llama2-chat, following https://huggingface.co/blog/llama2#how-to-prompt-llama-2. This PR makes it possible to evaluate all Llama 2 variants, including ELYZA-japanese-Llama-2-7b-instruct.
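For reference, the llama2-chat format described in that blog post wraps the system prompt in `<<SYS>>` markers inside the first `[INST]` block:

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```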

Usage

Two key points:

  1. Set the correct system prompt via the `SYSTEM_PROMPT` environment variable.
  2. Use `0.6` as the prompt version.
```bash
# Make sure to set the correct system prompt
export SYSTEM_PROMPT="You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

MODEL_ARGS="pretrained=meta-llama/Llama-2-7b-chat-hf"
TASK="jsquad-1.1-0.6,jcommonsenseqa-1.1-0.6,jnli-1.1-0.6,marc_ja-1.1-0.6,jaqket_v2-0.2-0.6,xlsum_ja-1.0-0.6,xwinograd_ja,mgsm-1.0-0.6"
python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks $TASK \
    --num_fewshot "2,3,0,0,1,1,0,5" \
    --device "cuda" \
    --output_path "models/llama2/llama2-7b-chat/result.json"
```
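As a rough sketch of how a prompt version like `0.6` can consume that environment variable (illustrative only; the helper below is not the harness's actual implementation):

```python
import os

# Illustrative sketch only, not the harness's actual code: how a
# llama2-chat prompt version could read SYSTEM_PROMPT from the
# environment and wrap a task's input in the chat template.
SYSTEM_PROMPT = os.environ.get("SYSTEM_PROMPT", "")

def format_llama2_chat(user_message: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"
        f"{SYSTEM_PROMPT}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )
```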

Comparison between prompt versions 0.3 and 0.6 on JCommonsenseQA for ELYZA-japanese-Llama-2-7b-instruct

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.7087 | ± 0.0136 |
| | | acc_norm | 0.7015 | ± 0.0137 |
| jcommonsenseqa-1.1-0.3 | 1.1 | acc | 0.6506 | ± 0.0143 |
| | | acc_norm | 0.3539 | ± 0.0143 |
```bash
# System prompt for the Japanese model
# ("You are a sincere and excellent Japanese assistant.")
export SYSTEM_PROMPT="あなたは誠実で優秀な日本人のアシスタントです。"
export MODEL_ARGS="pretrained=elyza/ELYZA-japanese-Llama-2-7b-instruct"
export TASK="jcommonsenseqa-1.1-0.6,jcommonsenseqa-1.1-0.3"
python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks $TASK \
    --num_fewshot "3,3" \
    --device "cuda" \
    --output_path models/elyza/ELYZA-japanese-Llama-2-7b-instruct/result.json
```
```json
{
  "results": {
    "jcommonsenseqa-1.1-0.6": {
      "acc": 0.7086684539767649,
      "acc_stderr": 0.013589216112682913,
      "acc_norm": 0.7015192135835567,
      "acc_norm_stderr": 0.013685386698397504
    },
    "jcommonsenseqa-1.1-0.3": {
      "acc": 0.6505808757819481,
      "acc_stderr": 0.014259460025628168,
      "acc_norm": 0.353887399463807,
      "acc_norm_stderr": 0.01430097848599956
    }
  },
  "versions": {
    "jcommonsenseqa-1.1-0.6": 1.1,
    "jcommonsenseqa-1.1-0.3": 1.1
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=elyza/ELYZA-japanese-Llama-2-7b-instruct",
    "num_fewshot": [
      3,
      3
    ],
    "batch_size": null,
    "device": "cuda",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
```
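To compare the two runs programmatically, the `result.json` written above can be read back with the standard library (a minimal sketch; the path matches the `--output_path` used in the command):

```python
import json

# Load the evaluation output written by --output_path above.
with open("models/elyza/ELYZA-japanese-Llama-2-7b-instruct/result.json") as f:
    results = json.load(f)["results"]

# Print acc and acc_norm for each prompt version side by side.
for task, metrics in results.items():
    print(f"{task}: acc={metrics['acc']:.4f}, acc_norm={metrics['acc_norm']:.4f}")
```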

leemengtw commented 1 year ago

@mkshing Quick comment, but maybe we can start using `llama2` instead of `0.6` to improve readability? This was suggested by @polm-stability on Slack as well.

mkshing commented 1 year ago

@leemengtw Yeah, I agree with your point, but we would need to deprecate the integer versions and rename everything, which takes extra work. So, at least in this PR, I'd like to keep the integer format. Does that make sense?