Stability-AI / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.
MIT License

Add 8-task scores for rinna bilingual models #77

Closed — mkshing closed this 1 year ago

mkshing commented 1 year ago

Description

Evaluated the following rinna bilingual models with our 8-task setting.

effendijohanes commented 1 year ago

Hi, thank you for adding the evaluation for the Rinna models. I have a question:

How do you decide which prompt version to use when evaluating a model? In this foundation-model case, why is v0.2 used instead of v0.3? The accuracy results for the two prompt versions differ substantially.

Thank you!

mkshing commented 1 year ago

@effendijohanes It really depends on the data format used during training. Please see the link below for the definition of each prompt version: https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/docs/prompt_templates.md
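To make the idea concrete, here is a minimal, hypothetical sketch of "prompt version depends on training format": a base (foundation) model gets a plain continuation-style prompt, while an instruction-tuned model gets a chat-style prompt matching its fine-tuning format. The mapping and the template strings below are invented for illustration only — the real templates are defined in the `docs/prompt_templates.md` linked above.

```python
# Hypothetical sketch: select a prompt template per model based on how the
# model was trained. The model names are real rinna checkpoints, but the
# version mapping and template strings are placeholders, not the harness's
# actual definitions.

PROMPT_VERSION = {
    # Foundation model: trained on raw text, so a plain prompt fits best.
    "rinna/bilingual-gpt-neox-4b": "0.2",
    # Instruction-tuned model: trained on a chat format, so use that format.
    "rinna/bilingual-gpt-neox-4b-instruction-sft": "0.3",
}

TEMPLATES = {
    "0.2": "{question}",                        # placeholder continuation style
    "0.3": "ユーザー: {question}\nシステム: ",   # placeholder chat style
}

def build_prompt(model_name: str, question: str) -> str:
    """Format a question with the template matching the model's training data."""
    version = PROMPT_VERSION[model_name]
    return TEMPLATES[version].format(question=question)
```

Evaluating a model with a prompt format it never saw during training (e.g. a chat template on a foundation model) can shift accuracy considerably, which is why the two versions score so differently.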