LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Consider using OpenAI Evals #2348

Open walking-octopus opened 1 year ago

walking-octopus commented 1 year ago

OpenAI Evals is the only open-source release even tangentially related to GPT-4, and it can help assess an LLM's performance on a growing set of tasks. It could be used to measure how close OpenAssistant models have come to GPT-3.5 so far, to help test different fine-tuned models, or maybe even as part of the RL pipeline.

https://github.com/openai/evals
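For anyone picking this up: the Evals framework lets you plug in a non-OpenAI model by implementing its CompletionFn protocol (a class whose __call__ returns an object with a get_completions() method, per docs/completion-fns.md in that repo). A minimal sketch, assuming a Hugging Face model id that is only a placeholder:

```python
# Hedged sketch: wiring a local Hugging Face model into OpenAI Evals.
# The model id and prompt handling are illustrative assumptions, not an
# official OA integration.
from transformers import pipeline


class HFCompletionResult:
    def __init__(self, text: str):
        self.text = text

    def get_completions(self) -> list[str]:
        # Evals expects a list of completion strings.
        return [self.text]


class HFCompletionFn:
    def __init__(self, model_name: str = "some-org/oasst-model", **kwargs):
        self.generator = pipeline("text-generation", model=model_name)

    def __call__(self, prompt, **kwargs) -> HFCompletionResult:
        # Evals passes either a plain string or a list of chat messages;
        # a real adapter would map roles onto OA's own prompt tokens.
        if isinstance(prompt, list):
            prompt = "\n".join(m["content"] for m in prompt)
        out = self.generator(prompt, max_new_tokens=256, return_full_text=False)
        return HFCompletionResult(out[0]["generated_text"])
```

Registered via a small YAML entry under evals/registry/completion_fns/, this should be runnable with `oaieval <completion_fn_name> <eval_name>`, if I read the Evals docs correctly.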

andreaskoepf commented 1 year ago

Yes, I think such an assessment would make sense for our 30B LLaMA-based model. If someone is interested in doing this, please let us know.

chiayewken commented 1 year ago

Hi, I'm building an objective benchmark suite for open-source LLMs; it currently includes MMLU and BBH, which were used in the GPT-4 paper. I'm really excited about OA and would like to evaluate your 30B LLaMA model. Could you publish the weights (delta/LoRA/original versions) on the Hugging Face hub?

https://github.com/declare-lab/flan-eval
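For context on how the suite would consume the requested weights: once they are on the hub, they can be loaded like any other causal LM through transformers. A minimal sketch, with a placeholder model id (not a real repository):

```python
# Hedged sketch: loading a hub release and running a single prompt.
# "OpenAssistant/llama-30b-placeholder" is NOT a real repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenAssistant/llama-30b-placeholder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # device_map needs `accelerate`
)

prompt = "Question: What is the capital of France?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```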

olliestanley commented 1 year ago

> Could you publish the weights (delta/LoRA/original versions) on the Hugging Face hub?

OA model weight releases will start on April 15th; I think we plan to release the LLaMA 30B deltas on that date.
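To make the "deltas" part concrete: a delta release usually ships only the difference from the base LLaMA weights, which the user then merges locally. The thread does not specify OA's exact tooling (the eventual OA LLaMA releases used an XOR-based scheme), so the following is only a sketch of the common additive-delta approach, with placeholder paths and repo ids:

```python
# Hedged sketch: merging additive weight deltas onto a base LLaMA checkpoint.
# Paths and the delta repo id are placeholders; the actual OA release format
# and tooling are not defined by this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_PATH = "path/to/llama-30b-hf"          # original LLaMA weights, converted to HF format
DELTA_ID = "OpenAssistant/llama-30b-delta"  # placeholder, not a real repo id
OUT_PATH = "oasst-llama-30b-merged"

base = AutoModelForCausalLM.from_pretrained(BASE_PATH, torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained(DELTA_ID, torch_dtype=torch.float16)

# Add each delta tensor onto the corresponding base tensor, in place.
# Note: real OA checkpoints add special tokens, so embedding shapes may
# differ and need resizing first; this sketch ignores that.
delta_state = delta.state_dict()
for name, param in base.state_dict().items():
    param += delta_state[name]

base.save_pretrained(OUT_PATH)
AutoTokenizer.from_pretrained(DELTA_ID).save_pretrained(OUT_PATH)
```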

chiayewken commented 1 year ago

Great, looking forward to it :)

CarlKenner commented 1 year ago

Shouldn't their evals be used for training, not for evaluating?

tju01 commented 1 year ago

I'm interested in this issue and have started working on it.

tju01 commented 1 year ago

I have evaluated the OpenAssistant RLHF model and built a simple UI to view the scores as well as the model outputs, since the scores on their own can often be misleading about actual quality. The current version is here: https://tju01.github.io/oasst-openai-evals/. Click on a task name to see the evaluation details for that task. I still have several ideas for improvements, but first I have some questions related to the poor scores the OpenAssistant model obtains.

  1. OpenAI Evals relies heavily on a system message for the model. OpenAI's GPT models handle this just fine, but I'm not sure how to translate it for OpenAssistant models. I'm currently using the <|system|> token, since oasst-rlhf-2-llama-30b-7k-steps appears to have been trained with it (at least according to its added_tokens.json file), but I have doubts about whether that is really how the <|system|> token is used in OpenAssistant models. Possibly it would be better to use <|prefix_begin|> and <|prefix_end|>? Is that still used in the current models? (See the prompt-formatting sketch after this list.)
  2. I have currently evaluated the oasst-rlhf-2-llama-30b-7k-steps model. Since it is an RLHF model, I believe something like sampling n outputs and choosing the one with the best score according to the reward model (best-of-n, also sketched below) might improve results. But I would need access to the corresponding reward model for that. Is it available somewhere?
  3. I'm not sure how good the RLHF OpenAssistant model actually is. Maybe the SFT models are better right now? Which SFT model is oasst-rlhf-2-llama-30b-7k-steps derived from, and did the RLHF step actually improve these evaluations? I know that, generally speaking, RLHF is important.
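To make questions 1 and 2 concrete, here is a rough sketch of the prompt formatting and best-of-n reranking I have in mind. The exact template for oasst-rlhf-2-llama-30b-7k-steps and the matching reward model id are exactly what is unclear above, so every special-token string and model name below is an assumption.

```python
# Hedged sketch only: OA prompt formatting (question 1) and best-of-n
# reranking with a reward model (question 2). Token strings and the reward
# model id are assumptions, not confirmed by the OA team.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

POLICY_ID = "OpenAssistant/oasst-rlhf-2-llama-30b-7k-steps"
REWARD_ID = "OpenAssistant/some-reward-model"  # placeholder; see question 2

tokenizer = AutoTokenizer.from_pretrained(POLICY_ID)
policy = AutoModelForCausalLM.from_pretrained(
    POLICY_ID, torch_dtype=torch.float16, device_map="auto"
)


def build_prompt(system: str, user: str) -> str:
    # Question 1: assumes the <|system|> token wraps the system message;
    # older models used <|prefix_begin|>...<|prefix_end|> instead.
    return f"<|system|>{system}</s><|prompter|>{user}</s><|assistant|>"


def best_of_n(system: str, user: str, n: int = 4) -> str:
    # Question 2: sample n completions, then keep the one the reward model
    # scores highest (assumes a single-score reward head).
    prompt = build_prompt(system, user)
    inputs = tokenizer(prompt, return_tensors="pt").to(policy.device)
    outputs = policy.generate(
        **inputs, do_sample=True, top_p=0.9, temperature=0.8,
        max_new_tokens=256, num_return_sequences=n,
    )
    completions = [
        tokenizer.decode(o[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        for o in outputs
    ]

    rm_tokenizer = AutoTokenizer.from_pretrained(REWARD_ID)
    reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_ID)
    scores = []
    for completion in completions:
        # Assumes the reward model scores the fully formatted conversation.
        rm_inputs = rm_tokenizer(prompt + completion, return_tensors="pt", truncation=True)
        scores.append(reward_model(**rm_inputs).logits.squeeze().item())
    return completions[scores.index(max(scores))]
```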

tju01 commented 1 year ago

I've had my questions answered on the Discord server. I have done a basic evaluation of multiple models, but there is lots of room for improvement. I'm going to continue at https://github.com/tju01/oasst-automatic-model-eval, where I will also add support for other evaluation benchmarks; see #1908.