declare-lab / instruct-eval

This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
https://declare-lab.github.io/instruct-eval/
Apache License 2.0

Future directions #11

Closed tju01 closed 1 year ago

tju01 commented 1 year ago

Hi, great work!

I am working on something similar for evaluating instruction-following language models. A big difference is that I focus primarily on zero-shot evaluation using the model-specific prompt format that was used during fine-tuning. Other than that, I use different benchmarks (OpenAI evals, Elo ranking, EvalPlus instead of HumanEval, and I am now working on CoT evaluation) and store the model outputs so that they can be inspected on the website. Also, right now I have evaluated far fewer models.

I plan to continue working on it, but I would like to avoid future duplicate work. I therefore wonder about the future direction of your project. Do you already have significant changes or additions planned? And especially, considering the overlap, do you plan to do zero-shot evaluation using the corresponding prompt formats that the models have been fine-tuned with?
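To make the prompt-format point concrete, here is a minimal sketch of what zero-shot evaluation with the model's own fine-tuning prompt format could look like. The Alpaca and Flan-T5 template strings follow their published conventions; the model keys and the `build_prompt` helper are illustrative placeholders, not part of either codebase.

```python
# Sketch: wrap each benchmark question in the prompt format the model was tuned on.
PROMPT_TEMPLATES = {
    # Alpaca was fine-tuned with this instruction wrapper.
    "alpaca": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    ),
    # Flan-T5 takes the bare instruction text.
    "flan-t5": "{instruction}",
}

def build_prompt(model_name: str, instruction: str) -> str:
    """Render a benchmark question in the model-specific prompt format."""
    template = PROMPT_TEMPLATES.get(model_name, "{instruction}")
    return template.format(instruction=instruction)

# Example: the same question rendered for two models.
question = "Which planet is known as the Red Planet? Answer:"
for name in ("alpaca", "flan-t5"):
    print(f"--- {name} ---")
    print(build_prompt(name, question))
```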

soujanyaporia commented 1 year ago

Thanks for your interest :) We would encourage you to continue evaluating on those benchmarks and to push your additions to this repository. We maintain a leaderboard where we could add these benchmarks as well: https://declare-lab.net/instruct-eval/