Thanks for your interest :) We would encourage you to continue evaluating on those benchmarks and to contribute the results to this repository. We also maintain a leaderboard where we could add these benchmarks: https://declare-lab.net/instruct-eval/
Hi, great work!
I am working on something similar for evaluating instruction-following language models. A big difference is that I focus primarily on zero-shot evaluation using the model-specific prompt format that was used during fine-tuning. Other than that, I use different benchmarks (OpenAI evals, Elo ranking, EvalPlus instead of HumanEval, and I am currently working on CoT evaluation), and I store the model outputs so that they can be inspected on the website. Also, I have evaluated far fewer models so far.
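For concreteness, here is a minimal sketch of what I mean by evaluating zero-shot with the model-specific prompt format. The template strings and names below are just hypothetical examples for illustration, not code from either project:

```python
# Sketch: wrap each zero-shot benchmark question in the prompt template
# the model was fine-tuned with. Templates here are illustrative only.

PROMPT_TEMPLATES = {
    # Alpaca-style instruction format
    "alpaca": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    ),
    # Vicuna-style chat format
    "vicuna": (
        "A chat between a curious user and an artificial intelligence assistant.\n"
        "USER: {instruction}\nASSISTANT: "
    ),
}

def build_prompt(model_name: str, instruction: str) -> str:
    """Format a benchmark question with the template matching the model's fine-tuning."""
    return PROMPT_TEMPLATES[model_name].format(instruction=instruction)

# The same zero-shot question is formatted differently per model before generation.
question = "What is the capital of France?"
for model_name in PROMPT_TEMPLATES:
    print(f"--- {model_name} ---")
    print(build_prompt(model_name, question))
```

The point is that the benchmark question itself stays identical; only the surrounding template changes to match each model's fine-tuning format, rather than using one shared few-shot prompt for all models.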
I plan to continue working on it, but I would like to avoid duplicating effort down the line, so I am wondering about the future direction of your project. Do you already have significant changes or additions planned? In particular, given the overlap, do you plan to do zero-shot evaluation using the prompt formats that the models were fine-tuned with?