Holmes-Benchmark / holmes-evaluation


Project Page | Explorer πŸ”Ž | Leaderboard πŸš€

Holmes πŸ”Ž is a benchmark dedicated to assessing the linguistic competence of language models.

Missing a specific language model? Either email us or evaluate it yourself. πŸ‘‡

*(Figures: "Build Open-source Probing Datasets" and "Linguistic Phenomena Probing Datasets")*

πŸ”Ž How does it work?

πŸ”ŽοΈ Setting up the environment

To evaluate your chosen language model on the Holmes πŸ”Ž or FlashHolmes ⚑ benchmark, first make sure your setup meets the repository's requirements.
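As a minimal sketch (assuming a standard Python project that ships a requirements.txt; the clone URL is derived from the project name), the setup could look like this:

```bash
# Clone the repository and enter it.
git clone https://github.com/Holmes-Benchmark/holmes-evaluation.git
cd holmes-evaluation

# Create and activate an isolated Python environment.
python -m venv .venv
source .venv/bin/activate

# Install the dependencies (assuming a requirements.txt is provided).
pip install -r requirements.txt
```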

πŸ”Ž Getting the data

Don't worry about parsing linguistic corpora and composing probing datasets: we already did that. Find the download instructions for Holmes πŸ”Ž (here) and for FlashHolmes ⚑ (here).

πŸ”Ž Investigate your language model using one command

For ease of use, you can evaluate a language model with no more than one command, along the lines of the sketch below.
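This is a sketch only: the flag names below are assumptions for illustration, not the script's confirmed interface; check `python src/investigate.py --help` for the actual arguments.

```bash
# Hypothetical invocation — argument names are illustrative assumptions:
#   --model_name: a Hugging Face model identifier to probe
#   --version:    which benchmark to run (holmes or flash-holmes)
python src/investigate.py --model_name bert-base-uncased --version holmes
```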

The investigation script (src/investigate.py) requires a few essential arguments, such as those shown in the sketch above.

After running all probes and evaluations, you will find the aggregated results in the results folder: results_holmes.csv for Holmes πŸ”Ž or results_flash-holmes.csv for FlashHolmes ⚑.
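For a quick look at the scores, something like the following works on most Unix systems (the CSV column layout is not specified here, so inspect the header first):

```bash
# Pretty-print the first rows of the aggregated Holmes results as a table.
column -s, -t < results/results_holmes.csv | head
```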

πŸ”Ž Disclaimer

We provide datasets in a unified format without endorsing their quality or fairness, and without confirming that you hold the rights to use them. Users must verify their permissions under each dataset's license and properly credit the dataset owner.

If you own a dataset and want to update or remove it from our library, please contact us. Additionally, if you wish to include your dataset or model for evaluation, feel free to contact us as well!
