This repo includes the papers discussed in our survey paper A Survey on LLM-as-a-Judge.
Feel free to cite it if you find our survey useful for your research:
@article{gu2024surveyllmasajudge,
  title   = {A Survey on LLM-as-a-Judge},
  author  = {Jiawei Gu and Xuhui Jiang and Zhichao Shi and Hexiang Tan and Xuehao Zhai and Chengjin Xu and Wei Li and Yinghan Shen and Shengjie Ma and Honghao Liu and Yuanzhuo Wang and Jian Guo},
  year    = {2024},
  journal = {arXiv preprint arXiv:2411.15594}
}
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges Preprint
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes [Paper] [Code], 2024.07
A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators AAAI 2024
Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li [Paper] [Code], 2024.01
Large Language Models Cannot Self-Correct Reasoning Yet ICLR 2024
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou [Paper], 2024.05
Large Language Models are not Fair Evaluators ACL 2024
Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, Zhifang Sui [Paper] [Code], 2023.08
Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models ACL 2024
Abhishek Kumar, Sarfaroz Yunusov, Ali Emami [Paper] [Code], 2024.06
Are LLM-based Evaluators Confusing NLG Quality Criteria? ACL 2024
Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan [Paper] [Code], 2024.06
Likelihood-based Mitigation of Evaluation Bias in Large Language Models ACL 2024 findings
Masanari Ohi, Masahiro Kaneko, Ryuto Koike, Mengsay Loem, Naoaki Okazaki [Paper], 2024.05
Can Large Language Models Be an Alternative to Human Evaluations? ACL 2023
Cheng-Han Chiang, Hung-yi Lee [Paper], 2023.05
Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks EMNLP 2023
Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan [Paper] [Code], 2023.10
Is ChatGPT a Good NLG Evaluator? A Preliminary Study NewSumm @ EMNLP 2023
Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou [Paper] [Code], 2023.10
Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency? NAACL 2024 findings
Nathan Brake, Thomas Schaaf [Paper], 2024.04
Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks COLING 2024
Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study Preprint
Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu [Paper] [Code], 2023.09
Humans or LLMs as the Judge? A Study on Judgement Biases Preprint
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang [Paper], 2024.06
On the Limitations of Fine-tuned Judge Models for LLM Evaluation Preprint
Hui Huang, Yingqi Qu, Hongli Zhou, Jing Liu, Muyun Yang, Bing Xu, Tiejun Zhao [Paper] [Code], 2024.06
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment Preprint
Vyas Raina, Adian Liusie, Mark Gales [Paper] [Code], 2024.07
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates Preprint
Hui Wei, Shenghua He, Tian Xia, Andy Wong, Jingyang Lin, Mei Han [Paper] [Code], 2024.08
On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs ICLR 2024 (oral)
Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu [Paper] [Code], 2023.12
Generative Judge for Evaluating Alignment ICLR 2024
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu [Paper] [Code], 2023.12
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization ICLR 2024
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang [Paper] [Code], 2024.05
Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph ICLR 2024
Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, Heung-Yeung Shum, Jian Guo [Paper] [Code], 2024.05
HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition ACL 2024
Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang [Paper], 2024.02
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation ACL 2024
Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng [Paper] [Code], 2024.06
FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model ACL 2024
Yebin Lee, Imseong Park, Myungjoo Kang [Paper] [Code], 2024.06
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models ACL 2024
Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Wei Ye, Jindong Wang, Xing Xie, Yue Zhang, Shikun Zhang [Paper] [Code], 2024.06
ProxyQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models ACL 2024
Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, Yunlong Feng, Xiaoguang Li, Yasheng Wang, Lifeng Shang, Qun Liu, Linqi Song [Paper] [Code], 2024.06
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation ACL 2024
Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang [Paper] [Code], 2024.06
Aligning Large Language Models by On-Policy Self-Judgment ACL 2024
Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu [Paper] [Code], 2024.06
FineSurE: Fine-grained Summarization Evaluation using LLMs ACL 2024
Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour [Paper] [Code], 2024.07
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models ACL 2024 findings
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao [Paper] [Code], 2024.06
LLM-EVAL: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models NLP4ConvAI @ ACL 2023
Yen-Ting Lin, Yun-Nung Chen [Paper], 2023.05
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment EMNLP 2023
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu [Paper] [Code], 2023.05
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models EMNLP 2023
Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor [Paper] [Code], 2023.10
INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback EMNLP 2023
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, Lei Li [Paper], 2023.10
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation EMNLP 2023
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi [Paper] [Code], 2023.10
Revisiting Automated Topic Model Evaluation with Large Language Models EMNLP 2023 (short)
Dominik Stammbach, Vilém Zouhar, Alexander Hoyle, Mrinmaya Sachan, Elliott Ash [Paper] [Code], 2023.10
CLAIR: Evaluating Image Captions with Large Language Models EMNLP 2023 (short)
David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, John Canny [Paper] [Code], 2023.10
GENRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models NAACL 2024
Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han [Paper] [Code], 2024.02
GPTScore: Evaluate as You Desire NAACL 2024
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu [Paper] [Code], 2023.02
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation NAACL 2024
Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, Xian Li [Paper], 2024.06
A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models NAACL 2024 (short)
Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, Huan Sun [Paper] [Code], 2024.03
SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation NAACL 2024 findings
Hangfeng He, Hongming Zhang, Dan Roth [Paper] [Code], 2024.06
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators COLM 2024
Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulić, Anna Korhonen, Nigel Collier [Paper] [Code], 2024.08
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation NeurIPS 2023
Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, William Yang Wang [Paper] [Code], 2023.05
Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning NeurIPS 2023
Beichen Zhang, Kun Zhou, Xilin Wei, Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen [Paper] [Code], 2023.06
RRHF: Rank Responses to Align Language Models with Human Feedback without tears NeurIPS 2023
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang [Paper] [Code], 2023.10
Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2023
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao [Paper] [Code], 2023.10
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation NeurIPS 2023
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang [Paper] [Code], 2023.10
Self-Evaluation Guided Beam Search for Reasoning NeurIPS 2023
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, Michael Xie [Paper] [Code], 2023.10
Benchmarking Foundation Models with Language-Model-as-an-Examiner NeurIPS 2023
Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou [Paper] [Code], 2023.11
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena NeurIPS 2023
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica [Paper] [Code], 2023.12
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality Blog
Human-like Summarization Evaluation with ChatGPT Preprint
Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, Xiaojun Wan [Paper], 2023.04
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct Preprint
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Dongmei Zhang [Paper] [Code] [Model], 2023.08
JudgeLM: Fine-tuned Large Language Models are Scalable Judges Preprint
Lianghui Zhu, Xinggang Wang, Xinlong Wang [Paper] [Code], 2023.10
Goal-Oriented Prompt Attack and Safety Evaluation for LLMs Preprint
Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, Fei Wu [Paper] [Code], 2023.12
JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models Preprint
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators Preprint
Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto [Paper] [Code], 2024.04
OffsetBias: Leveraging Debiased Data for Tuning Evaluators Preprint
Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, Sanghyuk Choi [Paper] [Code], 2024.07
DHP Benchmark: Are LLMs Good NLG Evaluators? Preprint
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu [Paper], 2024.08
Generative Verifiers: Reward Modeling as Next-Token Prediction Preprint
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal [Paper], 2024.08
Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation Preprint
Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, Bahador Saket [Paper], 2024.09
LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization Preprint
Abhishek Kumar, Sonia Haiduc, Partha Pratim Das, Partha Pratim Chakrabarti [Paper] [Code], 2024.09
Reasoning with Language Model is Planning with World Model EMNLP 2023
Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, Zhiting Hu [Paper] [Code] [Reasoners] [Blog], 2023.05
Solving Math Word Problems via Cooperative Reasoning induced Language Models ACL 2023
Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, Yujiu Yang [Paper] [Code], 2023.07
Human-like Few-Shot Learning via Bayesian Reasoning over Natural Language NeurIPS 2023
Deductive Verification of Chain-of-Thought Reasoning NeurIPS 2023
Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, Hao Su [Paper] [Code], 2023.10
Language Models Can Improve Event Prediction by Few-Shot Abductive Reasoning NeurIPS 2023
Xiaoming Shi, Siqiao Xue, Kangrui Wang, Fan Zhou, James Zhang, Jun Zhou, Chenhao Tan, Hongyuan Mei [Paper] [Code], 2023.10
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models NeurIPS 2023
Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, Sibei Yang [Paper] [Code], 2023.10
Learning to Reason and Memorize with Self-Notes NeurIPS 2023
Jack Lanchantin, Shubham Toshniwal, Jason Weston, Arthur Szlam, Sainbayar Sukhbaatar [Paper], 2023.10
Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning NeurIPS 2023
Xiaoqian Wu, Yong-Lu Li, Jianhua Sun, Cewu Lu [Paper] [Code], 2023.11
Tree of Thoughts: Deliberate Problem Solving with Large Language Models NeurIPS 2023
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, Karthik Narasimhan [Paper] [Code], 2023.12
Understanding Social Reasoning in Language Models with Language Models NeurIPS 2023
Kanishk Gandhi, Jan-Philipp Fraenken, Tobias Gerstenberg, Noah Goodman [Paper] [Code], 2023.12
Automatic model selection with large language models for reasoning EMNLP 2023 findings
James Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, Michael Xie [Paper] [Code], 2023.10
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation Preprint
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang [Paper] [Code], 2024.02
Math-Shepherd: A Label-Free Step-by-Step Verifier for LLMs in Mathematical Reasoning Preprint
Peiyi Wang, Lei Li, Zhihong Shao, R.X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, Zhifang Sui [Paper], 2024.02
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents Preprint
Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov [Paper], 2024.08