URL

https://arxiv.org/abs/2411.00331
Authors
- Chumeng Jiang
- Jiayin Wang
- Weizhi Ma
- Charles L. A. Clarke
- Shuai Wang
- Chuhan Wu
- Min Zhang
Abstract
- With the rapid development of Large Language Models (LLMs), recent studies employed LLMs as recommenders to provide personalized information services for distinct users. Despite efforts to improve the accuracy of LLM-based recommendation models, relatively little attention is paid to beyond-utility dimensions. Moreover, there are unique evaluation aspects of LLM-based recommendation models, which have been largely ignored. To bridge this gap, we explore four new evaluation dimensions and propose a multidimensional evaluation framework. The new evaluation dimensions include: 1) history length sensitivity, 2) candidate position bias, 3) generation-involved performance, and 4) hallucinations. All four dimensions have the potential to impact performance, but are largely unnecessary for consideration in traditional systems. Using this multidimensional evaluation framework, along with traditional aspects, we evaluate the performance of seven LLM-based recommenders, with three prompting strategies, comparing them with six traditional models on both ranking and re-ranking tasks on four datasets. We find that LLMs excel at handling tasks with prior knowledge and shorter input histories in the ranking setting, and perform better in the re-ranking setting, beating traditional models across multiple dimensions. However, LLMs exhibit substantial candidate position bias issues, and some models hallucinate non-existent items much more often than others. We intend our evaluation framework and observations to benefit future research on the use of LLMs as recommenders. The code and data are available at https://github.com/JiangDeccc/EvaLLMasRecommender.
Translation (by gpt-4o-mini)
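With the rapid development of large language models (LLMs), recent studies have tried using LLMs as recommenders to provide personalized information services to different users. While efforts have been made to improve the accuracy of LLM-based recommendation models, comparatively little attention has been paid to dimensions beyond utility. In addition, LLM-based recommendation models have evaluation aspects of their own, which have so far been largely ignored. To bridge this gap, we explore four new evaluation dimensions and propose a multidimensional evaluation framework. The new evaluation dimensions are: 1) history length sensitivity, 2) candidate position bias, 3) generation-involved performance, and 4) hallucinations. These four dimensions can affect performance but rarely need to be considered in traditional systems. Using this multidimensional evaluation framework together with traditional aspects, we evaluate the performance of seven LLM-based recommenders. Using three prompting strategies, we compare them with six traditional models on ranking and re-ranking tasks across four datasets. We find that LLMs excel at tasks involving prior knowledge and shorter input histories in the ranking setting, and also outperform traditional models across multiple dimensions in the re-ranking setting. However, LLMs show considerable candidate position bias, and some models hallucinate non-existent items more often than others. We hope that our evaluation framework and observations will benefit future research on using LLMs as recommenders. The code and data are available at https://github.com/JiangDeccc/EvaLLMasRecommender.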
Summary (by gpt-4o-mini)
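Proposes new evaluation dimensions for using LLMs as recommenders and builds a multidimensional evaluation framework. The dimensions are history length sensitivity, candidate position bias, generation-involved performance, and hallucinations, and seven LLM-based recommenders are evaluated. The results show that LLMs perform well in the ranking setting, while candidate position bias and hallucination problems are also confirmed. The proposed framework is expected to contribute to future research.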
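
To make two of the evaluation dimensions more concrete, here is a minimal Python sketch of how hallucinations and candidate position bias could be measured. It is an illustration only and does not reproduce the paper's metric definitions. The function names `hallucination_rate` and `top1_position_histogram`, and the assumption that items are plain string IDs, are hypothetical.

```python
from typing import List, Sequence


def hallucination_rate(recommended: Sequence[str], candidates: Sequence[str]) -> float:
    """Fraction of recommended items that are absent from the candidate set.

    Assumption: a "hallucination" is any returned item ID that was not
    among the candidates shown to the model.
    """
    if not recommended:
        return 0.0
    candidate_set = set(candidates)
    missing = sum(1 for item in recommended if item not in candidate_set)
    return missing / len(recommended)


def top1_position_histogram(
    candidate_lists: Sequence[Sequence[str]], top_picks: Sequence[str]
) -> List[int]:
    """Count how often the model's top pick came from each prompt position.

    Assumption: all candidate lists have the same length and were shuffled
    before prompting, so an unbiased model should produce roughly uniform
    counts. Picks that are not in the candidate list are skipped.
    """
    if not candidate_lists:
        return []
    counts = [0] * len(candidate_lists[0])
    for candidates, pick in zip(candidate_lists, top_picks):
        if pick in candidates:
            counts[list(candidates).index(pick)] += 1
    return counts


if __name__ == "__main__":
    print(hallucination_rate(["a", "x", "b", "y"], ["a", "b", "c"]))
    print(
        top1_position_histogram(
            [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]],
            ["a", "b", "z"],
        )
    )
```

A histogram that piles up at one position (for example, the first slot) would suggest that the model favors items by where they appear in the prompt and not by their relevance.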

Beyond Utility: Evaluating LLM as Recommender, Chumeng Jiang+, arXiv'24 #1481