URL

https://arxiv.org/abs/2411.02571
Authors
- Sheng-Chieh Lin
- Chankyu Lee
- Mohammad Shoeybi
- Jimmy Lin
- Bryan Catanzaro
- Wei Ping
Abstract
- State-of-the-art retrieval models typically address a straightforward search scenario, where retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but underperforms a smaller CLIP retriever in cross-modal retrieval tasks due to modality bias from MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose to continually fine-tune the universal multimodal retriever to enhance its text retrieval capability while maintaining multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on MTEB retrieval benchmark. Finally, we explore to prompt the off-the-shelf MLLMs as the zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way to advance universal multimodal retrieval in the future.
Translation (by gpt-4o-mini)
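State-of-the-art information retrieval models typically address a fixed search scenario (e.g., finding a passage that answers a specific question) and support only a single modality for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario called "universal multimodal retrieval" that accommodates multiple modalities and diverse retrieval tasks. To this end, the authors first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. The empirical results show that the fine-tuned MLLM retriever can understand challenging queries composed of both text and image, but that it underperforms a smaller CLIP retriever on cross-modal retrieval tasks because of modality bias. To address this, they propose modality-aware hard negative mining, which mitigates the modality bias exhibited by MLLM retrievers. Next, they propose continually fine-tuning the universal multimodal retriever to improve its text retrieval capability while maintaining its multimodal retrieval capability. As a result, their model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, and surpasses the state-of-the-art text retrieval model NV-Embed-v1 on the MTEB retrieval benchmark. Finally, they explore prompting off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of candidates returned by the multimodal retriever. They find that prompt-and-reranking lets MLLMs further improve multimodal retrieval when user queries (e.g., queries composed of text and image) are more complex and harder to understand. These findings pave the way for advancing universal multimodal retrieval in the future.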
Summary (by gpt-4o-mini)
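This paper proposes techniques for "universal multimodal retrieval" with multimodal large language models (MLLMs), covering multiple modalities and retrieval tasks. Experiments on 10 datasets with 16 retrieval tasks show that an MLLM retriever can understand queries composed of text and image, but that modality bias makes it underperform a smaller CLIP retriever on cross-modal retrieval. To address this, the authors propose modality-aware hard negative mining, and they improve text retrieval capability through continual fine-tuning. As a result, MM-Embed achieves state-of-the-art performance on the M-BEIR benchmark and surpasses NV-Embed-v1 on the MTEB retrieval benchmark. The authors also show that zero-shot reranking with off-the-shelf MLLMs can improve multimodal retrieval for complex queries. These results contribute to the future development of universal multimodal retrieval.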
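
The abstract names modality-aware hard negative mining but does not describe the procedure, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes a bi-encoder that has already produced embeddings, ranks candidates by cosine similarity, and then separates two kinds of negatives: candidates of the wrong modality that outrank the labeled positive (a symptom of modality bias), and candidates of the requested modality that are not the positive. The function names, the `k` parameter, and the toy vectors are all hypothetical.

```python
import numpy as np


def rank_candidates(query_emb, cand_embs):
    """Return candidate indices sorted by cosine similarity, best first."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores, kind="stable")


def mine_modality_aware_negatives(query_emb, cand_embs, cand_modalities,
                                  positive_idx, target_modality, k=2):
    """Split hard negatives by modality (an assumed reading of the abstract).

    wrong_modality: candidates ranked above the positive whose modality
                    differs from the one the task asks for.
    same_modality:  highest-ranked candidates of the requested modality,
                    excluding the labeled positive.
    """
    order = rank_candidates(query_emb, cand_embs)
    pos_rank = int(np.where(order == positive_idx)[0][0])
    wrong_modality = [int(i) for i in order[:pos_rank]
                      if cand_modalities[i] != target_modality]
    same_modality = [int(i) for i in order
                     if i != positive_idx
                     and cand_modalities[i] == target_modality]
    return wrong_modality[:k], same_modality[:k]


# Toy example: a query whose labeled answer is an image (index 1),
# but a text candidate (index 0) scores higher than the positive.
query = np.array([1.0, 0.0])
candidates = np.array([
    [1.0, 0.1],   # 0: text, very close to the query
    [1.0, 0.5],   # 1: image, labeled positive
    [0.5, 1.0],   # 2: image, less relevant
    [0.0, 1.0],   # 3: text, unrelated
])
modalities = ["text", "image", "image", "text"]

wrong, same = mine_modality_aware_negatives(
    query, candidates, modalities, positive_idx=1, target_modality="image"
)
```

In this toy setup `wrong` contains the text candidate that outranks the image positive and `same` contains the remaining image candidate. Feeding both groups back into contrastive training is one way a retriever could be pushed to respect the requested modality.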

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs, Sheng-Chieh Lin+, arXiv'24 #1491