Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To address these diverse information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are key to UniIR's generalization ability. Additionally, we construct M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
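To make the idea of instruction-guided retrieval concrete, below is a minimal sketch using a CLIP-style dual encoder, where the task instruction is prepended to the query text before encoding and candidates are ranked by cosine similarity. This is not the authors' UniIR implementation: the checkpoint (openai/clip-vit-base-patch32), the instruction wording, and the prepend-style conditioning are illustrative assumptions.

```python
# A minimal, hypothetical sketch of instruction-guided retrieval with a
# CLIP-style dual encoder; NOT the authors' UniIR implementation. The
# checkpoint, instruction strings, and "prepend the instruction to the
# query" conditioning are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(texts: list[str]) -> torch.Tensor:
    """Encode texts into unit-norm embeddings."""
    inputs = processor(text=texts, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_images(images: list[Image.Image]) -> torch.Tensor:
    """Encode images into unit-norm embeddings."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# The instruction specifies which retrieval task to perform; different
# instructions steer the same query toward different target modalities.
instruction = "Find an image that matches the given caption."
query = "a dog catching a frisbee on the beach"

# Placeholder candidate pool (solid-color images stand in for a real corpus).
candidates = [Image.new("RGB", (224, 224), c) for c in ("red", "green", "blue")]

q = embed_text([f"{instruction} {query}"])  # instruction-conditioned query
c = embed_images(candidates)                # candidate embeddings
scores = (q @ c.T).squeeze(0)               # cosine similarity (unit-norm)
print("best candidate index:", scores.argmax().item())
```

Swapping the instruction (e.g., to retrieve text passages rather than images) would route the same query to a different candidate pool, which is the behavior that instruction tuning aims to induce.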