URL

https://arxiv.org/abs/2212.10496
Authors
- Luyu Gao
- Xueguang Ma
- Jimmy Lin
- Jamie Callan
Abstract
- While dense retrieval has been shown effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
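Translation (by gpt-4o-mini)
Dense retrieval has been shown to be effective and efficient regardless of task or language, but building a fully zero-shot dense retrieval system remains difficult when no relevance labels are available. This paper acknowledges the difficulty of zero-shot learning and of encoding relevance. Instead, it proposes to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first gives a zero-shot instruction to an instruction-following language model (e.g. InstructGPT), which generates a hypothetical document. This document captures relevance patterns, but it does not actually exist and may contain false details. Next, an encoder trained with unsupervised contrastive learning (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, and similar real documents are retrieved based on vector similarity. This second step grounds the generated document in the actual corpus, and the encoder's dense bottleneck filters out the incorrect details. The experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and achieves strong performance comparable to fine-tuned retrievers across a variety of tasks (e.g. web search, QA, fact verification) and languages (e.g. Swahili, Korean, Japanese).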
Summary (by gpt-4o-mini)
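This work proposes Hypothetical Document Embeddings (HyDE) for building zero-shot dense retrieval systems. Given a query, an instruction-following language model generates a hypothetical document, and an encoder trained without supervision converts it into an embedding vector. Retrieving similar documents from the actual corpus then filters out the incorrect details. In the experiments, HyDE outperformed the state-of-the-art unsupervised dense retriever Contriever and showed strong performance across a variety of tasks and languages.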
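
The two-step pipeline described in the abstract (generate a hypothetical document, then encode it and search the corpus by vector similarity) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encode` is a hypothetical bag-of-words stand-in for a learned encoder such as Contriever, `fake_generator` is a hypothetical stand-in for a call to an instruction-following language model, and the corpus and query are made up for the example.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_encode(text: str) -> Vector:
    """Assumed stand-in encoder: L2-normalised bag-of-words (NOT Contriever)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Dot product of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    corpus: List[str],
    top_k: int = 2,
) -> List[Tuple[float, str]]:
    """Retrieve real documents that are closest to a generated hypothetical one."""
    # Step 1: ask the generator for a hypothetical document answering the query.
    hypothetical = generate(query)
    # Step 2: embed the hypothetical document (not the query) and rank the
    # real corpus documents by similarity to that embedding.
    h_vec = toy_encode(hypothetical)
    scored = [(cosine(h_vec, toy_encode(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def fake_generator(query: str) -> str:
    """Hypothetical stand-in for an LLM call; its details may be false."""
    return (
        "Wisdom teeth are usually removed in a short surgery; "
        "recovery from the extraction takes about two weeks."
    )


# Made-up corpus for illustration only.
corpus = [
    "Recovery after wisdom teeth extraction surgery typically takes a few days to a week.",
    "The stock market closed higher on Friday after strong earnings reports.",
    "Dense retrieval encodes queries and documents into a shared vector space.",
]

results = hyde_search(
    "how long does it take to heal after getting wisdom teeth out",
    fake_generator,
    corpus,
    top_k=1,
)
print(results[0][1])
```

Note that the generated text never reaches the user in this sketch: it only serves as a search key, and what comes back is a real corpus document. That is how the abstract describes grounding, although a bag-of-words encoder does not reproduce the dense bottleneck of a learned encoder.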


Precise Zero-Shot Dense Retrieval without Relevance Labels, Luyu Gao+, arXiv'22 #1498
