AF Survey - Active Retrieval Augmented Generation

Active Retrieval Augmented Generation

Review author: Pranav Guruprasad

Summary:

The authors propose FLARE (Forward-Looking Active REtrieval augmented generation), a retrieval-augmented generation (RAG) method that iteratively predicts the upcoming sentence and uses that prediction as a query to retrieve relevant documents, but only when the predicted sentence contains low-confidence tokens. Large Language Models (LLMs) are increasingly applied to complex tasks that involve long-form text generation, such as long-form QA, open-domain summarization, and Chain-of-Thought (CoT) reasoning. These tasks require gathering knowledge throughout the generation process, much as humans gradually gather information when writing papers, essays, or books. FLARE aims to provide an efficient and intelligent way to do this, improving upon static and fixed-interval RAG methods.

The authors propose two FLARE methods, FLARE instruct and FLARE direct.

Inspired by Toolformer, in the FLARE instruct method the LLM is shown exemplars that teach it to generate a “[Search(query)]” token when additional information is required. This brings up 2 issues:

LLMs tend to generate fewer queries than necessary
LLMs can also generate excessive search queries, which in turn disrupts answer generation

The authors address these 2 issues using two methods:

Increasing the logit of the token “[” by 2.0 to improve the chances of the LLM generating the “[Search(query)]” token
After the LLM generates a search query, generating the next few tokens while forbidding “[” by adding a large negative value to the logit of “[”
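
The two logit adjustments above can be illustrated with a minimal sketch. The toy vocabulary, the logit values, and the helper names `apply_logit_bias` and `softmax` are assumptions made for illustration; only the +2.0 boost and the idea of a large negative bias come from the review. The value -100.0 is an arbitrary stand-in for "large negative".

```python
import math

def apply_logit_bias(logits, bias):
    """Return a copy of `logits` with per-token additive biases applied."""
    return {tok: value + bias.get(tok, 0.0) for tok, value in logits.items()}

def softmax(logits):
    """Convert a token -> logit mapping into a token -> probability mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy next-token logits (made-up numbers, not taken from the paper).
logits = {"[": 0.0, "the": 1.5, "Joe": 1.0}

encourage = {"[": 2.0}    # boost "[" so a search query becomes more likely
forbid = {"[": -100.0}    # assumed "large negative value", used right after a query

p_base = softmax(logits)
p_boost = softmax(apply_logit_bias(logits, encourage))
p_forbid = softmax(apply_logit_bias(logits, forbid))

print(f"P('[') base={p_base['[']:.3f} boosted={p_boost['[']:.3f} forbidden={p_forbid['[']:.3e}")
```

With a hosted LLM, the same effect would typically be requested through whatever per-token bias option the API exposes rather than by editing logits directly.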

However, the authors note that since black-box LLMs cannot be fine-tuned, the queries that FLARE instruct generates through retrieval instructions alone might not be reliable. This leads them to propose FLARE direct.

In the FLARE direct method, at step t, the LLM first generates a temporary next sentence without conditioning on retrieved documents. If the LLM is confident about this temporary next sentence, it is accepted for the next step of the task without retrieving additional information. If not, the temporary sentence is used to retrieve documents using one of 2 query formulation methods:

The first method masks out low-confidence tokens in the temporary next sentence (tokens with probabilities below a certain threshold) and uses the masked sentence as a query to retrieve additional information. The masking removes potential distractions from the sentence, resulting in improved retrieval accuracy.
The second method generates explicit questions from the temporary next sentence. For example, if the LLM is uncertain about a token span such as “University of Pennsylvania”, a question like “Which university did XYZ attend?” is used to help retrieve relevant information.
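
A minimal sketch of one FLARE direct step with the masking-based query, under several assumptions: `generate` and `retrieve` are hypothetical callables standing in for the LLM and the retriever, the tentative sentence is represented as (token, probability) pairs, the threshold values are placeholders rather than the paper's settings, and the sentence is regenerated conditioned on the retrieved documents once retrieval is triggered.

```python
def needs_retrieval(token_probs, trigger_threshold):
    """True if any token of the tentative sentence falls below the threshold."""
    return any(prob < trigger_threshold for _, prob in token_probs)

def masked_query(token_probs, mask_threshold):
    """Drop low-confidence tokens and join the remaining ones into a query."""
    return " ".join(tok for tok, prob in token_probs if prob >= mask_threshold)

def flare_direct_step(context, generate, retrieve,
                      trigger_threshold=0.5, mask_threshold=0.4):
    """Return (accepted sentence, query used or None) for one generation step."""
    tentative = generate(context, docs=None)      # no retrieved documents yet
    if not needs_retrieval(tentative, trigger_threshold):
        return " ".join(tok for tok, _ in tentative), None
    query = masked_query(tentative, mask_threshold)
    docs = retrieve(query)
    regenerated = generate(context, docs=docs)    # now conditioned on the documents
    return " ".join(tok for tok, _ in regenerated), query

# Toy stand-ins for the LLM and the retriever (illustrative only).
def toy_generate(context, docs=None):
    if docs is None:
        return [("Joe", 0.9), ("attended", 0.8), ("Harvard", 0.2)]
    return [("Joe", 0.9), ("attended", 0.9), ("Penn", 0.9)]

def toy_retrieve(query):
    return ["Joe studied at Penn."]

sentence, query = flare_direct_step("Where did Joe study?", toy_generate, toy_retrieve)
print(sentence, "| query:", query)
```

In this toy run the low-confidence token "Harvard" is masked out, so the retriever only sees "Joe attended" and is not steered toward a possibly wrong entity.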

For retrieval, the authors use off-the-shelf retrievers such as BM25.
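
For readers unfamiliar with BM25, the following is a minimal sketch of Okapi BM25 scoring with whitespace tokenization. It is not the authors' retrieval setup; the defaults `k1=1.5` and `b=0.75` are common choices assumed here, and the IDF uses the non-negative "+1" variant.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1.0) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "joe attended the university of pennsylvania",
    "paris is the capital of france",
    "bm25 is a ranking function",
]
scores = bm25_scores("which university did joe attend", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Because BM25 matches on exact terms, an erroneous token in the query can pull in documents about the wrong entity, which is one way to read the motivation for masking low-confidence tokens.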

FLARE outperforms all baselines on various tasks/datasets such as 2WikiMultihopQA, StrategyQA, ASQA, ASQA-hint, and WikiAsp. The authors also conduct extensive ablation studies to analyze the importance of forward-looking retrieval, the importance of active retrieval, and the effectiveness of different query formulation methods.

Motivation:

Improving upon existing static/fixed-interval RAG approaches by increasing efficiency and accuracy in retrieval-based text generation tasks
Providing a generalized approach to aid LLMs in long-form text generation tasks. LLMs do well in short-form text generation tasks out-of-the-box, but can underperform in other tasks due to length of conversation, increased amount of context provided, and requirement of niche domain knowledge.
Leveraging the observation that LLMs tend to be well calibrated, so low probability/confidence often indicates a lack of knowledge

Experiments and Results:

The authors tested FLARE on 4 well-known text generation tasks:
- Multihop QA:
  - Goal - answer complex questions through a process of information retrieval and reasoning
  - Dataset - 2WikiMultihopQA
  - Process - Follow Wang et al. to generate the CoT reasoning process and the final answer, using BM25 as the retriever and Wikipedia articles as the retrieval corpus.
- Commonsense reasoning:
  - Goal - requires system to utilize world and commonsense knowledge to generate an answer
  - Dataset - StrategyQA
  - Process - Follow Wei et al. to generate the CoT reasoning process and the final yes/no answer
- Long-form QA:
  - Goal - To generate comprehensive answers to questions seeking complex information
  - Dataset - ASQA
  - Process - Apart from the original task, the authors create another setting (referred to as ASQA-hint) in which they provide brief and generic hints to keep the LLM on track when generating answers, because the task is complex and ambiguous. They manually annotate 8 exemplars for few-shot learning, and use BM25 over the Wikipedia corpus
- Open-domain Summarization:
  - Goal - Generate a comprehensive summary about a specific topic by gathering information from the open web.
  - Dataset - WikiAsp
  - Process - Convert the original WikiAsp into an open-domain setting by removing the associated references and instead gathering information from the open web. Manually annotate 4 exemplars for few-shot learning, and use the Bing search engine to retrieve documents from the open web.
FLARE outperforms all baselines on all tasks/datasets, with Multihop QA showing the most significant improvement (attributed to the task’s clear definition and specific objective of producing the answer through a 2-step reasoning process). The improvement on ASQA-hint is larger than that on ASQA, as the hints reduce ambiguity.
Authors conduct thorough ablations, and find that:
- Forward-looking retrieval is more powerful than past-context-based retrieval
- Depending on the task/dataset, triggering retrieval for 40%-60% of sentences on average usually leads to good performance.
- Retrieving directly with the complete sentence is worse than masking tokens with low probabilities, thus confirming their hypothesis that low-confidence erroneous tokens can distract retrievers
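
To make the "percentage of sentences that trigger retrieval" concrete, here is a small sketch that computes the trigger rate as a function of the confidence threshold. The per-token probabilities are made-up numbers, and `retrieval_rate` is a hypothetical helper, not something from the paper.

```python
def retrieval_rate(sentences, threshold):
    """Fraction of sentences with at least one token probability below `threshold`."""
    triggered = sum(1 for probs in sentences if min(probs) < threshold)
    return triggered / len(sentences)

# Each inner list holds the token probabilities of one generated sentence (toy data).
toy_sentences = [
    [0.90, 0.80, 0.95],
    [0.90, 0.30, 0.70],
    [0.60, 0.55, 0.90],
    [0.99, 0.97, 0.20],
]

for threshold in (0.0, 0.25, 0.5, 0.75, 1.0):
    rate = retrieval_rate(toy_sentences, threshold)
    print(f"threshold={threshold:.2f} -> retrieval triggered for {rate:.0%} of sentences")
```

A threshold of 0 never retrieves and a threshold of 1 retrieves for every sentence; the ablation result above corresponds to picking a threshold somewhere in between.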

Limitations:

The authors do not take into consideration the possibility of the entire predicted next sentence being erroneous or of low confidence. Observing only low-confidence token spans can miss such cases and, especially in long-form text generation tasks, lead to cascading errors over the course of the task. A measure that reflects the confidence of the overall new sentence given the input (like text entailment) would result in a more reliable process of answer generation.
As the authors mention, from an engineering perspective, the interleaving of generation with retrieval using a naive implementation increases both overheads and cost of generation. The LM needs to be activated multiple times, and a caching-free implementation will require recomputing the previous activation each time after a retrieval.
The authors find that FLARE does not provide significant gains on knowledge-intensive dialogue generation datasets like Wizard of Wikipedia, or on long-form QA datasets requiring in-depth answers to open-ended questions like ELI5. Due to issues such as difficulties in grounding generation in retrieval and in evaluation, FLARE barely improves over not using retrieval on these tasks.
The authors do not explore, experiment, or conduct ablations with various retrieval methods, and stick to just BM25. For example, Dense Passage Retrieval by Karpukhin et al., an embedding-based approach, has in the past beaten strong BM25-based information retrieval systems.

Significance:

FLARE presents a novel method to improve RAG, and aids LLMs in performing better and more efficiently in long-form text generation tasks
Can have significant impact in curbing hallucinations by LLMs, and increasing reliability in outputs of LLM-based systems
Provides a generalized approach to build a knowledge-based LLM agent
Research in the RAG direction is very useful, as it will eventually allow users with a well-defined knowledge base or corpus of documents to use LLMs out-of-the-box without any fine-tuning, for both niche and generalized purposes (e.g., a conversational agent for a company’s website)

Future work:

This work sets the foundation for active RAG methods, and thus future directions could include looking at more intelligent and efficient alternatives for active retrieval
Another direction for future work could be to look into the development of LM architectures for efficient active RAG
Exploring better and more nuanced retrieval methods
Coming up with a metric/procedure to assess the overall confidence of a sentence being predicted based on a given input, and how well it fits into a long conversation/text generation task.

Related work:

Using few-shot learning to overcome challenges with respect to grounding in factual/up-to-date information: Internet-augmented language models through few-shot prompting for open-domain question answering by Lazaridou et al. (DeepMind) - Motivated by semi-parametric language models (LMs), which ground their decisions in external retrieved evidence, the authors use few-shot prompting to learn to condition LMs on information returned from the web using Google Search. Their approach does not involve fine-tuning or learning additional parameters, thus making it applicable to any LM and offering a strong baseline. Relevant as it is another approach to ground LLMs in open-domain QA tasks without fine-tuning.
ReFeed, a novel pipeline designed to enhance LLMs by providing automatic retrieval feedback without the need for fine-tuning: Improving Language Models via Plug-and-Play Retrieval Feedback by Yu et al. (University of Notre Dame, Allen Institute for AI) - ReFeed first generates initial outputs, then utilizes a retrieval model to acquire relevant information from large document collections, and finally incorporates the retrieved information into the in-context demonstration for output refinement, thereby addressing the limitations of LLMs in a more efficient and cost-effective manner. Relevant as it is a very similar approach to FLARE, and addresses the same issues in LLMs.

Paper link: Active Retrieval Augmented Generation
