Create Question-Relevant Content Pairs for Retrieval Testing

Goal:

Develop a method to generate pairs of questions and relevant content strings from a given dataset, aimed at enhancing retrieval testing. The relevant content should consist of collections of sentences from the source material that are necessary to derive the final answer, rather than being direct answers or chunks themselves. This approach will facilitate better retrieval testing, including the evaluation of chunking strategies.

Description

For effective retrieval testing, you need question-content pairs where the content is not simply the answer or a chunk of text directly related to the question. Instead, the content should be a curated collection of sentences from various parts of the provided text. These sentences should collectively contain the necessary information to answer the question, making the retrieval challenge more complex and representative of real-world scenarios. The goal is to assess retrieval performance by determining if these critical sentences are included in the chunks retrieved by the system.

Key requirements include:

Ability to input free text (such as a page from a book) and generate question-relevant content pairs from it.
Ensuring the relevant content is sourced from disparate parts of the 'free text', to simulate more challenging retrieval scenarios.
Evaluation of retrieval performance based on whether all these critical sentences are part of the retrieved chunks.
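
The evaluation requirement above could be sketched roughly as follows. This is only a minimal illustration, assuming the critical sentences appear verbatim (up to case and whitespace) inside the retrieved chunks; the function names `normalize`, `sentence_coverage`, and `all_retrieved` are hypothetical and not part of any existing codebase.

```python
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def sentence_coverage(critical_sentences: List[str], retrieved_chunks: List[str]) -> float:
    """Fraction of critical sentences found in at least one retrieved chunk."""
    if not critical_sentences:
        raise ValueError("critical_sentences must not be empty")
    chunks = [normalize(c) for c in retrieved_chunks]
    found = sum(any(normalize(s) in c for c in chunks) for s in critical_sentences)
    return found / len(critical_sentences)


def all_retrieved(critical_sentences: List[str], retrieved_chunks: List[str]) -> bool:
    """True only if every critical sentence is present in the retrieved chunks."""
    return sentence_coverage(critical_sentences, retrieved_chunks) == 1.0
```

Note that with substring matching, a sentence cut in half by a chunk boundary counts as missed, which is arguably the desired behaviour when comparing chunking strategies.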

Implementation Details

The project will involve:

Designing an algorithm or model that can analyze free text and identify segments of text that, together, can form the basis of a question-answer pair, with the emphasis on the answer being a coherent collection of information from across the text.
Developing a mechanism for automatically crafting questions based on the identified relevant content, ensuring the questions are clear, concise, and accurately represent the information contained within the content strings.
Implementing a test suite that uses these question-content pairs to evaluate the performance of retrieval systems, specifically looking at the system's ability to fetch chunks containing all parts of the relevant content.
Creating documentation and examples demonstrating how to use the generated pairs for retrieval testing effectively.
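
One possible shape for the generated pairs and the test suite is sketched below. Everything here is an assumption for illustration: the `QuestionContentPair` class, its fields, and the `run_suite` helper are hypothetical names, and `retrieve` stands in for whatever retrieval system is under test (any callable that takes a question and returns a list of chunk strings).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuestionContentPair:
    question: str
    sentences: List[str]         # critical sentences copied from the source text
    sentence_indices: List[int]  # position of each sentence in the source text

    def spread(self) -> int:
        """Distance, in sentences, between the first and last critical sentence."""
        return max(self.sentence_indices) - min(self.sentence_indices)


def run_suite(pairs: List[QuestionContentPair],
              retrieve: Callable[[str], List[str]]) -> float:
    """Fraction of pairs for which every critical sentence was retrieved."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    passed = 0
    for pair in pairs:
        chunks = retrieve(pair.question)
        # A pair passes only if each sentence appears in some retrieved chunk.
        if all(any(s in c for c in chunks) for s in pair.sentences):
            passed += 1
    return passed / len(pairs)
```

The `spread` value gives a simple way to check that the critical sentences really come from disparate parts of the text, for example by discarding pairs whose spread is below some threshold.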

Open for collaboration: This project is initially unassigned and open to anyone interested. Discussion and solution proposals can be exchanged in comments. Contributors with impactful pull requests may be considered for assignment.

Product Name

retrieval testing

Organization Name

Samagra

Domain

Data Science / Machine Learning

Tech Skills Needed

Python
Natural Language Processing (NLP)
Information Retrieval

Feature

Question-Content Pair Generation

Mentor(s)

@ChakshuGautam

Complexity

Medium

Hi @ChakshuGautam. There are two possible approaches to this. The first involves LLMs, LangChain, and tools such as OpenAI-based agents, but I feel this might be overkill for now. The second, which I think is more suitable at the moment, involves the following steps:

Preprocess the dataset to extract individual sentences or segments of text.
Use named entity recognition (NER), keyword extraction, or topic modeling to identify important elements in the text.
Segment the text into smaller portions based on the identified key concepts.
Use a transformer-based model to generate a question from each small text portion.
Use paraphrasing techniques to vary the wording of the generated questions.
Define validation metrics such as overlap and sentence coverage.
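
A rough sketch of the preprocessing, keyword identification, and overlap steps, using only the standard library. This is just an illustration under simplifying assumptions: a regex splitter and a stopword filter stand in for a real sentence tokenizer and NER or keyword extraction, "overlap" is interpreted here as Jaccard overlap between keyword sets, and all function names are hypothetical. Question generation and paraphrasing are left out because they need a model.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was",
             "it", "for", "by", "with", "that", "this", "as", "at", "are", "were"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keywords(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS}


def group_by_keyword(sentences, min_gap=2):
    """Map each keyword to the indices of the sentences containing it, keeping
    only keywords whose first and last occurrence are at least min_gap apart."""
    index = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for kw in keywords(sentence):
            index[kw].append(i)
    return {kw: ids for kw, ids in index.items() if ids[-1] - ids[0] >= min_gap}


def jaccard_overlap(question, content):
    """Jaccard overlap between the keyword sets of a question and its content."""
    q, c = keywords(question), keywords(content)
    return len(q & c) / len(q | c) if q | c else 0.0
```

Each group returned by `group_by_keyword` is a candidate set of sentences from different parts of the text, which could then be passed to the question generation step.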

Would like to discuss more and get your opinions.

Samagra-Development / ai-tools