Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast

Create Question-Relevant Content Pairs for Retrieval Testing #302

Open Gautam-Rajeev opened 8 months ago

Gautam-Rajeev commented 8 months ago

Goal:

Develop a method to generate pairs of questions and relevant content strings from a given dataset, aimed at improving retrieval testing. The relevant content should consist of collections of sentences from the source material that are necessary to derive the final answer, rather than direct answers or whole chunks. This approach will support better retrieval testing, including the testing of chunking strategies.

Description

For effective retrieval testing, you need question-content pairs where the content is not simply the answer or a chunk of text directly related to the question. Instead, the content should be a curated collection of sentences from various parts of the provided text. These sentences should collectively contain the necessary information to answer the question, making the retrieval challenge more complex and representative of real-world scenarios. The goal is to assess retrieval performance by determining if these critical sentences are included in the chunks retrieved by the system.
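The check described above (are the critical sentences present in the retrieved chunks?) could be sketched roughly as follows. This is only an illustration, not part of the repository: the function names `normalize` and `sentence_recall` and the toy data are hypothetical, and matching is assumed to be exact substring matching after lowercasing and whitespace normalization.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def sentence_recall(relevant_sentences, retrieved_chunks):
    """Fraction of relevant sentences found whole inside at least one retrieved chunk."""
    if not relevant_sentences:
        return 0.0
    chunks = [normalize(c) for c in retrieved_chunks]
    hits = sum(
        1
        for sentence in relevant_sentences
        if any(normalize(sentence) in chunk for chunk in chunks)
    )
    return hits / len(relevant_sentences)


# Toy data, made up purely for illustration.
relevant = [
    "The scheme was launched in 2019.",
    "Farmers receive three instalments per year.",
]
retrieved = [
    "Overview. The scheme was launched in 2019. It covers all states.",
    "Eligibility depends on land ownership records.",
]

score = sentence_recall(relevant, retrieved)
print(score)  # 0.5: only the first relevant sentence was retrieved
```

Under this assumption, a sentence that is split across a chunk boundary counts as a miss, which is one way such a score could expose weaknesses in a chunking strategy. A fuzzier match (token overlap, embeddings) may be preferable in practice.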

Key requirements include:

Implementation Details

The project will involve:

Open for collaboration: This project is initially unassigned and open to anyone interested. Discussion and solution proposals can be exchanged in comments. Contributors with impactful pull requests may be considered for assignment.

Product Name

retrieval testing

Organization Name

Samagra

Domain

Data Science / Machine Learning

Tech Skills Needed

Category

Feature

Question-Content Pair Generation

Mentor(s)

@ChakshuGautam

Complexity

Medium

kabirrajsingh commented 8 months ago

Hi @ChakshuGautam. There are two possible approaches to this. The first involves LLMs, LangChain, and agents that call the OpenAI API, but I feel this might be overkill for now. The second seems more suitable at the moment. It involves the following steps:

  1. Preprocess the dataset to extract individual sentences or segments of text.
  2. Use named entity recognition (NER), keyword extraction, or topic modeling to identify the important elements in the text.
  3. Segment the text into smaller portions based on the identified key concepts.
  4. Use a transformer-based model to generate a question from each small text portion.
  5. Use paraphrasing techniques to vary the wording of the generated questions.
  6. Compute validation metrics such as overlap and sentence coverage.
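As a rough sketch of steps 1 to 3 only, using just the Python standard library: the helper names, the stopword list, and the sample text below are all hypothetical, and simple word frequency stands in for NER, keyword extraction, or topic modeling. Question generation, paraphrasing, and validation (steps 4 to 6) are not covered.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "for",
             "on", "it", "by", "with", "that", "this", "as", "at", "into"}


def split_sentences(text):
    """Step 1: naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokens(sentence):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in STOPWORDS]


def top_keywords(sentences, k=3):
    """Step 2 stand-in: the k tokens that occur in the most sentences."""
    counts = Counter()
    for sentence in sentences:
        # dict.fromkeys de-duplicates while keeping first-seen order.
        for word in dict.fromkeys(tokens(sentence)):
            counts[word] += 1
    return [word for word, _ in counts.most_common(k)]


def group_by_keyword(sentences, keywords):
    """Step 3: collect the sentences mentioning each keyword, wherever they occur."""
    groups = defaultdict(list)
    for sentence in sentences:
        present = set(tokens(sentence))
        for keyword in keywords:
            if keyword in present:
                groups[keyword].append(sentence)
    return dict(groups)


# Made-up sample text for illustration.
text = (
    "Solar panels convert sunlight into electricity. "
    "Wind turbines need steady wind. "
    "Dust reduces the output of solar panels. "
    "Cleaning solar panels restores output."
)

sentences = split_sentences(text)
keywords = top_keywords(sentences, k=2)
groups = group_by_keyword(sentences, keywords)
for keyword, members in groups.items():
    print(keyword, "->", members)
```

In this toy input, the sentences grouped under a keyword are not contiguous in the source text, which matches the issue's requirement that relevant content be drawn from different parts of the document rather than from a single chunk.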

I would like to discuss this further and hear your opinions.