aws-samples / Serverless-Retrieval-Augmented-Generation-RAG-on-AWS

A full-stack serverless RAG workflow, intended for running PoCs, prototypes, and bootstrapping your MVP.
MIT No Attribution

Support CSV and XLSX files #30

Open giusedroid opened 2 months ago

giusedroid commented 2 months ago

Basically, implement this as an S3 plug-in: https://blog.lancedb.com/chat-with-csv-excel-using-lancedb/

luke-b commented 9 hours ago

Hey Giuseppe! I’ve had a look at the issue regarding CSV/XLSX support and the general setup. It seems like the current focus is on non-functional aspects like queue setup, event handling, and processing architecture. I wanted to raise a few points related to the functional aspects of how the data is processed, enriched, and embedded once it's ingested. I've summarized these questions below to get your input on some key areas that could help ensure the system handles CSV/XLSX data optimally, especially for the RAG use case.


1. Data Processing Workflow
2. Context Enrichment for CSV/XLSX Data
3. Chunking Strategy for CSV/XLSX Data
4. Embedding Strategy for Structured Data
5. Incorporating Exploratory Data Analysis (EDA)
6. Knowledge Graph Integration
7. Temporal Contextualization
8. User Interface Considerations

Let me know what you think or if anything stands out for further discussion!

Best, Lukas

luke-b commented 5 hours ago

Proposal:

Tailored Questionnaire-Based Context Enrichment for Tabular Data

Concept Overview

The idea is to integrate a tailored questionnaire into the ingestion workflow for tabular data (such as CSV/XLSX). This process would merge automatic Exploratory Data Analysis (EDA) and LLM-generated assumptions with user-driven insights to create a semantically rich dataset. By generating a questionnaire based on assumptions the LLM makes about the data (e.g., field purposes, relationships), we enable users (typically the data owners) to provide additional contextual information that can't be easily inferred automatically.

This allows for deeper context enrichment and makes the data more meaningful for subsequent processing, embedding, and query generation in Retrieval-Augmented Generation (RAG) systems.

Workflow Summary

  1. Automatic Data Ingestion and EDA:

    • After uploading a CSV/XLSX file, the system performs EDA to extract statistical insights (e.g., distributions, correlations) and high-level field analysis (a rough ingestion sketch follows this list).
  2. LLM-Assisted Data Understanding:

    • The LLM analyzes the fields and values to form assumptions about the purpose and relationships of different fields.
    • Based on this analysis, the LLM generates a set of questions to clarify ambiguities or confirm its assumptions about the dataset.
  3. Tailored Questionnaire Creation:

    • A custom questionnaire is generated that asks the data owner to provide further insight into:
      • Field Purposes: What does each field represent?
      • Field Relationships: Are there dependencies, causal relationships, or other connections between fields?
      • Domain-Specific Knowledge: Are there any domain-specific meanings or interpretations that would not be obvious from the data alone?
  4. User Input and Data Refinement:

    • The user fills out the questionnaire, and their responses are used to refine the dataset with additional contextual metadata that can be stored alongside the data for use during query processing.
  5. Context-Enriched RAG Queries:

    • This enhanced context is utilized to improve the accuracy and relevance of the RAG system's responses, particularly for domain-specific queries.
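
To make steps 1–3 concrete, here's a rough sketch of how the ingestion side could be wired up if CSV/XLSX files arrive through the existing S3 upload path. The bucket wiring, key conventions, and profiling choices are my assumptions for illustration, not the repo's current implementation:

```python
# Hypothetical sketch: S3-triggered entry point for tabular ingestion (steps 1-3).
# Bucket wiring, key conventions, and helper behaviour are assumptions.
import io
import json

import boto3
import pandas as pd  # reading .xlsx additionally requires openpyxl

s3 = boto3.client("s3")


def profile(df: pd.DataFrame) -> dict:
    """Step 1: a very small automatic EDA pass over the uploaded table."""
    return {
        "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "row_count": len(df),
        "numeric_summary": df.describe().to_dict(),
    }


def handler(event, context):
    """Triggered by an S3 ObjectCreated event for a CSV/XLSX upload."""
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = (
        pd.read_csv(io.BytesIO(body))
        if key.lower().endswith(".csv")
        else pd.read_excel(io.BytesIO(body))
    )

    eda_report = profile(df)
    # Steps 2-3 (LLM assumptions + questionnaire generation) would run here;
    # persisting the EDA report next to the upload lets them pick it up later.
    s3.put_object(
        Bucket=bucket,
        Key=f"{key}.eda.json",
        Body=json.dumps(eda_report, default=str).encode("utf-8"),
    )
```

Writing the EDA report as a sidecar object keeps step 2 (the LLM assumptions) decoupled from the upload event itself.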

Example Workflow

Let’s say the data owner uploads a CSV file containing business sales data:

| Customer ID | Sales Date | Revenue | Product Code | Region |
|-------------|------------|---------|--------------|--------|
| 1001        | 2023-08-01 | 999     | A123         | North  |
| 1002        | 2023-08-02 | 799     | B456         | South  |
Step 1: EDA and LLM Analysis
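
A hedged sketch of the LLM-analysis half of this step: turn column summaries of the sample table into a prompt asking the model for per-field assumptions as JSON. The prompt wording and response schema are assumptions, and the actual model call would go through whatever client the stack already uses (e.g. Bedrock):

```python
# Hypothetical sketch: build an assumption-extraction prompt from the EDA output.
import json

import pandas as pd

df = pd.DataFrame(
    {
        "Customer ID": [1001, 1002],
        "Sales Date": ["2023-08-01", "2023-08-02"],
        "Revenue": [999, 799],
        "Product Code": ["A123", "B456"],
        "Region": ["North", "South"],
    }
)


def assumption_prompt(df: pd.DataFrame) -> str:
    """Summarise each column and ask the LLM for its assumptions about it."""
    summary = {
        col: {
            "dtype": str(df[col].dtype),
            "sample_values": df[col].head(3).tolist(),
            "unique_count": int(df[col].nunique()),
        }
        for col in df.columns
    }
    return (
        "You are analysing a tabular dataset for a RAG system.\n"
        f"Column summary:\n{json.dumps(summary, indent=2, default=str)}\n\n"
        "For every column, return JSON with your assumed purpose, likely "
        "relationships to other columns, and any ambiguity a data owner "
        "should clarify."
    )


# The prompt would then be sent to the model; its response feeds step 2.
print(assumption_prompt(df))
```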
Step 2: Tailored Questionnaire Example

Based on the analysis, the system generates the following tailored questions for the user:

Q1: What does the field "Revenue" represent?

Q2: Does "Sales Date" impact "Revenue"?
(E.g., are there seasonal patterns or correlations?)

Q3: What does "Product Code" represent? (Choose all that apply)

Q4: What does the field "Region" represent?
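
The questions above could be represented in a structured form so the front end can render them and the answers map back to fields. Question types and options below are illustrative assumptions, not generated output:

```python
# Hypothetical sketch: structured questionnaire derived from the LLM's assumptions.
QUESTIONNAIRE = [
    {
        "id": "Q1",
        "field": "Revenue",
        "question": 'What does the field "Revenue" represent?',
        "type": "free_text",
    },
    {
        "id": "Q2",
        "field": "Sales Date",
        "question": 'Does "Sales Date" impact "Revenue" (e.g. seasonal patterns)?',
        "type": "single_choice",
        "options": ["Yes", "No", "Unsure"],
    },
    {
        "id": "Q3",
        "field": "Product Code",
        "question": 'What does "Product Code" represent? (Choose all that apply)',
        "type": "multi_choice",
        "options": ["SKU", "Product family", "Internal reference", "Other"],
    },
    {
        "id": "Q4",
        "field": "Region",
        "question": 'What does the field "Region" represent?',
        "type": "free_text",
    },
]
```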

Step 3: User Input and Metadata Generation

The data owner fills out the questionnaire:

The responses are used to generate metadata for each field, which is stored alongside the data. This metadata includes the purpose of fields, relationships between fields, and domain-specific knowledge.
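
A minimal sketch of what the resulting per-field metadata record might look like, with placeholder values standing in for the owner's actual answers:

```python
# Hypothetical sketch of per-field metadata built from the questionnaire answers.
# The values shown are placeholders, not real responses.
FIELD_METADATA = {
    "Revenue": {
        "purpose": "<answer to Q1, e.g. transaction amount per sale>",
        "relationships": ["Sales Date", "Product Code"],
        "domain_notes": "<any domain-specific interpretation from the owner>",
    },
    "Sales Date": {
        "purpose": "<date the sale was recorded>",
        "relationships": ["Revenue"],
        "domain_notes": "<e.g. seasonality noted in the answer to Q2>",
    },
}
# Stored alongside the dataset (e.g. as a sidecar JSON object) so the query
# path can load it together with the retrieved rows.
```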

Step 4: Improved RAG Queries

With this enhanced context, the RAG system is better equipped to handle queries. For example:
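
As a hypothetical illustration (the prompt layout, helper name, and sample query are assumptions, not the repo's current code), the stored field metadata could be folded into the prompt at query time:

```python
# Hypothetical sketch: inject the owner-provided field metadata into the prompt.
import json


def build_prompt(question: str, retrieved_rows: list, field_metadata: dict) -> str:
    """Prepend field descriptions to the retrieved rows before answering."""
    return (
        "Field descriptions provided by the data owner:\n"
        f"{json.dumps(field_metadata, indent=2)}\n\n"
        "Relevant rows retrieved from the dataset:\n"
        f"{json.dumps(retrieved_rows, indent=2)}\n\n"
        f"Question: {question}\n"
        "Answer using the field descriptions to interpret the columns correctly."
    )


prompt = build_prompt(
    question="Which region generated the highest revenue in August 2023?",
    retrieved_rows=[
        {"Customer ID": 1001, "Sales Date": "2023-08-01", "Revenue": 999,
         "Product Code": "A123", "Region": "North"},
    ],
    field_metadata={"Revenue": {"purpose": "transaction amount per sale"}},
)
```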


Key Benefits

  1. Combines Automatic and Human Insights:

    • By merging automatic EDA and LLM assumptions with the data owner’s domain-specific knowledge, we create a semantically richer dataset that is more accurate and contextual for retrieval.
  2. User-Friendly Data Annotation:

    • The tailored questionnaire makes it easy for data owners to provide additional information, even if they are not technically proficient, leading to better data quality and understanding.
  3. Enhanced RAG Accuracy:

    • Contextual metadata derived from user input leads to more relevant and accurate responses in the RAG system, especially for complex queries that rely on domain knowledge.
  4. Scalability for Various Domains:

    • This approach can be applied to business data, research data, financial reports, and more, providing a flexible way to enrich any tabular dataset.

Next Steps for Implementation

  1. LLM-Assisted Data Exploration:

    • After data ingestion, the LLM will analyze field names and values, generating a list of assumptions and hypotheses about the dataset.
  2. Automatic EDA and Tailored Questionnaire Generation:

    • The system will generate a custom questionnaire that data owners can easily fill out to provide additional insights about the fields and their relationships.
  3. Metadata Storage and Retrieval:

    • User responses will be converted into metadata and stored alongside the dataset for use during query processing, improving the context enrichment of the data (see the storage sketch after this list).
  4. Query Enhancement:

    • This metadata will be used to improve query responses in the RAG system by providing more accurate and semantically relevant answers.
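
A minimal sketch of the sidecar-storage idea from step 3, assuming the metadata lives next to the source object in the same bucket (the key convention is an assumption):

```python
# Hypothetical sketch: persist questionnaire answers as a sidecar JSON object
# next to the source file, and load it during query processing.
import json

import boto3

s3 = boto3.client("s3")


def save_field_metadata(bucket: str, dataset_key: str, metadata: dict) -> None:
    """Write the field metadata next to the dataset it describes."""
    s3.put_object(
        Bucket=bucket,
        Key=f"{dataset_key}.metadata.json",
        Body=json.dumps(metadata).encode("utf-8"),
        ContentType="application/json",
    )


def load_field_metadata(bucket: str, dataset_key: str) -> dict:
    """Fetch the metadata at query time so it can enrich the prompt."""
    obj = s3.get_object(Bucket=bucket, Key=f"{dataset_key}.metadata.json")
    return json.loads(obj["Body"].read())
```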

(This tailored questionnaire approach offers a user-friendly and scalable solution to augment the context of tabular data, leveraging both automatic analysis and human insights.)