giusedroid opened 2 months ago
Hey Giuseppe! I’ve had a look at the issue regarding CSV/XLSX support and the general setup. It seems like the current focus is on non-functional aspects like queue setup, event handling, and processing architecture. I wanted to raise a few points related to the functional aspects of how the data is processed, enriched, and embedded once it's ingested. I've summarized these questions below to get your input on some key areas that could help ensure the system handles CSV/XLSX data optimally, especially for the RAG use case.
Should the Lambda function handle only file ingestion from S3, or would you like it to also perform the core data processing (such as context enrichment and embedding)?
Comment: Depending on the scope of the Lambda function, we may need to decide if the actual CSV/XLSX processing (parsing, chunking, etc.) will happen inside the Lambda or if it should trigger another service for deeper processing.
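To make the first option concrete, here is a minimal sketch of an ingestion-only handler: it just extracts each uploaded object's location from the S3 event and returns it for hand-off. The hand-off target (e.g. an SQS queue) and the message shape are assumptions, not decisions.

```python
import urllib.parse

def lambda_handler(event, context):
    """Ingestion-only handler: pull each uploaded object's location
    out of the S3 event and forward it for deeper processing.
    (The forwarding target, e.g. an SQS queue, is an assumption.)"""
    forwarded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys are URL-encoded ("my file.csv" -> "my+file.csv")
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        forwarded.append({"bucket": bucket, "key": key})
    # A real handler would now call e.g. sqs.send_message(...) per object
    # instead of doing the parsing/chunking/embedding itself.
    return {"forwarded": forwarded}
```

Keeping the Lambda this thin makes it easy to retry and scale independently of the heavier processing.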
Would you prefer to enrich the CSV/XLSX data by appending column headers to each row (to improve the model’s understanding of relationships between values)?
Sample Before Context Enrichment:
```json
[
  ["001", "Laptop", "Electronics", "999", "15"]
]
```
Sample After Context Enrichment:
```json
[
  {
    "Product ID": "001",
    "Name": "Laptop",
    "Category": "Electronics",
    "Price": "999",
    "Quantity": "15"
  }
]
```
Comment: Enriching the data helps the system understand the relationship between values like Price and Quantity, ensuring better query results.
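As a sketch of this enrichment step: Python's stdlib `csv.DictReader` already does exactly this pairing of headers with row values, so the core of it could be as small as:

```python
import csv
import io

def enrich_rows(csv_text: str) -> list[dict]:
    """Attach column headers to each row so downstream chunking and
    embedding see key/value pairs instead of bare positional values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

# Each row becomes {"Product ID": "001", "Name": "Laptop", ...}
```

For XLSX the same shape could be produced with a library such as openpyxl, reading the first row as headers.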
How would you like to handle chunking for CSV/XLSX files?
Sample of Fixed-Size Chunking:
```json
[
  ["001", "Laptop", "Electronics", "999", "15"],
  ["002", "Smartphone", "Electronics", "799", "25"]
]
```
Sample of Semantic Chunking (by category):
```json
[
  {
    "Category": "Electronics",
    "Products": [
      {"Product ID": "001", "Name": "Laptop", "Price": "999", "Quantity": "15"},
      {"Product ID": "002", "Name": "Smartphone", "Price": "799", "Quantity": "25"}
    ]
  }
]
```
Comment: Fixed-size chunking is simpler but might split important information across chunks. Semantic chunking groups logically related rows, which could improve retrieval precision.
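Both strategies are small to sketch. The `Category` key used for semantic grouping below is just the example column from the samples above; any column (or a computed label) could serve as the grouping key:

```python
from itertools import groupby
from operator import itemgetter

def fixed_size_chunks(rows, size):
    """Fixed-size chunking: every `size` rows form one chunk,
    regardless of what the rows contain."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def semantic_chunks(rows, key="Category"):
    """Semantic chunking: rows sharing the same value for `key`
    end up in one chunk together."""
    rows = sorted(rows, key=itemgetter(key))
    return [
        {key: value, "Products": [
            {k: v for k, v in row.items() if k != key} for row in group
        ]}
        for value, group in groupby(rows, key=itemgetter(key))
    ]
```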
Would you prefer to implement multi-vector embeddings that separately handle columns (e.g., embedding column headers and values independently)?
Sample of Single Embedding:
```json
{
  "Embedding": [0.1, 0.5, 0.2, ...]
}
```
Sample of Multi-Vector Embedding:
```json
{
  "Header Embedding": [0.3, 0.7, ...],
  "Value Embedding": [0.5, 0.2, ...]
}
```
Comment: Multi-vector embeddings can enhance how the system understands the relationship between different values and headers, leading to improved results in user queries. This approach might be more computationally expensive but could yield better-quality outputs.
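A rough sketch of the split, where `embed` is a stand-in for whatever embedding model ends up being chosen; the toy implementation below exists only to make the snippet runnable and is not a real embedding:

```python
def embed(text: str) -> list[float]:
    """Placeholder for a real embedding model call (an actual model
    would return a much longer, semantically meaningful vector)."""
    return [sum(ord(c) for c in text) % 97 / 97.0]

def multi_vector_embed(row: dict) -> dict:
    """Embed column headers and cell values as separate vectors so the
    two signals can be weighted independently at retrieval time."""
    return {
        "Header Embedding": embed(" ".join(row.keys())),
        "Value Embedding": embed(" ".join(row.values())),
    }
```

The trade-off mentioned above shows up directly here: two model calls per row instead of one.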
Would you like to include metadata from exploratory data analysis (e.g., averages, standard deviations) as part of the enriched data to enhance model output?
Sample Before EDA:
```json
{
  "Product ID": "001",
  "Name": "Laptop",
  "Price": "999"
}
```
Sample After EDA:
```json
{
  "Product ID": "001",
  "Name": "Laptop",
  "Price": "999",
  "EDA_Metadata": {
    "Mean_Price": 850,
    "Standard_Deviation": 100
  }
}
```
Comment: Adding EDA metadata can help the model understand broader trends and anomalies in the dataset, potentially leading to more accurate answers, especially for comparative or trend-based queries.
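Computing that metadata is cheap with the stdlib `statistics` module; the field names below mirror the sample above and are otherwise arbitrary:

```python
from statistics import mean, pstdev

def add_eda_metadata(rows, column):
    """Attach dataset-level statistics for one numeric column to every
    row, so each chunk carries its own summary context."""
    values = [float(r[column]) for r in rows]
    meta = {
        "Mean_" + column: mean(values),
        "Standard_Deviation": pstdev(values),
    }
    return [{**r, "EDA_Metadata": meta} for r in rows]
```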
Would you prefer to integrate external knowledge (e.g., knowledge graphs) into the enrichment process to provide additional context (e.g., global rankings, manufacturer details)?
Sample Before Knowledge Graph Integration:
```json
{
  "Product ID": "001",
  "Name": "Laptop"
}
```
Sample After Knowledge Graph Integration:
```json
{
  "Product ID": "001",
  "Name": {
    "value": "Laptop",
    "Knowledge_Graph_Metadata": {
      "Manufacturer": "XYZ Corp",
      "Global_Ranking": "Top 10"
    }
  }
}
```
Comment: Knowledge graph integration could significantly enhance the model’s ability to provide richer, more informed responses to queries, especially when querying external relationships (e.g., "Who manufactures this product?").
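A sketch of that enrichment, with a plain dict standing in for a real knowledge-graph query (all names and fields are illustrative, matching the sample above):

```python
# A plain dict stands in for a real knowledge-graph lookup;
# the entries are illustrative only.
KNOWLEDGE_GRAPH = {
    "Laptop": {"Manufacturer": "XYZ Corp", "Global_Ranking": "Top 10"},
}

def enrich_with_kg(row, field="Name"):
    """Wrap one field's value with whatever external metadata the
    graph has for it; rows without a match pass through unchanged."""
    meta = KNOWLEDGE_GRAPH.get(row.get(field))
    if meta is None:
        return row
    return {**row, field: {"value": row[field], "Knowledge_Graph_Metadata": meta}}
```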
If your data includes dates (e.g., sales data), should we enrich it with temporal trends (e.g., monthly growth, past performance)?
Sample Before Temporal Contextualization:
```json
{
  "Date": "2023-08-01",
  "Sales": "50"
}
```
Sample After Temporal Contextualization:
```json
{
  "Date": "2023-08-01",
  "Sales": "50",
  "Temporal_Metadata": {
    "Previous_Month_Sales": "40",
    "Monthly_Growth": "+25%"
  }
}
```
Comment: Adding temporal metadata helps in answering time-based queries, such as sales trends over time or product performance comparisons across different periods.
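Assuming rows arrive already sorted by date, the temporal annotation could be sketched as:

```python
def add_temporal_metadata(monthly_rows):
    """Annotate each row with the previous month's sales and the
    month-over-month growth; assumes rows are sorted chronologically."""
    out = []
    for i, row in enumerate(monthly_rows):
        enriched = dict(row)
        if i > 0:
            prev = float(monthly_rows[i - 1]["Sales"])
            growth = (float(row["Sales"]) - prev) / prev * 100
            enriched["Temporal_Metadata"] = {
                "Previous_Month_Sales": monthly_rows[i - 1]["Sales"],
                "Monthly_Growth": f"{growth:+.0f}%",
            }
        out.append(enriched)
    return out
```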
Would you prefer a separate UI flow for CSV/XLSX uploads, or should we integrate this functionality into the existing flow?
Comment: Depending on the complexity of the file handling process and user expectations, a separate UI flow might improve the user experience, particularly if the system will handle CSV/XLSX files differently from other file types.
Let me know what you think or if anything stands out for further discussion!
Best, Lukas
The idea is to integrate a tailored questionnaire into the ingestion workflow for tabular data (such as CSV/XLSX). This process would merge automatic Exploratory Data Analysis (EDA) and LLM-generated assumptions with user-driven insights to create a semantically rich dataset. By generating a questionnaire based on assumptions the LLM makes about the data (e.g., field purposes, relationships), we enable users (typically the data owners) to provide additional contextual information that can't be easily inferred automatically.
This allows for deeper context enrichment and makes the data more meaningful for subsequent processing, embedding, and query generation in Retrieval-Augmented Generation (RAG) systems.
1. Automatic Data Ingestion and EDA
2. LLM-Assisted Data Understanding
3. Tailored Questionnaire Creation
4. User Input and Data Refinement
5. Context-Enriched RAG Queries
Let’s say the data owner uploads a CSV file containing business sales data:
| Customer ID | Sales Date | Revenue | Product Code | Region |
|---|---|---|---|---|
| 1001 | 2023-08-01 | 999 | A123 | North |
| 1002 | 2023-08-02 | 799 | B456 | South |
EDA detects:
LLM Analysis makes assumptions:
Based on the analysis, the system generates the following tailored questions for the user:
Q1: What does the field "Revenue" represent?
Q2: Does "Sales Date" impact "Revenue"?
(E.g., are there seasonal patterns or correlations?)
Q3: What does "Product Code" represent? (Choose all that apply)
Q4: What does the field "Region" represent?
The data owner fills out the questionnaire:
The responses are used to generate metadata for each field, which is stored alongside the data. This metadata includes the purpose of fields, relationships between fields, and domain-specific knowledge.
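The metadata shape is still open; one minimal sketch of turning questionnaire answers into per-field metadata (the field names and schema keys here are illustrative, not fixed):

```python
def build_field_metadata(answers):
    """Turn questionnaire answers into per-field metadata that can be
    stored alongside the data (the schema here is illustrative)."""
    return {
        field: {
            "purpose": a.get("purpose"),
            "related_fields": a.get("related_fields", []),
            "domain_notes": a.get("notes"),
        }
        for field, a in answers.items()
    }
```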
With this enhanced context, the RAG system is better equipped to handle queries. For example:
Query 1: "What was the total revenue in the North region in August 2023?"
Query 2: "Which product line contributed the most to revenue?"
- Combines Automatic and Human Insights
- User-Friendly Data Annotation
- Enhanced RAG Accuracy
- Scalability for Various Domains
- LLM-Assisted Data Exploration
- Automatic EDA and Tailored Questionnaire Generation
- Metadata Storage and Retrieval
- Query Enhancement
(This tailored questionnaire approach offers a user-friendly and scalable solution to augment the context of tabular data, leveraging both automatic analysis and human insights.)
Basically, implement this as an S3 plug-in (see https://blog.lancedb.com/chat-with-csv-excel-using-lancedb/) that reads `.csv` and `.xlsx` files from S3.