data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business users, analysts, and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Natural Language Querying (NLQ) using genAI #1659

Open noah-paige opened 2 hours ago

noah-paige commented 2 hours ago

Problem statement

Data.all currently requires users to have technical knowledge of data.all datasets, Glue tables, schemas, S3 buckets, folders, and SQL querying in order to access and derive insights from the diverse structured and unstructured datasets available across the organization. This creates a significant barrier for non-technical business users who need to quickly and easily query data to make informed decisions. The problem is the lack of an intuitive, natural-language interface that lets these users ask questions in plain English and receive relevant, contextual data responses without requiring SQL expertise or specialized data-extraction skills.

Generative AI models offer a promising solution to bridge this gap by enabling natural language querying capabilities that understand user intent, extract data from both structured and unstructured sources, and generate dynamic responses tailored to the user's needs. This feature aims to empower non-technical users, such as business analysts and executive decision makers, to query and analyze structured and unstructured data in natural language, leveraging Generative AI (GenAI) capabilities to improve data accessibility and data-driven decision-making within data.all.

User Stories

Describe the solution you'd like

US1.

As a Data Consumer, whether a non-technical business user, business analyst, or executive decision maker, I want to be able to query structured data in data.all using natural language, so that I can quickly find and retrieve the insights I need for my applications and decision-making processes.

Acceptance Criteria:

US2.

As a Data Consumer, whether a non-technical business user, business analyst, or executive decision maker, I want to be able to query unstructured data sources in data.all using natural language, so that I can quickly find and retrieve the insights I need for my applications and decision-making processes.

Acceptance Criteria:

US3.

As a data.all developer and maintainer, I want the natural language query feature to be secure and respect data governance access permissions.

Acceptance Criteria:

US4.

As a data.all developer and maintainer, I want the natural language query feature to be configurable, scalable, reliable, and seamlessly integrated into the data.all platform, so that I can ensure a smooth and efficient user experience for all data.all users.

Acceptance Criteria:

US5.

As a data.all developer and maintainer, I want to be able to configure rate limits for the natural language query feature so that I can prevent overuse and ensure responsible access to the feature.

Acceptance Criteria:

US6.

As a data.all developer and maintainer, I want the natural language query feature to clearly display a disclaimer about the limitations and confidentiality of the responses, so that users understand the context and boundaries of the AI-generated information.

Acceptance Criteria:

US7. (Future Scope)

As a data.all developer and maintainer, I want the natural language query feature to provide feedback functionality so that users can easily indicate if the response was helpful or not, which can then be used to improve the quality of future responses.

Acceptance Criteria:

Scope

Structured Data Query (SQL Generation & Execution)

Out Of Scope

Guardrails

noah-paige commented 2 hours ago

Design

User Experience

Text To SQL

Document Analysis (Unstructured Use Case)

There will be a limit on the number of Generate Metadata API calls performed per day, or per team per day. If the limit is exceeded, a comprehensive error message will appear in the top banner.
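A per-team daily limit could be enforced with a counter keyed by team and date. Below is a minimal sketch of that check; the class and method names are hypothetical, and data.all would persist the counts in RDS rather than in memory:

```python
from collections import defaultdict
from datetime import date


class DailyRateLimiter:
    """Tracks GenAI API invocations per team per day (illustrative only;
    a real implementation would store counts in RDS, not in memory)."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self._counts: dict = defaultdict(int)

    def try_invoke(self, team: str) -> bool:
        """Return True and count the call if the team is under its limit;
        return False so the caller can surface the top-banner error."""
        key = (team, date.today())
        if self._counts[key] >= self.daily_limit:
            return False
        self._counts[key] += 1
        return True
```

Keying on `(team, date)` means counters reset naturally at midnight without a cleanup job, at the cost of stale keys accumulating until evicted.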

Data analysis

For this use case it is relevant to describe the different types of data and metadata that would serve as input to the generation of metadata. Depending on the data, there will be different genAI workflows.

Data.all S3 Datasets: (S3 Bucket + Glue database)

Data.all Redshift Datasets [v2.7.0]: We need to keep these in mind for the design, but the feature won't implement metadata generation for Redshift in its first release.

Data scenarios

For column metadata generation (column name and column description):

| Scenario | Input data for genAI | Comments |
| --- | --- | --- |
| Glue tables with meaningful column names and descriptions | Use the column description to verify that the name is good, and vice versa | |
| Glue tables with no column descriptions and cryptic names | Random selection of 100 items of the table (like current preview) + metadata in RDS | |
For Table and Folder metadata generation:

| Scenario | Input data for genAI | Comments |
| --- | --- | --- |
| Tables with meaningful metadata | Metadata in RDS | |
| Tables with poor metadata | Select randomized items of the table (like current preview) + metadata in RDS | |
| Folders containing files | Read file names and extensions to produce a summary | |
For Dataset metadata generation:

| Scenario | Input data for genAI | Comments |
| --- | --- | --- |
| Folders and Tables with meaningful metadata | Summary of table and folder descriptions | |
| Folders and Tables with poor metadata | Generate metadata for tables and folders, then generate metadata for the Dataset | |
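The scenario tables above amount to a simple branch on metadata quality: always pass what RDS already holds, and additionally sample rows when descriptions are missing. A minimal sketch of that selection logic, with illustrative source names that are not data.all APIs:

```python
def metadata_generation_inputs(has_meaningful_descriptions: bool) -> list:
    """Pick the genAI input sources for a Glue table, following the
    scenario tables (source names are hypothetical placeholders)."""
    inputs = ["rds_metadata"]  # metadata stored in RDS is always used
    if not has_meaningful_descriptions:
        # Poor metadata: also feed a random sample of table rows,
        # analogous to the current table-preview behavior.
        inputs.append("random_row_sample")
    return inputs
```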

High Level Design

Text To SQL

*(Diagram: DA_NLQ_Arch-TextToSQL)*

  1. The user provides an input prompt along with a selection of one or more tables from data.all datasets to which they already have access
  2. On button click, the user-provided prompt + table identifiers are sent as a new GQL mutation operation to API Gateway
  3. The API Handler receives the GQL mutation request and executes the respective resolver in Lambda to start orchestrating the processing tasks
  4. FOR EACH TABLE, the API Handler Lambda:
    1. Validates that the user has access to the selected table
    2. Assumes the user's environment IAM role and gets the table metadata
  5. The API Handler Lambda invokes the Bedrock LLM, passing along the user input and the Glue metadata for each respective table as context, and waits for the generated SQL output
  6. The Lambda records the invocation and/or associated cost in RDS
  7. The generated SQL query is presented in the UI for the user to run ad hoc as needed and/or save for later analysis
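The prompt-assembly half of step 5 can be sketched without AWS dependencies. In this minimal sketch, the table/column schema shape and the `invoke_llm` callable are assumptions; in the real flow `invoke_llm` would be a Bedrock runtime call (e.g. `converse` or `invoke_model` on a `bedrock-runtime` client):

```python
def build_text_to_sql_prompt(user_question: str, tables: list) -> str:
    """Assemble the LLM context from the user question plus Glue table
    metadata (field names here are illustrative, not the Glue API shape)."""
    schema_lines = []
    for t in tables:
        cols = ", ".join(f"{c['name']} {c['type']}" for c in t["columns"])
        schema_lines.append(f"Table {t['database']}.{t['name']} ({cols})")
    return (
        "Generate an Athena-compatible SQL query for the question below.\n"
        "Schemas:\n" + "\n".join(schema_lines) + "\n"
        f"Question: {user_question}\n"
        "Return only SQL."
    )


def generate_sql(user_question: str, tables: list, invoke_llm) -> str:
    # invoke_llm stands in for the Bedrock invocation; injecting it keeps
    # the orchestration testable without AWS credentials.
    return invoke_llm(build_text_to_sql_prompt(user_question, tables))
```

Injecting the model call also makes it easy to swap models or add guardrail wrapping later without touching the resolver logic.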

Document Analyzer

*(Diagram: DA_NLQ_Arch-UnstructuredQuery)*

  1. The user provides an input prompt along with a selected file in S3 from a data.all dataset to which they already have access
  2. On button click, the user-provided prompt + S3 path of the file + dataset identifier + access pattern (i.e. S3 share, owner, folder share) are sent as a new GQL mutation operation to API Gateway
  3. The API Handler receives the GQL mutation request and executes the respective resolver in Lambda to start orchestrating the processing tasks
  4. FOR THE SELECTED FILE, the API Handler Lambda:
    1. Validates that the user has access to the selected file (either as the dataset owner, or via an S3 bucket share or folder share)
    2. Assumes the environment IAM role and downloads the file from S3
  5. The API Handler Lambda invokes the Bedrock LLM, passing along the user input and the downloaded file as context, and waits for a response
  6. After a response is generated, the Lambda records the invocation and/or associated cost in RDS
  7. Finally, the generated response is presented to the user
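The access check and invocation order of the steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: the access-pattern names, and the `download` and `invoke_llm` callables (stand-ins for the S3 GetObject call under the environment role and the Bedrock invocation), are all hypothetical:

```python
def analyze_document(prompt: str, s3_path: str, access_patterns: set,
                     download, invoke_llm) -> str:
    """Sketch of the Document Analyzer flow: validate access, fetch the
    file, then ask the LLM with the file as context."""
    # Step 4.1: the user must reach the file as owner, or via a bucket
    # or folder share (pattern names are illustrative).
    if not access_patterns & {"owner", "bucket_share", "folder_share"}:
        raise PermissionError(f"no access to {s3_path}")
    document = download(s3_path)  # step 4.2, via the environment IAM role
    # Steps 5-6: pass user input + file contents as LLM context.
    return invoke_llm(f"{prompt}\n---\n{document}")
```

Raising before the download keeps the governance check (US3) ahead of any data movement, so an unauthorized request never touches S3 or Bedrock.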