data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business users, analysts, and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Natural Language Querying (NLQ) using genAI #1659

Open noah-paige opened 2 hours ago

noah-paige commented 2 hours ago

Problem statement

Data.all currently requires users to have technical knowledge of data.all datasets, Glue tables, schemas, S3 buckets, folders, and SQL querying in order to access and derive insights from the diverse structured and unstructured datasets available across the organization. This creates a significant barrier for non-technical business users who need to quickly and easily query data to make informed decisions. The problem is the lack of an intuitive, natural-language interface that lets these users ask questions in plain English and receive relevant, contextual data responses without requiring SQL expertise or specialized data-extraction skills.

Generative AI models offer a promising solution to bridge this gap by enabling natural language querying capabilities that understand user intent, extract data from both structured and unstructured sources, and generate dynamic responses tailored to the user's needs. This feature aims to empower non-technical users, such as business analysts and executive decision makers, to query and analyze structured and unstructured data in natural language, leveraging Generative AI (GenAI) capabilities to improve data accessibility and data-driven decision-making within data.all.

User Stories

Describe the solution you'd like

US1.

As a Data Consumer, whether a non-technical business user, business analyst, or executive decision maker, I want to be able to query structured data in data.all using natural language, so that I can quickly find and retrieve the insights I need for my applications and decision-making processes.

Acceptance Criteria:

US2.

As a Data Consumer, whether a non-technical business user, business analyst, or executive decision maker, I want to be able to query unstructured data sources in data.all using natural language, so that I can quickly find and retrieve the insights I need for my applications and decision-making processes.

Acceptance Criteria:

US3.

As a data.all developer and maintainer, I want the natural language query feature to be secure and respect data governance access permissions.

Acceptance Criteria:

US4.

As a data.all developer and maintainer, I want the natural language query feature to be configurable, scalable, reliable, and seamlessly integrated into the data.all platform, so that I can ensure a smooth and efficient user experience for all data.all users.

Acceptance Criteria:

US5.

As a data.all developer and maintainer, I want to be able to configure rate limits for the natural language query feature so that I can prevent overuse and ensure responsible access to the feature.

Acceptance Criteria:

US6.

As a data.all developer and maintainer, I want the natural language query feature to clearly display a disclaimer about the limitations and confidentiality of the responses, so that users understand the context and boundaries of the AI-generated information.

Acceptance Criteria:

US7. (Future Scope)

As a data.all developer and maintainer, I want the natural language query feature to provide feedback functionality so that users can easily indicate if the response was helpful or not, which can then be used to improve the quality of future responses.

Acceptance Criteria:

Scope

Structured Data Query (SQL Generation & Execution)

Out Of Scope

Guardrails

noah-paige commented 2 hours ago

Design

User Experience

Text To SQL

Document Analysis (Unstructured Use Case)

There will be a limit on the number of Generate Metadata API calls performed per day, or per team per day. If the limit is exceeded, a comprehensive error message will appear in the top banner.
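A per-team daily limit could be enforced with a counter keyed by team and date. Below is a minimal sketch of that check; the class and method names are hypothetical, and data.all would persist the counts in RDS rather than in memory:

```python
from collections import defaultdict
from datetime import date


class DailyRateLimiter:
    """Tracks GenAI API invocations per team per day (illustrative only;
    a real implementation would store counts in RDS, not in memory)."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self._counts: dict = defaultdict(int)

    def try_invoke(self, team: str) -> bool:
        """Return True and count the call if the team is under its limit;
        return False so the caller can surface the top-banner error."""
        key = (team, date.today())
        if self._counts[key] >= self.daily_limit:
            return False
        self._counts[key] += 1
        return True
```

Keying on `(team, date)` means counters reset naturally at midnight without a cleanup job, at the cost of stale keys accumulating until evicted.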

Data analysis

For this use case it is relevant to describe the different types of data and metadata that would serve as input to the generation of metadata. Depending on the data, there will be different genAI workflows.

Data.all S3 Datasets: (S3 Bucket + Glue database)

Data.all Redshift Datasets [v2.7.0]: We need to keep these in mind for the design, but the feature won't implement metadata generation for Redshift in its first release.

Data scenarios

For column metadata generation (column name and column description):

| Scenario | Input data for genAI | Comments |
| --- | --- | --- |
| Glue tables with meaningful column names and descriptions | Use the column description to verify that the name is good, and vice versa | |
| Glue tables with no column descriptions and cryptic names | Random selection of 100 items of the table (like current preview) + metadata in RDS | |
For Table and Folder metadata generation:

| Scenario | Input data for genAI | Comments |
| --- | --- | --- |
| Tables with meaningful metadata | Metadata in RDS | |
| Tables with poor metadata | Select randomized items of the table (like current preview) + metadata in RDS | |
| Folders containing files | Read file names and extensions to produce a summary | |
For Dataset metadata generation:

| Scenario | Input data for genAI | Comments |
| --- | --- | --- |
| Folders and Tables with meaningful metadata | Summary of table and folder descriptions | |
| Folders and Tables with poor metadata | Generate metadata for tables and folders, then generate metadata for the Dataset | |
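The scenario tables above amount to a simple branch on metadata quality: always pass what RDS already holds, and additionally sample rows when descriptions are missing. A minimal sketch of that selection logic, with illustrative source names that are not data.all APIs:

```python
def metadata_generation_inputs(has_meaningful_descriptions: bool) -> list:
    """Pick the genAI input sources for a Glue table, following the
    scenario tables (source names are hypothetical placeholders)."""
    inputs = ["rds_metadata"]  # metadata stored in RDS is always used
    if not has_meaningful_descriptions:
        # Poor metadata: also feed a random sample of table rows,
        # analogous to the current table-preview behavior.
        inputs.append("random_row_sample")
    return inputs
```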

High Level Design

Text To SQL

*(Diagram: DA_NLQ_Arch-TextToSQL)*

  1. The user provides an input prompt along with a selection of one or more tables from data.all datasets to which they already have access
  2. On button click, the user-provided prompt + table identifiers are sent as a new GQL mutation operation to API Gateway
  3. The API Handler receives the GQL mutation request and executes the respective resolver in Lambda to start orchestrating the processing tasks
  4. FOR EACH TABLE, the API Handler Lambda:
    1. Validates that the user has access to the selected table
    2. Assumes the user's environment IAM role and gets the table metadata
  5. The API Handler Lambda invokes the Bedrock LLM, passing along the user input and the Glue metadata for each respective table as context, and waits for the generated SQL output
  6. The Lambda records the invocation and/or associated cost in RDS
  7. The generated SQL query is presented in the UI for the user to run ad hoc as needed and/or save for later analysis
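The prompt-assembly half of step 5 can be sketched without AWS dependencies. In this minimal sketch, the table/column schema shape and the `invoke_llm` callable are assumptions; in the real flow `invoke_llm` would be a Bedrock runtime call (e.g. `converse` or `invoke_model` on a `bedrock-runtime` client):

```python
def build_text_to_sql_prompt(user_question: str, tables: list) -> str:
    """Assemble the LLM context from the user question plus Glue table
    metadata (field names here are illustrative, not the Glue API shape)."""
    schema_lines = []
    for t in tables:
        cols = ", ".join(f"{c['name']} {c['type']}" for c in t["columns"])
        schema_lines.append(f"Table {t['database']}.{t['name']} ({cols})")
    return (
        "Generate an Athena-compatible SQL query for the question below.\n"
        "Schemas:\n" + "\n".join(schema_lines) + "\n"
        f"Question: {user_question}\n"
        "Return only SQL."
    )


def generate_sql(user_question: str, tables: list, invoke_llm) -> str:
    # invoke_llm stands in for the Bedrock invocation; injecting it keeps
    # the orchestration testable without AWS credentials.
    return invoke_llm(build_text_to_sql_prompt(user_question, tables))
```

Injecting the model call also makes it easy to swap models or add guardrail wrapping later without touching the resolver logic.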

Document Analyzer

*(Diagram: DA_NLQ_Arch-UnstructuredQuery)*

  1. The user provides an input prompt along with a selected file in S3 from a data.all dataset to which they already have access
  2. On button click, the user-provided prompt + S3 path of the file + dataset identifier + access pattern (i.e. S3 share, owner, folder share) are sent as a new GQL mutation operation to API Gateway
  3. The API Handler receives the GQL mutation request and executes the respective resolver in Lambda to start orchestrating the processing tasks
  4. FOR THE SELECTED FILE, the API Handler Lambda:
    1. Validates that the user has access to the selected file (either as the dataset owner, or via an S3 bucket share or folder share)
    2. Assumes the environment IAM role and downloads the file from S3
  5. The API Handler Lambda invokes the Bedrock LLM, passing along the user input and the downloaded file as context, and waits for a response
  6. After a response is generated, the Lambda records the invocation and/or associated cost in RDS
  7. Finally, the generated response is presented to the user
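The access check and invocation order of the steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: the access-pattern names, and the `download` and `invoke_llm` callables (stand-ins for the S3 GetObject call under the environment role and the Bedrock invocation), are all hypothetical:

```python
def analyze_document(prompt: str, s3_path: str, access_patterns: set,
                     download, invoke_llm) -> str:
    """Sketch of the Document Analyzer flow: validate access, fetch the
    file, then ask the LLM with the file as context."""
    # Step 4.1: the user must reach the file as owner, or via a bucket
    # or folder share (pattern names are illustrative).
    if not access_patterns & {"owner", "bucket_share", "folder_share"}:
        raise PermissionError(f"no access to {s3_path}")
    document = download(s3_path)  # step 4.2, via the environment IAM role
    # Steps 5-6: pass user input + file contents as LLM context.
    return invoke_llm(f"{prompt}\n---\n{document}")
```

Raising before the download keeps the governance check (US3) ahead of any data movement, so an unauthorized request never touches S3 or Bedrock.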