Open noah-paige opened 2 hours ago
The number of Generate Metadata API calls will be limited per day or per day and team. If the limit is exceeded, a descriptive error message will appear in the top banner.
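The daily limit described above could be enforced with a small counter keyed by team and day. This is only a minimal in-memory sketch: `DailyQuota`, `QuotaExceededError`, and `max_calls_per_day` are hypothetical names, not part of data.all, and a real deployment would presumably persist the counts (for example in RDS) rather than keep them in process memory.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class QuotaExceededError(Exception):
    """Raised when a team has used up its daily Generate Metadata calls."""


class DailyQuota:
    """Hypothetical in-memory counter of calls per (team, day)."""

    def __init__(self, max_calls_per_day: int):
        self.max_calls_per_day = max_calls_per_day
        self._counts = defaultdict(int)  # (team, day) -> calls made so far

    def check_and_increment(self, team: str, today: Optional[date] = None) -> int:
        """Record one call and return how many calls the team has left today."""
        day = today or date.today()
        key = (team, day)
        if self._counts[key] >= self.max_calls_per_day:
            # The message is what the UI would surface in the top banner.
            raise QuotaExceededError(
                f"Team '{team}' reached the daily limit of "
                f"{self.max_calls_per_day} Generate Metadata calls. "
                "Try again tomorrow."
            )
        self._counts[key] += 1
        return self.max_calls_per_day - self._counts[key]
```

Keying on the calendar day means the counter resets implicitly at midnight, with no scheduled cleanup required for correctness.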
For this use case, it is useful to describe the types of data and metadata that serve as input to metadata generation. Different kinds of data will require different genAI workflows.
Data.all S3 Datasets: (S3 Bucket + Glue database)
Data.all Redshift Datasets [v.2.7.0]: We need to keep these in mind for the design, but the first release of this feature will not implement metadata generation for Redshift.
Data scenarios
For column metadata generation (column name and column description):

| Scenario | Input data for genAI | Comments |
|---|---|---|
| Glue tables with meaningful column names and descriptions | Use the column description to verify if the name is good and vice versa | |
| Glue tables with no column descriptions and cryptic names | Random selection of 100 items of the table (like current preview) + metadata in RDS | |
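The two column scenarios above amount to a branch on whether usable descriptions exist. A rough sketch of that decision follows, assuming a hypothetical `Column` record and a hypothetical `build_genai_input` helper; the check covers only missing descriptions, since detecting "cryptic" names would need a heuristic the issue does not specify.

```python
import random
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

SAMPLE_SIZE = 100  # matches the "100 items" mentioned in the table above


@dataclass
class Column:
    name: str
    description: Optional[str] = None


def has_descriptions(columns: List[Column]) -> bool:
    """True when every column carries a non-empty description."""
    return bool(columns) and all(
        c.description and c.description.strip() for c in columns
    )


def build_genai_input(
    columns: List[Column],
    rows: List[Dict[str, Any]],
    rds_metadata: Dict[str, Any],
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Pick the input payload for the genAI workflow (hypothetical shape)."""
    if has_descriptions(columns):
        # Scenario 1: cross-check names against descriptions, no data needed.
        return {
            "workflow": "verify_names_and_descriptions",
            "columns": [
                {"name": c.name, "description": c.description} for c in columns
            ],
        }
    # Scenario 2: fall back to a random sample of rows plus stored metadata.
    rng = random.Random(seed)
    sample = rng.sample(rows, min(SAMPLE_SIZE, len(rows)))
    return {
        "workflow": "generate_from_sample",
        "columns": [c.name for c in columns],
        "sample_rows": sample,
        "rds_metadata": rds_metadata,
    }
```

Keeping the sample out of the first branch means well-described tables never send row data to the model.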
For Table and Folder metadata generation:

| Scenario | Input data for genAI | Comments |
|---|---|---|
| Tables with meaningful metadata | Metadata in RDS | |
| Tables with poor metadata | Select randomized items of the table (like current preview) + metadata in RDS | |
| Folders containing files | Read file names and extensions to produce a summary | |
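For the folder scenario, the input can be built from object keys alone, without reading file contents. The sketch below assumes the keys have already been listed (for example from S3); `summarize_folder` and its output shape are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Any, Dict, Iterable


def summarize_folder(keys: Iterable[str], max_examples: int = 5) -> Dict[str, Any]:
    """Summarize a folder by file names and extensions only."""
    # Keys ending in "/" are folder placeholders, not files.
    names = [PurePosixPath(k).name for k in keys if not k.endswith("/")]
    extensions = Counter(
        PurePosixPath(n).suffix.lower() or "(none)" for n in names
    )
    return {
        "file_count": len(names),
        "extensions": dict(extensions),
        "example_names": sorted(names)[:max_examples],
    }
```

A summary like this could be passed to the model in place of the raw listing, which keeps the prompt small for folders with many files.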
For Dataset metadata generation:

| Scenario | Input data for genAI | Comments |
|---|---|---|
| Folders and Tables with meaningful metadata | Summary of table and folder descriptions | |
| Folders and Tables with poor metadata | Generate metadata for tables and folders and then generate metadata for Dataset | |
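The dataset scenarios describe a bottom-up flow: fill in any missing table or folder descriptions first, then summarize them at the dataset level. A minimal sketch, where `generate_child` and `summarize` are hypothetical callables standing in for the genAI calls:

```python
from typing import Callable, Dict, List, Optional


def describe_dataset(
    children: Dict[str, Optional[str]],
    generate_child: Callable[[str], str],
    summarize: Callable[[List[str]], str],
) -> str:
    """Build a dataset description from its tables and folders.

    `children` maps each table or folder name to its existing description,
    or None when the description is missing.
    """
    descriptions = []
    for name, description in children.items():
        if not description or not description.strip():
            # Poor metadata: generate the child description first.
            description = generate_child(name)
        descriptions.append(f"{name}: {description}")
    # Dataset-level metadata is derived from the child descriptions.
    return summarize(descriptions)
```

Passing the model calls in as parameters keeps the ordering logic testable without calling a real model.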
Problem statement
Data.all currently requires users to have technical knowledge of data.all datasets, Glue tables, schemas, S3 buckets, folders, and SQL querying in order to access and derive insights from the structured and unstructured datasets available across the organization. This creates a significant barrier for non-technical business users who need to query data quickly and easily to make informed decisions. There is no intuitive natural language interface that lets these users ask questions in plain English and receive relevant, contextual responses without SQL expertise or specialized data extraction skills.
Generative AI (GenAI) models offer a promising way to bridge this gap by enabling natural language querying that understands user intent, extracts data from both structured and unstructured sources, and generates responses tailored to the user's needs. This feature aims to let non-technical users, such as business analysts and executive decision makers, query and analyze structured and unstructured data in natural language, improving data accessibility and data-driven decision-making within data.all.
User Stories
Describe the solution you'd like
US1.
As a Data Consumer, whether a non-technical business user, business analyst, or executive decision maker, I want to be able to query structured data in data.all using natural language, so that I can quickly find and retrieve the insights I need for my applications and decision-making processes.
Acceptance Criteria:
US2.
As a Data Consumer, whether a non-technical business user, business analyst, or executive decision maker, I want to be able to query unstructured data sources in data.all using natural language, so that I can quickly find and retrieve the insights I need for my applications and decision-making processes.
Acceptance Criteria:
US3.
As a data.all developer and maintainer, I want the natural language query feature to be secure and respect data governance access permissions.
Acceptance Criteria:
US4.
As a data.all developer and maintainer, I want the natural language query feature to be configurable, scalable, reliable, and seamlessly integrated into the data.all platform, so that I can ensure a smooth and efficient user experience for all data.all users.
Acceptance Criteria:
US5.
As a data.all developer and maintainer, I want to be able to configure rate limits for the natural language query feature so that I can prevent overuse and ensure responsible access to the feature.
Acceptance Criteria:
US6.
As a data.all developer and maintainer, I want the natural language query feature to clearly display a disclaimer about the limitations and confidentiality of the responses, so that users understand the context and boundaries of the AI-generated information.
Acceptance Criteria:
US7. (Future Scope)
As a data.all developer and maintainer, I want the natural language query feature to provide feedback functionality so that users can easily indicate if the response was helpful or not, which can then be used to improve the quality of future responses.
Acceptance Criteria:
Scope
Structured Data Query (SQL Generation & Execution)
Unstructured Data Query
Out Of Scope
Guardrails