<div align="center"> 🚨 Version 2.0.0 is now live! Refer to the Release Notes for detailed information on updates and fixes. 🚨
The Open Data QnA Python library enables you to chat with your databases by leveraging LLM Agents on Google Cloud.
Open Data QnA enables a conversational approach to interacting with your data. Ask questions about your PostgreSQL or BigQuery databases in natural language and receive informative responses, without needing to write SQL. Open Data QnA leverages Large Language Models (LLMs) to bridge the gap between human language and database queries, streamlining data analysis and decision-making.
Key Features:
It is built on a modular design and currently supports the following components:
Note: the library was formerly named Talk2Data. You may still find artifacts with the old naming in this repository.
A detailed description of the architecture can be found here in the docs.
Details on the repository structure can be found here in the docs.
git clone git@github.com:GoogleCloudPlatform/Open_Data_QnA.git
cd Open_Data_QnA
Make sure that the Google Cloud CLI and Python are installed before moving ahead! You can refer to the links below for guidance:
Installation Guide: https://cloud.google.com/sdk/docs/install
Download Python: https://www.python.org/downloads/
ℹ️ You can set up this solution with three approaches. Choose one based on your requirements:
Install the dependencies by running the poetry commands below
# Install poetry
pip uninstall poetry -y
pip install poetry --quiet

# Run the poetry commands below to set up the environment
poetry lock       # resolves dependencies (also auto-creates the poetry venv if it does not exist)
poetry install --quiet   # installs dependencies
poetry env info   # displays the venv just created and the path to it
poetry shell      # activates the venv; you should see the shell enter it

## inside the activated venv shell
# If you are running on a Workbench instance where the service account used has the permissions required to run this solution, you can skip the gcloud auth commands below and go straight to the kernel creation section
gcloud auth login # use this or the command below to authenticate
gcloud auth application-default login
gcloud services enable \
serviceusage.googleapis.com \
cloudresourcemanager.googleapis.com --project <<Enter Project Id>>
Choose the relevant instructions based on where you are running the notebook
For IDEs like Cloud Shell Editor, VS Code
For IDEs, adding the Jupyter extensions will automatically give you the option to change the kernel. If not, manually select the Python interpreter in your IDE (the exact path is shown in the cell above; it would look like e.g. /home/admin_/opendata/.venv/bin/python or ~cache/user/opendataqna/.venv/bin/python)
Proceed to Step 1 below
For Jupyter Lab or Jupyter Environments on Workbench etc
Create a kernel with the environment created
pip install jupyter
ipython kernel install --name "openqna-venv" --user
Restart your kernel, or close the existing notebook and open it again; you should now see "openqna-venv" in the kernel dropdown
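Once the kernel is selected, you can confirm from inside a notebook cell that it is actually using the Poetry-created environment. This is a quick sanity check, not part of the solution itself; the exact path will differ per machine:

```python
import sys

# The kernel's interpreter should point inside the Poetry-created .venv
# (e.g. .../opendataqna/.venv/bin/python), not at the system Python.
print(sys.executable)

# Poetry venv paths typically contain ".venv"; warn if this one does not.
if ".venv" not in sys.executable:
    print("Warning: this kernel may not be running inside the Poetry venv")
```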
What did we do here?
This notebook guides you through the setup and execution of the Open Data QnA application. It provides comprehensive instructions for setting up the solution.
This notebook guides you through reading the configuration you set up with 1_Setup_OpenDataQnA and running the pipeline to answer questions about your data.
If you want to load Known Good SQLs separately, run this notebook once the config variables are set up in the config.ini file. It can be run multiple times to load the known good SQL queries and create embeddings for them.
For setup we require details for the vector store, source database, etc. Edit the config.ini file and add values for the parameters based on the information below.
ℹ️ Follow the guidelines from the config guide document to populate your config.ini file.
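As an illustration of what reading config.ini looks like, here is a minimal sketch using Python's standard `configparser`. The section and key names below (`GCP`, `PROJECT_ID`, `CONFIG`, `VECTOR_STORE`) are placeholders for the kinds of values described above; refer to the config guide for the actual parameter names:

```python
import configparser

# Placeholder config content -- the real section/key names come from the
# config guide, not from this sketch.
config = configparser.ConfigParser()
config.read_string("""
[GCP]
PROJECT_ID = my-gcp-project

[CONFIG]
VECTOR_STORE = bigquery-vector
""")

# In the real setup you would call config.read("config.ini") instead.
project_id = config["GCP"]["PROJECT_ID"]
vector_store = config["CONFIG"]["VECTOR_STORE"]
print(project_id, vector_store)
```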
Sources to connect
Format for data_source_list.csv
source | user_grouping | schema | table
source - supported data sources. Options: bigquery, cloudsql-pg
user_grouping - logical grouping or use case name for tables from the same or different schema/dataset. When left blank, it defaults to the schema value in the next column
schema - schema name for PostgreSQL or dataset name in BigQuery
table - names of the tables to run the solution against. Leave this column blank after filling schema/dataset if you want to run the solution for the whole dataset/schema
Update data_source_list.csv according to your requirements.
Note that the source details filled in the CSV should already exist. If not, use the Copy Notebooks if you want the demo source set up.
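The defaulting rules above (blank `user_grouping` falls back to `schema`; blank `table` means the whole schema/dataset) can be sketched with Python's standard `csv` module. The sample rows are hypothetical, not real sources:

```python
import csv
import io

# Hypothetical data_source_list.csv content for illustration only.
sample = """source,user_grouping,schema,table
bigquery,MovieExplorer-bigquery,movies,films
cloudsql-pg,,retail,
"""

rows = []
for row in csv.DictReader(io.StringIO(sample)):
    # Blank user_grouping defaults to the schema/dataset name.
    row["user_grouping"] = row["user_grouping"] or row["schema"]
    # Blank table means "run against the whole schema/dataset".
    row["table"] = row["table"] or None
    rows.append(row)

for r in rows:
    print(r["source"], r["user_grouping"], r["table"])
```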
Enabled Data Sources:
pip install poetry --quiet
poetry lock
poetry install --quiet
poetry env info
poetry shell
Authenticate your credentials
gcloud auth login
or
gcloud auth application-default login
gcloud services enable \
serviceusage.googleapis.com \
cloudresourcemanager.googleapis.com --project <<Enter Project Id>>
gcloud auth application-default set-quota-project <<Enter Project Id for using resources>>
Enable APIs for the solution setup
gcloud services enable \
cloudapis.googleapis.com \
compute.googleapis.com \
iam.googleapis.com \
run.googleapis.com \
sqladmin.googleapis.com \
aiplatform.googleapis.com \
bigquery.googleapis.com \
firestore.googleapis.com --project <<Enter Project Id>>
python env_setup.py
The Open Data QnA SQL Generation tool can be conveniently used from your terminal or command prompt using a simple CLI interface. Here's how:
python opendataqna.py --session_id "122133131f--ade-eweq" --user_question "What are the 5 most common genres we have?" --user_grouping "MovieExplorer-bigquery"
Where
session_id : Keep this unique per conversation; use the same value for follow-up questions
user_question : Enter your question as a string
user_grouping : Enter the BQ_DATASET_NAME for BigQuery sources or PG_SCHEMA for PostgreSQL sources (refer your data_source_list.csv file)
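Since follow-up questions must share a session id, one way to manage this is to generate the id once per conversation and reuse it for each invocation. This is a hypothetical helper, not part of the library; it only prints the commands you would run:

```python
import uuid

# Generate ONE session id at the start of a conversation and reuse it
# for every follow-up question in that conversation.
session_id = str(uuid.uuid4())

questions = [
    "What are the 5 most common genres we have?",
    "Show only the top 3 of those",  # follow-up: same session_id
]

for q in questions:
    # Each call shells out to opendataqna.py with the SAME session_id.
    print(
        f'python opendataqna.py --session_id "{session_id}" '
        f'--user_question "{q}" --user_grouping "MovieExplorer-bigquery"'
    )
```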
Optional Parameters
You can customize the pipeline's behavior using optional parameters. Here are some common examples:
# Enable the SQL debugger:
python opendataqna.py --session_id="..." --user_question "..." --user_grouping "..." --run_debugger
# Execute the final generated SQL:
python opendataqna.py --session_id="..." --user_question "..." --user_grouping "..." --execute_final_sql
# Change the number of debugging rounds:
python opendataqna.py --session_id="..." --user_question "..." --user_grouping "..." --debugging_rounds 5
# Adjust similarity thresholds:
python opendataqna.py --session_id="..." --user_question "..." --user_grouping "..." --table_similarity_threshold 0.25 --column_similarity_threshold 0.4
You can find a full list of available options and their descriptions by running:
python opendataqna.py --help
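To make the flag set concrete, here is a minimal `argparse` sketch that mirrors the options shown in the examples above. The defaults here are illustrative only, not the tool's actual defaults (run `--help` for those):

```python
import argparse

# Sketch of a parser for the CLI flags documented above.
parser = argparse.ArgumentParser(description="Open Data QnA CLI (sketch)")
parser.add_argument("--session_id", required=True)
parser.add_argument("--user_question", required=True)
parser.add_argument("--user_grouping", required=True)
parser.add_argument("--run_debugger", action="store_true")
parser.add_argument("--execute_final_sql", action="store_true")
parser.add_argument("--debugging_rounds", type=int, default=2)
parser.add_argument("--table_similarity_threshold", type=float, default=0.3)
parser.add_argument("--column_similarity_threshold", type=float, default=0.3)

# Simulate one of the example invocations from this section.
args = parser.parse_args([
    "--session_id", "demo-session",
    "--user_question", "What are the 5 most common genres?",
    "--user_grouping", "MovieExplorer-bigquery",
    "--debugging_rounds", "5",
])
print(args.debugging_rounds, args.run_debugger)
```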
The provided Terraform configuration streamlines the setup of this solution and serves as a blueprint for deployment. The script provides a one-click, one-time deployment option. However, it doesn't include CI/CD capabilities and is intended solely for initial setup.
[!NOTE] The current version of the Terraform Google Cloud provider does not support deployment of a few resources; this solution uses null_resource to create those resources using the Google Cloud SDK.
Prior to executing Terraform, ensure that the steps mentioned below have been completed.
Firebase will be used to host the frontend of the application.
[!NOTE]
The Terraform apply command for this application uses gcloud config to fetch and pass the set project id to the scripts. Please ensure that gcloud config has been set to your intended project id before proceeding.

[!IMPORTANT]
The Terraform scripts require specific IAM permissions to function correctly. The user needs either the broad roles/resourcemanager.projectIamAdmin role or a custom role with tailored permissions to manage IAM policies and roles. Additionally, one script TEMPORARILY disables Domain Restricted Sharing Org Policies to enable the creation of a public endpoint. This requires the user to also have the roles/orgpolicy.policyAdmin role.
cd Open_Data_QnA/terraform
#If you are running this outside Cloud Shell you need to set up your Google Cloud SDK Credentials
gcloud config set project <your_project_id>
gcloud auth application-default set-quota-project <your_project_id>
gcloud services enable \
serviceusage.googleapis.com \
cloudresourcemanager.googleapis.com --project <<Enter Project Id>>
sh ./scripts/deploy-all.sh
This script will perform the following steps:
Auth Provider
You need to enable at least one authentication provider in Firebase, you can enable it using the following steps:
This should deploy your end-to-end solution in the project with a Firebase web URL
For detailed steps and known issues refer to README.md under /terraform
Deploy the backend APIs for the solution; refer to the README.md under /backend-apis. These APIs are designed to work with the frontend and provide access to run the solution.
Once the backend APIs are deployed successfully, deploy the frontend for the solution; refer to the README.md under /frontend.
If you successfully set up the solution accelerator and want to start optimizing it to your needs, you can follow the tips in the Best Practice doc.
Additionally, if you stumble across any problems, take a look at the FAQ.
If neither of these resources helps, feel free to reach out to us directly by raising an Issue.
To clean up the resources provisioned in this solution, use the commands below to remove them using gcloud/bq:
For cloudsql-pgvector as vector store : Delete SQL Instance
gcloud sql instances delete <CloudSQL Instance Name> -q
Delete BigQuery Dataset Created for Logs and Vector Store : Remove BQ Dataset
bq rm -r -f -d <BigQuery Dataset Name for OpenDataQnA>
(For Backend APIs)Remove the Cloud Run service : Delete Service
gcloud run services delete <Cloud Run Service Name>
For frontend, based on firebase: Remove the firebase app
This solution is subject to BigQuery quotas, which restrict how much of a shared Google Cloud resource your project can use, including hardware, software, and network components.
Open Data QnA is distributed with the Apache-2.0 license.
It also contains code derived from the following third-party packages:
This repository provides an open-source solution accelerator designed to streamline your development process. Please be aware that all resources associated with this accelerator will be deployed within your own Google Cloud Platform (GCP) instances.
It is imperative that you thoroughly test all components and configurations in a non-production environment before integrating any part of this accelerator with your production data or systems.
While we strive to provide a secure and reliable solution, we cannot be held responsible for any data loss, service disruptions, or other issues that may arise from the use of this accelerator.
By utilizing this repository, you acknowledge that you are solely responsible for the deployment, management, and security of the resources deployed within your GCP environment.
If you encounter any issues or have concerns about potential risks, please refrain from using this accelerator in a production setting.
We encourage responsible and informed use of this open-source solution.
If you have any questions or if you found any problems with this repository, please report through GitHub issues.