GoogleCloudPlatform / Open_Data_QnA

The Open Data QnA Python library enables you to chat with your databases by leveraging LLM agents on Google Cloud. It provides a conversational approach to interacting with your data by implementing state-of-the-art NL2SQL / Text2SQL methods.
Apache License 2.0

Clarification about workflow #15

Open rragundez opened 4 months ago

rragundez commented 4 months ago

I see that the CLI receives user_database as a parameter. Just to clarify:

Are the queries based on a single table, or can it also relate several tables by means of joins?

If I understand correctly, the LLM prompt built before hitting the SQL Generator combines several elements, such as the query plus some context about the tables. Can you share an example of that prompt? I guess it would need to include a lot of information in raw format (not vector format).

orpzs commented 4 months ago

@rragundez The user_database parameter refers to the schema/dataset against which you ask questions. It filters the rows in the vector embedding tables, which is essentially how the solution decides which source you are talking to.

Queries can be asked on multiple tables if they are all in the same schema/dataset.
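To make that concrete, here is a minimal sketch of the idea; the function name and row fields are hypothetical, not the library's actual code. The point is that the question embedding is only compared against embedding rows whose schema matches user_database.

```python
# Sketch only: data shapes and field names are hypothetical, not the
# library's actual schema. It shows the idea that the similarity search
# over the vector embedding table is scoped to one schema/dataset.
import numpy as np

def find_similar_tables(embedding_rows, question_embedding, user_database, top_k=5):
    """embedding_rows: iterable of dicts like
    {"table_schema": ..., "table_name": ..., "content": ..., "embedding": [...]}.
    Returns the top_k rows from the chosen schema/dataset, ranked by cosine
    similarity to the question embedding."""
    q = np.asarray(question_embedding, dtype=float)
    q = q / np.linalg.norm(q)

    scored = []
    for row in embedding_rows:
        if row["table_schema"] != user_database:  # scope to one schema/dataset
            continue
        v = np.asarray(row["embedding"], dtype=float)
        score = float(np.dot(q, v / np.linalg.norm(v)))
        scored.append((score, row))

    scored.sort(key=lambda item: item[0], reverse=True)
    return [row for _, row in scored[:top_k]]
```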

We are soon going to release v2 of this solution with multi-turn support, where the prompts are pulled out into a separate file so you can view them directly. For now, to understand the prompt you have to look at the individual .py files in the /agents and /dbconnectors folders.
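Until then, as a rough illustration of what goes into that prompt (this is a hand-written sketch, not the actual prompt text from the agent files), think of something like:

```python
# Illustrative only: the real prompt text lives in the agent .py files
# (e.g. /agents/BuildSQLAgent.py); the wording and structure here are assumed.
def build_sql_prompt(user_question, table_schemas, similar_examples):
    """Assemble a plain-text prompt from the question, retrieved table/column
    descriptions, and retrieved known-good question/SQL pairs."""
    schema_block = "\n\n".join(table_schemas)  # raw DDL / column descriptions
    examples_block = "\n\n".join(
        f"Question: {q}\nSQL: {sql}" for q, sql in similar_examples
    )
    return f"""You are a BigQuery SQL expert.

Only use the tables and columns described below.

Table schemas:
{schema_block}

Similar past questions and their SQL:
{examples_block}

User question: {user_question}
Generated SQL:"""
```

So yes, the retrieved context is injected into the prompt as raw text, not as vectors; the vectors are only used to decide which schemas and examples to retrieve.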

rragundez commented 4 months ago

If I understand you correctly, a caveat is that it will not be able to relate tables that live in different BigQuery datasets, but if they are in the same BigQuery dataset it will be able to relate them by means of joins in the generated SQL query?

Can you point me to the relevant prompt .py file used before hitting the SQL generator? Thanks.

orpzs commented 4 months ago

Yes, for the main branch your statement holds true. If you have a use case that requires joining tables from different BigQuery datasets, please consider using the v2-draft branch. Word of caution: v2 has additional features such as multi-turn support and dynamic, YAML-based reading of prompts, so please go through its readme first.
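For example, on the main branch the generator can produce a join like the one below because both tables sit in the same dataset; the project, dataset, and table names are made up for illustration:

```python
# Hypothetical example of the kind of query the generator can produce on the
# main branch: both tables live in the same BigQuery dataset, so a join works.
same_dataset_join = """
SELECT o.order_id, c.customer_name, o.total
FROM `my-project.sales.orders` AS o
JOIN `my-project.sales.customers` AS c
  ON o.customer_id = c.customer_id
"""

# A cross-dataset join, e.g. `my-project.sales.orders` joined with
# `my-project.marketing.campaigns`, is the case that needs the v2-draft branch.
```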

Regarding the code for SQL generation, please have a look at /agents/BuildSQLAgent.py.

hnegi01 commented 3 months ago

Hello,

Do you think we should generate and add entries to knowgoodsql whenever a successful query is generated? Sometimes the response for the same question is generated successfully and sometimes it gives errors. I was testing using the BigQuery public dataset sample_ecom.

My concern is that if it generates a wrong query (semantically wrong, not a syntax error) that returns wrong SQL results, and we store it, then we will always reuse that same SQL. Maybe a flag needs to be added to knowgoodsql so entries can be labeled with human intervention, and an entry is only used when it is marked as correct?
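Something along these lines is what I have in mind; the table and column names below are only placeholders, not the solution's actual schema:

```python
# Placeholder sketch of the suggestion: store each generated SQL with a
# human-set "verified" flag, and only reuse rows a human has approved.
# Table and column names are made up, not the solution's actual schema.
CREATE_KGQ = """
CREATE TABLE IF NOT EXISTS known_good_sql (
    user_question  TEXT,
    generated_sql  TEXT,
    user_database  TEXT,
    verified       BOOLEAN DEFAULT FALSE   -- set to TRUE only after human review
)
"""

FETCH_VERIFIED_EXAMPLES = """
SELECT user_question, generated_sql
FROM known_good_sql
WHERE user_database = %(user_database)s
  AND verified = TRUE                      -- ignore unreviewed entries
"""
```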

Thanks

orpzs commented 3 months ago

Yes, that functionality is already provided in the demo UI. There is a thumbs-up button that inserts the question and SQL back into the SQL embedding table. Please have a look at the Backend APIs section, where you can see the /embed_sql URI.
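If you want to do the same thing from a script, the call looks roughly like this; the base URL and payload keys are assumptions on my part, so please verify them against the Backend APIs section of the readme:

```python
# Rough sketch of posting a human-approved question/SQL pair to the backend.
# The /embed_sql URI comes from the Backend APIs docs; the payload keys and
# base URL here are assumptions, so check them against the readme.
import requests

BACKEND_URL = "http://localhost:8080"  # assumed deployment address

payload = {
    "user_question": "How many orders were placed last month?",
    "generated_sql": "SELECT COUNT(*) FROM `my-project.sales.orders` ...",
    "user_database": "sales",
}

resp = requests.post(f"{BACKEND_URL}/embed_sql", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```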