defog-ai / sqlcoder

SoTA LLM for converting natural language questions to SQL queries
Apache License 2.0
3.32k stars 210 forks source link

How can I add examples of questions + expected SQL over my own schema (zero-shot style)? #48

Closed orcaman closed 7 months ago

orcaman commented 10 months ago

Some of our user questions are a bit tricky - for example, Geo Spatial queries.

I would like to provide the model with zero-shot examples of user questions and the expected SQL.

I've tried adding this to the prompt (an examples section in the prompt.md). For example this is my prompt markdown:

    # Task
    Generate a SQL query to answer the following question:
    `{user_question}`

    # Database Schema
    The query will run on a database with the following schema:
    {table_metadata_string}

    # Examples
    ## Example 1
    Question: What place has the most cars?
    SQL: SELECT places.name, COUNT(*) as cnt FROM cars JOIN places ON ST_Contains(ST_GeometryFromText(places.wkt_polygon), ST_Point(cars.latitude, cars.longitude)) GROUP BY places.name ORDER BY cnt DESC LIMIT 1;

    ## Example 2
    Question: What is the sum of bookings for each month in 2023?
    SQL: SELECT SUM(total_cost) AS total_cost, EXTRACT(MONTH FROM start_date) as month FROM car_bookings WHERE YEAR(start_date) = 2023 GROUP BY EXTRACT(MONTH FROM start_date) ORDER BY month;

    # SQL

But when I do this, the model always returns the SQL for the first example, regardless of the user question.

Any hints on how to approach this?

Thanks!

NPap0 commented 10 months ago

Some of our user questions are a bit tricky - for example, Geo Spatial queries.

I would like to provide the model with zero-shot examples of user questions and the expected SQL.

I've tried adding this to the prompt (an examples section in the prompt.md). For example this is my prompt markdown:

    # Task
    Generate a SQL query to answer the following question:
    `{user_question}`

    # Database Schema
    The query will run on a database with the following schema:
    {table_metadata_string}

    # Examples
    ## Example 1
    Question: What place has the most cars?
    SQL: SELECT places.name, COUNT(*) as cnt FROM cars JOIN places ON ST_Contains(ST_GeometryFromText(places.wkt_polygon), ST_Point(cars.latitude, cars.longitude)) GROUP BY places.name ORDER BY cnt DESC LIMIT 1;

    ## Example 2
    Question: What is the sum of bookings for each month in 2023?
    SQL: SELECT SUM(total_cost) AS total_cost, EXTRACT(MONTH FROM start_date) as month FROM car_bookings WHERE YEAR(start_date) = 2023 GROUP BY EXTRACT(MONTH FROM start_date) ORDER BY month;

    # SQL

But when I do this, the model always returns the SQL for the first example, regardless of the user question.

Any hints on how to approach this?

Thanks!

Maybe special instructions now can help with this by providing the examples in the special instructions tab? (Talking about the browser demo application)

buddyroo30 commented 8 months ago

Yes, I'd like to know this too. In general, it seems the open source release of their SQL generating LLM is not well supported: there really isn't much documentation for it, and they don't really answer issues that are submitted here at GitHub. I did find this which is a "Schema Cookbook" for their API (presumably a paid product): https://defog.notion.site/Cookbook-for-schema-definitions-1650a6855ea447fdb0be75d39975571b and they cover this info, but it isn't clear how the API calls could be translated into a format you would send to SQLcoder. I guess they are more focused on monetizing this, which makes sense and is their right, but they did release this open source and it would be nice if they supported its use as an open source product too.

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

rishsriv commented 7 months ago

Hi there, apologies – I had totally compeletely missed this earlier.

If still relevant (or for future visitors that come in from Google): you can prompt the SQLCoder-7b-2 model with reference queries, like this:


Generate a SQL query to answer [QUESTION]{user_question}[/QUESTION]

### Instructions
{glossary}

### Database Schema
The query will run on a database with the following schema:
{table_metadata_string}

### Reference Queries
If relevant to the question asked, you can use the following queries as references for answering the question:
EXAMPLE QUESTION 1: YOUR QUESTION
EXAMPLE QUERY 1: YOUR_QUERY (ideally as a single line)

EXAMPLE QUESTION 2: YOUR QUESTION
EXAMPLE QUERY 2: YOUR_QUERY (ideally as a single line)

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{user_question}[/QUESTION]
[SQL]```