defog-ai / sqlcoder

SoTA LLM for converting natural language questions to SQL queries
Apache License 2.0
3.27k stars 205 forks source link

[Edited for clarity] Column name hallucination when different tables have similar names #75

Closed miteshgarg closed 5 months ago

miteshgarg commented 6 months ago

Hi Defog Team,

First of all, I would like to congratulate the team for amazing work with new sqlcoder-7b. It is quite fast and giving pretty accurate queries. One issue I have observed when I am giving similar tables of different granularity like circle and cell then it gets confused.

Whenever there is a question with both cells and circle, the model seems to hallucinate the cell column in circle table. Any suggesiton on how we can fix this?

Query: "circle wise daily count of cells" Answer: SELECT c.circle_name, ck.date, COUNT(DISTINCT ck.cell_name) AS number_of_cells FROM circle_kpi_4g ck JOIN cell_kpi_4g c ON ck.circle_name = c.circle_name GROUP BY c.circle_name, ck.date ORDER BY c.circle_name, ck.date NULLS LAST;

Schema Used: CREATE TABLE circle_kpi_4g ( date VARCHAR, -- Date of KPI captured circle_name VARCHAR PRIMARY KEY, -- Unique Name of Circle dl_data_volume_gb_gigabyte DECIMAL, ul_data_volume_gb_gigabyte DECIMAL, volte_voice_traffic_erlang DECIMAL );

CREATE TABLE cell_kpi_4g ( date VARCHAR, -- Date of KPI captured circle_name VARCHAR, -- Unique Name of Circle site_name VARCHAR, -- Unique Name of Site cell_name VARCHAR PRIMARY KEY, -- Unique Name of Cell frequency VARCHAR, -- Frequency band of Cell bandwidth_mhz VARCHAR, -- Bandwidth for Cell technology VARCHAR, -- Technology of Cell vendor_name VARCHAR, -- Vendor Name of Cell city_name VARCHAR, -- Unique Name of City pincode_zipcode_postalcode VARCHAR, - Unique ID of pincode dl_data_volume_gb_gigabyte DECIMAL, ul_data_volume_gb_gigabyte DECIMAL, volte_voice_traffic_erlang DECIMAL );

miteshgarg commented 6 months ago

Another Issue example could be: count of cells per circle with dl data volume> 10 gb SELECT c.circle_name, COUNT(DISTINCT ck.cell_name) AS number_of_cells FROM circle_kpi_4g ck JOIN cell_kpi_4g c ON ck.circle_name = c.circle_name WHERE ck.Data_Volume_DL_GB > 10 GROUP BY c.circle_name ORDER BY number_of_cells DESC NULLS LAST;

rishsriv commented 6 months ago

Thanks for opening this issue, Mitesh! We're working on this.

We've noticed that the model can sometimes get confused between similar sounding table names. We're currently generating training data to fix this, and should be able to update the model weights by early next week.

Will keep this issue open until then, and tag you once this fixed and the model weights have been updated. Thanks!

miteshgarg commented 6 months ago

Thanks for the quick response @rishsriv . Looking forward for the updated weights. Since you are already working on fixing some issue, there is one more strange observation I want to share:

I am trying to do, day on day comparison. When I put "day on day delta in total data volume", it give correct result, takes lag and calculates delta but when the query is like "tell me day on day delta in total data volume", it just do the daily sum of volume.

Any suggestion to make the model more stable?

rishsriv commented 6 months ago

Thanks for sharing. For specific examples, it's easiest to just add an "Instructions" heading to the prompt. For example, this will solve your problem. You can always append more instructions in this section if needed.

### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Instructions
- if asked for a day on day or month on month comparison, remember to use the LAG function
- some other instruction...
- yet another instruction...

### Database Schema
The query will run on a database with the following schema:
{schema}

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
[SQL]
miteshgarg commented 6 months ago

Thanks for opening this issue, Mitesh! We're working on this.

We've noticed that the model can sometimes get confused between similar sounding table names. We're currently generating training data to fix this, and should be able to update the model weights by early next week.

Will keep this issue open until then, and tag you once this fixed and the model weights have been updated. Thanks!

Hi @rishsriv : Awaiting for the new model weights. Meanwhile, have you also tried to control this hallucination by using the Instructions or Glossary. I was trying with multiple kind of statement but nothing is working.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.