defog-ai / sqlcoder

SoTA LLM for converting natural language questions to SQL queries
Apache License 2.0
3.27k stars 205 forks source link

SQL Metadata Generation #73

Closed yashkhurana24 closed 5 months ago

yashkhurana24 commented 6 months ago

Hi guys, I'm dealing with a huge database (schema is about 63k llama tokens). I understand that the workaround is to use pruning.py from sql_eval. But the task of generating the column descriptions, the column NERs and the column_join is a lot of tedious work. Is there any way to automate this task even a little bit? Maybe a small portion?

Also, hats off to you guys for open-sourcing such an awesome model!

rishsriv commented 6 months ago

Hi there, you should be able to prompt the SQLCoder model to generate descriptions + column_join candidates, likely without having to generate the column NERs. Give it a shot? :D Happy to help if you run into issues around this!

yashkhurana24 commented 6 months ago

Thanks for the quick reply! Could you elaborate a bit more? Are you suggesting that I should pass the entire schema to the SQLCoder model and prompt it to generate individual column descriptions and the column_join candidates and then use the generated results in pruning.py? That sounds like a very interesting approach, just want to confirm if I got it right before I start implementing!

miteshgarg commented 6 months ago

@rishsriv : if possible please explain how we can use the prompting to get the column description and column join candidates?

rishsriv commented 6 months ago

You should be able to just do this table by table! Example below from a schema shared in #75

Prompt It's important to have this end with --, so that the model knows that it's expected to complete the comment

### Instructions
Please modify the database schema below, by adding a comment to each column

### Database Schema
CREATE TABLE cell_kpi_4g (
date VARCHAR,
circle_name VARCHAR,
site_name VARCHAR,
cell_name VARCHAR PRIMARY KEY,
frequency VARCHAR,
bandwidth_mhz VARCHAR,
technology VARCHAR,
vendor_name VARCHAR,
city_name VARCHAR,
pincode_zipcode_postalcode VARCHAR,
dl_data_volume_gb_gigabyte DECIMAL,
ul_data_volume_gb_gigabyte DECIMAL,
volte_voice_traffic_erlang DECIMAL
);

### Modified Schema
CREATE TABLE cell_kpi_4g (
date VARCHAR, --

Output

CREATE TABLE cell_kpi_4g (
date VARCHAR, -- Date of the measurement
circle_name VARCHAR, -- Name of the circle
site_name VARCHAR, -- Name of the site
cell_name VARCHAR PRIMARY KEY, -- Name of the cell
frequency VARCHAR, -- Frequency of the cell
bandwidth_mhz VARCHAR, -- Bandwidth of the cell
technology VARCHAR, -- Technology used in the cell
vendor_name VARCHAR, -- Vendor of the cell
city_name VARCHAR, -- City where the cell is located
pincode_zipcode_postalcode VARCHAR, -- Postal code of the city
dl_data_volume_gb_gigabyte DECIMAL, -- Downlink data volume in GB
ul_data_volume_gb_gigabyte DECIMAL, -- Uplink data volume in GB
volte_voice_traffic_erlang DECIMAL -- Voice traffic in Erlang
);

This is obviously not super accurate, as the model might not know exactly what you mean. But does the job most of the time!

miteshgarg commented 6 months ago

Hi @rishsriv : Just want to share a thought that in case of larger databases with 100s of tables, shouldn't a retriever system (RAG) system be integrated with the fine tuned llm model. I just read about RAG systems somewhere and was thinking if it can be incorporated along with sqlcoder. Not sure, how it can be done but do you have any plan to explore in this direction?

rishsriv commented 6 months ago

Yup we recommend having a RAG system for selecting the right table/column combinations

miteshgarg commented 6 months ago

Yup we recommend having a RAG system for selecting the right table/column combinations

Do you have any such implementation or reference which can be used?

miteshgarg commented 6 months ago

Yup we recommend having a RAG system for selecting the right table/column combinations

Do you have any such implementation or reference which can be used?

Hi @rishsriv: any suggestion on above?

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.