defog-ai / sqlcoder

SoTA LLM for converting natural language questions to SQL queries
Apache License 2.0
3.41k stars 218 forks source link

defog/llama-3-sqlcoder-8b Examples #100

Closed Totoro8697 closed 5 months ago

Totoro8697 commented 6 months ago

Hi, I tried defog/llama-3-sqlcoder-8b but its not create good answers for questions. And I wonder "Am I doing things wrong?". I tried many things, but the model does not generate suitable oracle sql or sqllite statements. So is there any example for usage the new model with together prompt, ddl, question-answer?

Thanks in advance.

rishsriv commented 6 months ago

Hi there, thanks for opening this issue! sqlcoder is optimized for Postgres, Redshift, and Snowfalke – and unfortunately won't yet generate useful SQL for other variants just yet.

We hope to support more sql variants in the next month or so.

Totoro8697 commented 6 months ago

But how should be our prompt, can u demonstrate of it? For example I am using below setup for my data.

prompt = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Generate a SQL query to answer this question: `{user_question}`
You are a data analyst and interpreter. You will be given information about a specific dataset, including its columns and their meanings:
year_month: This column is in year-month format, example 2020-12-01
paycell_active_customer: This column represents the number of active customers of Paycell
dcb_active_customer: Represents active customers in mobile payment (dcb)
paycell_card_active_customer: Represents the number of active users using Paycell card
app_active_customer: Represents the number of active customers on our app
dcb_total_volume: Column representing the total volume of mobile payment (dcb) at the mentioned time
total_card_activation_number: Represents the total activation status of our Paycell card
physical_card_activation_number: This column represents the activation number of active physical cards in Paycell
virtual_card_activation_number: This column represents the activation number of virtual cards
paycell_app_financial_transaction_number: Represents the number of financial transactions made by users using our Paycell app based on their transactions on the relevant date
active_customer_distribution_only_dcb: Represents the number of mobile payment (dcb) on the relevant date in the active customer distribution
active_customer_distribution_only_pcard: Represents the number of Paycell card in the active customer distribution on the relevant date
active_customer_distribution_only_app: Represents the number of app users in the active customer distribution on the relevant date
active_customer_distribution_dcb_and_pcard: Represents the number of users using both mobile payment (dcb) and Paycell card but not using Paycell app on the relevant date in the active customer distribution
active_customer_distribution_dcb_and_app: Represents the number of users using both mobile payment (dcb) and Paycell app but not using Paycell card on the relevant date in the active customer distribution
active_customer_distribution_pcard_and_app: Represents the number of users using both Paycell card and Paycell app but not using mobile payment (dcb) on the relevant date in the active customer distribution
active_customer_distribution_dcb_pcard_and_app: Represents the number of users using Paycell card, Paycell app, and mobile payment (dcb) on the relevant date in the active customer distribution.

If you cannot answer the question with the available database schema, return 'I do not know'

Using this information, you will set up the most suitable SQL statements to answer questions related to the data. These SQL statements will later be executed, so they should be error-free, compatible with Sqllite syntax, and accurately respond to the questions asked. Do not express an opinion or try to explain. Return only the SQL statement. Remember your sql statement will run on SQLite so syntax should be correct. Your output should be clear and void like this output 'Here is the SQL statement to answer the question:'. Also, do not add any comment to the SQL statement which you will generate. Only return SQL statement.

DDL statements:
CREATE TABLE "dashboard_ana_sayfa" (
"year_month" TIMESTAMP,
  "paycell_active_customer" INTEGER,
  "dcb_active_customer" INTEGER,
  "paycell_card_active_customer" INTEGER,
  "app_active_customer" INTEGER,
  "dcb_total_volume" REAL,
  "total_card_activation_count" INTEGER,
  "physical_card_activation_count" INTEGER,
  "virtual_card_activation_count" INTEGER,
  "paycell_app_financial_transaction_count" INTEGER,
  "active_customer_distribution_only_dcb" INTEGER,
  "active_customer_distribution_only_pcard" INTEGER,
  "active_customer_distribution_only_app" INTEGER,
  "active_customer_distribution_dcb_and_pcard" INTEGER,
  "active_customer_distribution_dcb_and_app" INTEGER,
  "active_customer_distribution_pcard_and_app" INTEGER,
  "active_customer_distribution_dcb_pcard_and_app" INTEGER
)<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The following SQL query best answers the question `{user_question}`:
```sql"""

model_name = "defog/llama-3-sqlcoder-8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if available_memory > 15e9:
    # if you have atleast 15GB of GPU memory, run load the model in float16
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        #load_in_4bit=True,
        device_map="auto",
        use_cache=True,
        #low_cpu_mem_usage=True 
    )
    #model = AutoModelForCausalLM.from_pretrained(
    #    model_name,
    #    trust_remote_code=True,
    #    quantization_config=nf4_config,
    #    use_cache=True
    #)
else:
    # else, load in 8 bits – this is a bit slower
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        # torch_dtype=torch.float16,
        load_in_8bit=True,
        device_map="auto",
        use_cache=True,
    )

import sqlparse

def generate_query(question):
    updated_prompt = prompt.format(user_question=question)
    inputs = tokenizer(updated_prompt, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=400,
        do_sample=False,
        #num_beams=4,
    )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # empty cache so that you do generate more results w/o memory crashing
    # particularly important on Colab – memory management is much more straightforward
    # when running on an inference service
    #return sqlparse.format(outputs[0].split("[SQL]")[-1], reindent=True)
    return sqlparse.format(outputs[0].split("```sql")[-1], reindent=True)
FaizelK commented 6 months ago

Hi there, thanks for opening this issue! sqlcoder is optimized for Postgres, Redshift, and Snowfalke – and unfortunately won't yet generate useful SQL for other variants just yet.

We hope to support more sql variants in the next month or so.

@rishsriv how do we go about training mysql while we wait for you in the next month? and thanks you so much for actually considering other sql variants (mysql)

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

gramcha commented 3 months ago

I was struggling to understand the example prompt. I checked the API query implementation in this repo to get an idea. You need to ask the question in a specific prompt template. Every time part of the template, replace the ddl, and question.

### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'

### Database Schema
The query will run on a database with the following schema:
{ddl}

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
[SQL]

replace {question} with your question and replace (ddl} with table schema

for example:

### Task
Generate a SQL query to answer [QUESTION]show orders that are having sales order number s1 and purchase order number p2 and if order is delayed.[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'

### Database Schema
The query will run on a database with the following schema:
  CREATE TABLE public.orders (
  id bigserial NOT NULL,
  client_id int8 NOT NULL,
  shipment_type varchar(255) NOT NULL,
  sales_order_number varchar(64) NULL,
  purchase_order_number varchar(64) NULL,
  incoterm varchar(255) NOT NULL,
  order_number varchar(255) NOT NULL,
  status varchar(255) NOT NULL,
  order_date varchar(255) NULL,
  created_by int8 NOT NULL,
  edd varchar(255) NOT NULL,
  created_at timestamptz NOT NULL,
  updated_at timestamptz NOT NULL,
  "type" varchar(255) NULL,
  priority varchar(255) NULL,
  delayed bool NULL,
  route_available bool NULL,
  contract_available bool NULL,
  confirmed_edd varchar(255) NULL,
  confirmed_ship_date varchar(255) NULL,
  CONSTRAINT orders_pkey PRIMARY KEY (id)
);
CREATE INDEX client_id ON public.orders USING btree (id, client_id);
CREATE INDEX client_id_orders ON public.orders USING btree (client_id);
CREATE INDEX id ON public.orders USING btree (id);
CREATE INDEX order_client_id_create_by ON public.orders USING btree (client_id, created_by);
CREATE INDEX order_split_no ON public.orders USING btree (order_number, client_id);

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]show orders that are having sales order number s1 and purchase order number p2 and if order is delayed.[/QUESTION]
  [SQL]

I executed this model using Ollama and the above prompt and was able to get a result.

SELECT o.id FROM public.orders o WHERE o.sales_order_number = 's1' AND o.purchase_order_number = 'p2' AND o.delayed = TRUE;
image

Once you got result before firing the next question, execute /clear for clearing session context; without that, it combines previous questions to current one then behaves unexpected way.