defog-ai / sqlcoder

SoTA LLM for converting natural language questions to SQL queries
Apache License 2.0
3.27k stars 205 forks

Reasoning with the sql output #66

Closed addyag93 closed 7 months ago

addyag93 commented 7 months ago

Hi, I have been trying the sqlcoder 34B model on my large database using pruning, but the model does not handle out-of-domain queries as well as expected. It always generates SQL, regardless of whether the referenced column is present in the schema. I have two questions:

Can you please help with the above?

Thanks Aditya

rishsriv commented 7 months ago

Hi Aditya, could you give us a couple of examples of the full prompt that you're using? That'll help us understand the problem you're facing better.

Also, could you clarify what you mean by out of domain queries? Thanks!

addyag93 commented 7 months ago

Hi,

I am using the same prompt as provided in the repo; I just added an Instructions section:

"""## Task
                    Generate a SQL query to answer the following question:""" + "\n"+\
                    user_query["query"] + \
                    """\n\n### Instructions
                    - If not sure with the query response, return I don't know."""+\
                    """\n\n ### Database Schema
                    This query will run on a database whose schema is represented in this string:\n"""+\
                    schema +"\n\n"+\
                    """### SQL
                      Given the database schema, here is the SQL query that answers """+user_query["query"] +"\n\n"+\
                    """```sql"""

I would rather call them in-domain out-of-domain queries, for example:

Any help would be very useful.

Thanks Aditya

rishsriv commented 7 months ago

Thanks! Could you please share your schema as well? That will help us understand this better and recommend a better prompt.

rishsriv commented 7 months ago

I see – you should use the defog/sqlcoder-7b-2 model for this. The 34B model will soon get these capabilities, but they have not been ported over just yet.

The new 7b model seems to handle this like you would expect! The image below is from our gradio app, which you can run here.

[image: screenshot from the gradio app]

Closing this for now, but please feel free to reopen if you still have issues!

addyag93 commented 7 months ago

@rishsriv Hi,

I have tested the sqlcoder-7b-2 model on this query. Sometimes it gives the answer as above; sometimes it hallucinates. I have also tried a few in-domain queries with this new model that were working fine with the 34B model. For those queries, it sometimes gives `SELECT 'I do not know' AS answer;`. The prompt I am using is from here. Inference parameters:

"do_sample": False,
        "max_new_tokens": 300,
        "temperature": 0.1,
        "repetition_penalty": 1.1,
        "return_full_text":False,
        "num_beams":4,

Can you please advise on how to prevent this model from hallucinating?

Thanks Aditya

rishsriv commented 7 months ago

Ah, keep your temperature at 0. With temperature 0.1, you will always get some randomness in the results.

Would recommend using the run_inference function here for best results.

addyag93 commented 7 months ago

@rishsriv So I have removed temperature from my SageMaker inference endpoint, as it requires the value to be positive, not 0.

I can try running with the run_inference function you shared instead of the SageMaker endpoint. Do you have any specific observations on using this inference path?

Also, my metadata.sql with column descriptions exceeds the accepted input limit of 2048 tokens, so I am using pruning. Do you think that pruning, since I am not giving the model the context of the other columns to compare against, might be causing the random responses?

Thanks Aditya

rishsriv commented 7 months ago

Try setting temperature to a tiny number (like 0.0001) – it will have roughly the same effect and remove randomness.

The CodeLlama context limit is 100k, so I'm not sure if the input limit is a SageMaker-specific thing. But with temperature=0 (or a number close enough to 0), you will always get deterministic responses that you can then fix with better prompting!
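The effect of a near-zero temperature can be checked directly: dividing the logits by the temperature before the softmax pushes the distribution toward a one-hot argmax, which is why a value like 0.0001 behaves essentially like greedy decoding. A minimal sketch with toy logits (not model output):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax; as T -> 0 this approaches argmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

warm = softmax_with_temperature([2.0, 1.0, 0.5], 1.0)     # noticeable spread
cold = softmax_with_temperature([2.0, 1.0, 0.5], 0.0001)  # nearly one-hot
```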

addyag93 commented 7 months ago

@rishsriv By input limit I mean the input token length, which is 2048 tokens for sqlcoder, so I need to apply pruning for large tables that go beyond the 2048-token limit.
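Schema pruning of this kind can be approximated by ranking CREATE TABLE statements by keyword overlap with the question; a naive sketch (a hypothetical helper, not defog's actual pruning logic):

```python
import re

def prune_schema(ddl: str, question: str, max_tables: int = 5) -> str:
    """Keep the CREATE TABLE statements whose identifiers overlap most
    with the words in the question (a naive relevance heuristic)."""
    tables = re.findall(r"CREATE TABLE.*?;", ddl, flags=re.S | re.I)
    q_words = set(re.findall(r"\w+", question.lower()))

    def score(stmt):
        # count distinct question words appearing anywhere in the statement
        return len(q_words & set(re.findall(r"\w+", stmt.lower())))

    ranked = sorted(tables, key=score, reverse=True)
    return "\n\n".join(ranked[:max_tables])
```

A real pruner would likely use embeddings or column descriptions rather than raw word overlap, but the shape of the problem is the same: select a schema subset that fits the token budget.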

Regarding prompting, you mean adding more instructions, right? Also, is there a way to get an explanation of how the SQL was generated? That way I could enhance the instructions.

Thanks Aditya