defog-ai / sqlcoder

SoTA LLM for converting natural language questions to SQL queries
Apache License 2.0
3.27k stars 205 forks

Reasoning with the sql output #66

Closed addyag93 closed 7 months ago

addyag93 commented 7 months ago

Hi, I have been trying the sqlcoder 34B model on my large database using pruning, but the model does not handle out-of-domain queries as well as expected. It always generates SQL, regardless of whether the referenced column is present in the schema. I have two questions:

Can you please help with the above?

Thanks Aditya

rishsriv commented 7 months ago

Hi Aditya, could you give us a couple of examples of the full prompt that you're using? That'll help us understand the problem you're facing better.

Also, could you clarify what you mean by out of domain queries? Thanks!

addyag93 commented 7 months ago

Hi,

I am using the same prompt as provided in the repo; I just added an Instructions section:

"""## Task
                    Generate a SQL query to answer the following question:""" + "\n"+\
                    user_query["query"] + \
                    """\n\n### Instructions
                    - If not sure with the query response, return I don't know."""+\
                    """\n\n ### Database Schema
                    This query will run on a database whose schema is represented in this string:\n"""+\
                    schema +"\n\n"+\
                    """### SQL
                      Given the database schema, here is the SQL query that answers """+user_query["query"] +"\n\n"+\
                    """```sql"""

I would rather call them in-domain out-of-domain queries, for example:

Any help would be very useful.

Thanks Aditya

rishsriv commented 7 months ago

Thanks! Could you please share your schema as well? That will help us understand this better and recommend a better prompt.

rishsriv commented 7 months ago

I see – you should use the defog/sqlcoder-7b-2 model for this. The 34B model will soon get these capabilities, but they have not been ported over just yet.

The new 7b model seems to handle this like you would expect! The image below is from our gradio app, which you can run here.

[image: screenshot from the gradio app]

Closing this for now, but please feel free to reopen if you still have issues!

addyag93 commented 7 months ago

@rishsriv Hi,

I have tested the sqlcoder-7b-2 model on this query. Sometimes it gives the answer as above; sometimes it hallucinates. I have also tried a few in-domain queries with this new model that were working fine with the 34B model. For those queries, it sometimes gives `SELECT 'I do not know' AS answer;`. The prompt I am using is from here. Inference parameters:

"do_sample": False,
        "max_new_tokens": 300,
        "temperature": 0.1,
        "repetition_penalty": 1.1,
        "return_full_text":False,
        "num_beams":4,

Can you please advise on how to prevent this model from hallucinating?

Thanks Aditya

rishsriv commented 7 months ago

Ah, keep your temperature at 0. With temperature 0.1, you will always get some randomness in the results.

Would recommend using the run_inference function here for best results.

addyag93 commented 7 months ago

@rishsriv So I have removed temperature from my SageMaker inference endpoint, as it requires the value to be positive, not 0.

I can try running with the run_inference function you shared instead of the SageMaker endpoint. Do you have any specific observations on using this inference path?

Also, my metadata.sql with column descriptions exceeds the accepted input limit of 2048 tokens, so I am using pruning. Do you think that pruning, since I am not giving the model the context of the other columns to compare against, might be causing the random responses?

Thanks Aditya

rishsriv commented 7 months ago

Try setting temperature to a tiny number (like 0.0001) – it will have roughly the same effect and remove randomness.

The CodeLlama context limit is 100k, so I'm not sure if the input limit is a SageMaker-specific thing. But with temperature=0 (or a number close enough to 0), you will always get deterministic responses that you can then fix with better prompting!
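The effect of a near-zero temperature can be checked directly: dividing the logits by the temperature before the softmax pushes the distribution toward a one-hot argmax, which is why a value like 0.0001 behaves essentially like greedy decoding. A minimal sketch with toy logits (not model output):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax; as T -> 0 this approaches argmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

warm = softmax_with_temperature([2.0, 1.0, 0.5], 1.0)     # noticeable spread
cold = softmax_with_temperature([2.0, 1.0, 0.5], 0.0001)  # nearly one-hot
```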

addyag93 commented 7 months ago

@rishsriv By input limit I mean the input token length, which is 2048 tokens for sqlcoder, so I need to apply pruning for large tables that go beyond the 2048-token limit.
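Schema pruning of this kind can be approximated by ranking CREATE TABLE statements by keyword overlap with the question; a naive sketch (a hypothetical helper, not defog's actual pruning logic):

```python
import re

def prune_schema(ddl: str, question: str, max_tables: int = 5) -> str:
    """Keep the CREATE TABLE statements whose identifiers overlap most
    with the words in the question (a naive relevance heuristic)."""
    tables = re.findall(r"CREATE TABLE.*?;", ddl, flags=re.S | re.I)
    q_words = set(re.findall(r"\w+", question.lower()))

    def score(stmt):
        # count distinct question words appearing anywhere in the statement
        return len(q_words & set(re.findall(r"\w+", stmt.lower())))

    ranked = sorted(tables, key=score, reverse=True)
    return "\n\n".join(ranked[:max_tables])
```

A real pruner would likely use embeddings or column descriptions rather than raw word overlap, but the shape of the problem is the same: select a schema subset that fits the token budget.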

Regarding prompting, you mean adding more instructions, right? Also, is there a way to get an explanation of how the SQL was generated? That way I could enhance the instructions.

Thanks Aditya