defog-ai / sql-eval

Evaluate the accuracy of LLM generated outputs
Apache License 2.0
485 stars · 52 forks

Why, when I run the eval benchmark, are the category names different from the blog? #94

Closed cometyang closed 4 months ago

cometyang commented 4 months ago

It only shows the following query_categories: date_functions, group_by, instruct, order_by, ratio, table_join.

This differs from the categories listed in the blog at https://defog.ai/blog/open-sourcing-sqlcoder-70b/: date | group_by | order_by | ratio | join | where.

Does "instruct" map to "where"?

wongjingping commented 4 months ago

Hi @cometyang, yes, we have recently modified the benchmark to better reflect the capabilities that are most useful to users.

cometyang commented 4 months ago

@wongjingping thanks for the clarification. One more question: my test results varied from the blog report, but were similar. Are the reported results the average or the best? For example, did you run 5 times and average?

wongjingping commented 4 months ago

@cometyang Yes, a slight variation is expected, as we fixed a few smaller issues with the past questions / answers. We use deterministic methods for generation (e.g. setting the seed) and just 1 run for all of our runners. While there might be slight differences from the latest reports, the results should be the same between consecutive runs. Please let us know if you notice otherwise.
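
For intuition, here is a minimal, self-contained sketch of what seeded, greedy (deterministic) generation looks like with Hugging Face transformers; the tiny gpt2 model and token counts are placeholders, not the actual sql-eval runner configuration:

```python
from transformers import pipeline, set_seed

# Minimal reproducibility sketch (not the actual sql-eval runner code):
# with a fixed seed and greedy decoding, consecutive runs give identical output.
set_seed(0)  # seeds the Python, NumPy and PyTorch RNGs in one call
generator = pipeline("text-generation", model="gpt2")  # tiny stand-in model

out1 = generator("SELECT", do_sample=False, max_new_tokens=20)[0]["generated_text"]
set_seed(0)
out2 = generator("SELECT", do_sample=False, max_new_tokens=20)[0]["generated_text"]
assert out1 == out2  # deterministic: the generated text is identical across runs
```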

cometyang commented 4 months ago

@wongjingping I tested sqlcoder-70b-alpha; sql-eval reported 87% correctness, which is 0.5% better than gpt-4-0613. Though the number is not 93%, it is still an achievement. I'm just wondering whether this variance is normal. I use vLLM as the inference framework.

wongjingping commented 4 months ago

Hi @cometyang thanks for sharing the discrepancy there. Although the instruct category is a replacement for the where category, the questions are mostly similar, except that we now include instructions for these questions, which are added to the start of the prompt to help the model answer the question. However, the 70b model was not trained to follow instructions and my guess is that it might be confused by the instructions (i.e. your results are reasonable and expected). To confirm my hypothesis, can you confirm if you see a drop in the instruct category (vs other categories)? In addition, I was wondering if you were running the model quantized or in float16?
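
To make the distinction concrete, here is a hypothetical sketch of how an instruct-category question differs from a plain where-style question; the question text, instructions, and variable names are made up for illustration and are not sql-eval's actual prompt-assembly code:

```python
# Hypothetical illustration: the instruct category prepends per-question
# instructions to the prompt; a model not trained to follow instructions
# may be confused by this extra text.
question = "Which customers placed orders in 2023?"
instructions = (
    "Filter out customers marked as deleted and order the results by customer_id."
)

# where-style: just the question
plain_prompt = f"Generate a SQL query to answer the following question: {question}"

# instruct-style: instructions added to the start of the prompt
instruct_prompt = (
    f"{instructions}\n\n"
    f"Generate a SQL query to answer the following question: {question}"
)
```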

cometyang commented 4 months ago

I ran the float16 version, and I see a drop in the date_functions, order_by, and instruct categories.

rishsriv commented 4 months ago

Hi there, thanks for reporting. Apologies for the delay on this. We finally got the chance to try and replicate this again.

The 70b model was trained on a slightly different prompt than the current prompt.md file in this directory. It was trained on the prompt below, which is also what the model was originally tested with.

````
### Task
Generate a SQL query to answer the following question:
`{user_question}`

### Database Schema
The query will run on a database with the following schema:
{table_metadata_string}

### Answer
Given the database schema, here is the SQL query that answers `{user_question}`:
```
````

With vllm and using this prompt, we get 92.5% accuracy (a little down from 93.5%, likely because of minor changes in sql-eval since then). You can use the command below to replicate this. You can likely get higher accuracy if you use transformers `model.generate` directly.

```bash
python -W ignore main.py \
  -db postgres \
  -o "results/vllm.csv" \
  -g vllm \
  -f "/path/to/updated_prompt.md" \
  -m defog/sqlcoder-70b-alpha
```

If you don't get these results, please do let us know what runner and device you're using! Would love to get to the bottom of the cause of the variance.
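
For reference, a rough sketch of the transformers `model.generate` route mentioned above; the 4-beam, float16 settings mirror what is discussed in this thread, while the prompt values and token limit are placeholder assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "defog/sqlcoder-70b-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # float16, as used in the runs above
    device_map="auto",          # shard across the available GPUs
)

# Fill the prompt template shown above; these example values are made up.
prompt = open("updated_prompt.md").read().format(
    user_question="How many orders were placed in 2023?",
    table_metadata_string="CREATE TABLE orders (id int, placed_at timestamp);",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    num_beams=4,         # the benchmark numbers use 4-beam search
    do_sample=False,
    max_new_tokens=400,  # assumed limit
)
# Print only the newly generated tokens (the SQL completion).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```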

cometyang commented 4 months ago

@rishsriv I tried your updated prompt; I also needed to change the eval code, since the SQL query now comes after "sql\n", and I got 86.5% accuracy. My test setup is an A100 80G, and I use the API runner:

```bash
python main.py \
  -db postgres \
  -o results.sql70b_old_template.csv \
  -g api \
  -b 1 \
  -f prompts/updated_prompt.md \
  --api_url "your ip address/generate" \
  -p 5
```

rishsriv commented 4 months ago

Ah, the benchmarks are all done with 4 beams. If using the API runner, you should use `-b 4` to reproduce the benchmarks. `-b 1` uses greedy decoding, and the 70b-alpha model doesn't work nearly as well with that:

```bash
python main.py \
  -db postgres \
  -o results.sql70b_old_template.csv \
  -g api \
  -b 4 \
  -f prompts/updated_prompt.md \
  --api_url "your ip address/generate" \
  -p 5
```
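
For intuition, the two settings correspond roughly to the following vLLM `SamplingParams`; this is a sketch, not the sql-eval runner code, and the exact flags depend on your vLLM version (older releases exposed beam search via `use_beam_search`/`best_of` on `SamplingParams`):

```python
from vllm import SamplingParams

# -b 1: greedy decoding, a single sequence (temperature 0)
greedy = SamplingParams(temperature=0, max_tokens=400)

# -b 4: beam search over 4 candidates, returning the single best sequence.
# Valid for vLLM versions that accept use_beam_search on SamplingParams.
beam_4 = SamplingParams(
    temperature=0,
    use_beam_search=True,
    best_of=4,
    n=1,
    max_tokens=400,
)
```
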
cometyang commented 4 months ago

After changing to 4 beams, the results improved to 88.5%.

rishsriv commented 4 months ago

Huh, that's strange. Unfortunately, there's not much we can do to help with that. When running the vllm runner on multiple 4xA100 80GB runpod instances with 4 beams (the default config for the vllm runner), we could confirm 92.5% accuracy. Hope you manage to figure it out!

```bash
python -W ignore main.py \
  -db postgres \
  -o "results/vllm.csv" \
  -g vllm \
  -f "/path/to/updated_prompt.md" \
  -m defog/sqlcoder-70b-alpha
```

cometyang commented 4 months ago

Would you mind sharing your vLLM server command line? I also tried to run on 4xA100.

rishsriv commented 4 months ago

You can just run a Python process with vllm using the command I included above. You do not have to create a vllm server.
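
In case it helps, offline (in-process) vLLM generation looks roughly like this; the tensor parallel size of 4 matches the 4xA100 setup discussed above, while the prompt and sampling settings are placeholders:

```python
from vllm import LLM, SamplingParams

# Offline vLLM usage: the model runs inside this Python process, no HTTP server needed.
llm = LLM(
    model="defog/sqlcoder-70b-alpha",
    tensor_parallel_size=4,  # shard across 4 GPUs, e.g. 4xA100 80GB
    dtype="float16",
)

params = SamplingParams(temperature=0, max_tokens=400)  # placeholder settings
outputs = llm.generate(["<your filled-in prompt here>"], params)
print(outputs[0].outputs[0].text)
```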

cometyang commented 4 months ago

I run the model and the postgres database on different machines; that's the reason I created a vllm server. The sql-70b's performance (86.5%) is close to sql-7b-v2's (84.5%) with b=1 in my tests. I need to test on other types of databases and see the difference. If I understand correctly, the best-supported db type is postgres, but other dbs should also be okay, right?