Closed by cometyang 4 months ago
Hi @cometyang yes, we have modified the benchmark recently to better reflect the capabilities that are most useful to users.
@wongjingping thanks for the clarification. One more question: my test results varied from the blog report, though they were similar. For the reported results, is it an average or the best run? For example, did you run 5 times and average?
@cometyang Yes, a slight variation is expected, as we fixed a few smaller issues with the past questions/answers. We use deterministic generation (e.g. setting a seed) and just 1 run for all of our runners. While there might be slight differences from the latest reports, the results should be the same between consecutive runs. Please let us know if you notice otherwise.
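The determinism point can be illustrated with a toy sampler: seeding the RNG makes "generation" reproducible run after run. This is just an illustration of why consecutive runs match, not sql-eval's actual implementation.

```python
import random

def toy_generate(prompt: str, seed: int = 42, n_tokens: int = 5) -> list:
    """Toy 'generation': picks tokens pseudo-randomly from a fixed vocabulary.
    Seeding a dedicated RNG makes the output identical across runs."""
    vocab = ["SELECT", "name", "FROM", "users", "WHERE", "id", "=", "1"]
    rng = random.Random(seed)  # seeded, isolated RNG -> deterministic choices
    return [rng.choice(vocab) for _ in range(n_tokens)]

run_a = toy_generate("list users", seed=42)
run_b = toy_generate("list users", seed=42)
assert run_a == run_b  # same seed -> same output, run after run
```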
@wongjingping I tested sqlcoder-70b-alpha, and sql-eval reported 87% correctness, which is 0.5% better than gpt-4-0613. Though the number is not 93%, it is indeed an achievement. I'm just wondering whether this variance is normal or not. I use vLLM as the inference framework.
Hi @cometyang thanks for sharing the discrepancy there. Although the `instruct` category is a replacement for the `where` category, the questions are mostly similar, except that we now include instructions for these questions, which are added to the start of the prompt to help the model answer the question. However, the 70b model was not trained to follow instructions, and my guess is that it might be confused by them (i.e. your results are reasonable and expected). To confirm my hypothesis, can you check whether you see a drop in the `instruct` category (vs. other categories)? In addition, I was wondering if you were running the model quantized or in float16?
I ran the f16 version, and I see drops in the `date_functions`, `order_by`, and `instruct` categories.
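sql-eval writes per-question results to the output CSV, so per-category drops like the ones above can be checked by aggregating correctness by category. A quick sketch, assuming `query_category` and `correct` columns (column names are assumptions, not verified against the repo):

```python
import csv
from collections import defaultdict
from io import StringIO

def accuracy_by_category(csv_text: str) -> dict:
    """Aggregate the `correct` flag per `query_category` (column names assumed)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row in csv.DictReader(StringIO(csv_text)):
        cat = row["query_category"]
        totals[cat] += 1
        hits[cat] += int(row["correct"])
    return {cat: hits[cat] / totals[cat] for cat in totals}

sample = """query_category,correct
instruct,1
instruct,0
order_by,1
order_by,1
"""
print(accuracy_by_category(sample))  # {'instruct': 0.5, 'order_by': 1.0}
```

Comparing these per-category numbers between two runs makes it obvious which categories regressed.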
Hi there, thanks for reporting. Apologies for the delay on this. We finally got the chance to try and replicate this again.
The 70b model was trained on a slightly different prompt than the current `prompt.md` file in this directory. It was trained on the prompt below, which is what the model was originally tested with.
### Task
Generate a SQL query to answer the following question:
`{user_question}`
### Database Schema
The query will run on a database with the following schema:
{table_metadata_string}
### Answer
Given the database schema, here is the SQL query that answers `{user_question}`:
```
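For reference, the template above can be filled in with Python's `str.format` before generation; `build_prompt` is a hypothetical helper for illustration, not a function from sql-eval.

```python
# A sketch of how the template above gets filled in before generation.
FENCE = "`" * 3  # the prompt ends with an open code fence to cue SQL output

PROMPT_TEMPLATE = (
    "### Task\n"
    "Generate a SQL query to answer the following question:\n"
    "`{user_question}`\n"
    "### Database Schema\n"
    "The query will run on a database with the following schema:\n"
    "{table_metadata_string}\n"
    "### Answer\n"
    "Given the database schema, here is the SQL query that answers "
    "`{user_question}`:\n" + FENCE + "\n"
)

def build_prompt(user_question: str, table_metadata_string: str) -> str:
    return PROMPT_TEMPLATE.format(
        user_question=user_question,
        table_metadata_string=table_metadata_string,
    )

prompt = build_prompt(
    "How many users signed up last week?",
    "CREATE TABLE users (id int, created_at date);",
)
assert prompt.count("How many users signed up last week?") == 2  # question appears twice
```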
With vllm and this prompt, we get 92.5% accuracy (a little down from 93.5%, likely because of minor changes in sql-eval since then). You can use the command below to replicate this. You can likely get higher accuracy if you use transformers' `model.generate` directly.
```
python -W ignore main.py -db postgres -o "results/vllm.csv" -g vllm -f "/path/to/updated_prompt.md" -m defog/sqlcoder-70b-alpha
```
If you don't get these results, please do let us know what runner and device you're using! Would love to get to the bottom of the cause of the variance.
@rishsriv I tried your updated prompt. I also needed to change the eval code, since the SQL query now comes after "sql\n", and I got 86.5% accuracy. My test setup is an A100 80G, and I use the API runner:

```
python main.py \
  -db postgres \
  -o results.sql70b_old_template.csv \
  -g api \
  -b 1 \
  -f prompts/updated_prompt.md \
  --api_url "your ip address/generate" \
  -p 5
```
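The eval-code change mentioned above can be as small as stripping the fence-language line from the raw model output. A hypothetical sketch (the actual change isn't shown in this thread):

```python
def extract_sql(generated: str) -> str:
    """Strip a leading 'sql' fence-language line and any trailing code fence
    from raw model output, leaving the bare query. Hypothetical helper, not
    sql-eval's actual post-processing."""
    text = generated.strip()
    if text.lower().startswith("sql\n"):
        text = text[4:]  # drop the fence-language line the model emitted
    if text.endswith("```"):
        text = text[:-3]  # drop a trailing fence if the model closed the block
    return text.strip()

assert extract_sql("sql\nSELECT 1;\n```") == "SELECT 1;"
assert extract_sql("SELECT 2;") == "SELECT 2;"  # untouched when no fence is present
```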
Ah, the benchmarks are all done with 4 beams. If using the API runner, you should use `-b 4` to reproduce the benchmarks. `-b 1` uses greedy decoding, and the 70b-alpha model doesn't work nearly as well with that.
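The gap between `-b 1` and `-b 4` is just greedy vs. beam-search decoding. A toy illustration (not sql-eval's or vLLM's actual decoder): greedy commits to the locally best first token and can miss the globally most likely sequence, which a wider beam recovers.

```python
import math

# Toy next-token log-probabilities: greedy commits to the locally best first
# token ("the"), but the globally most likely two-token sequence starts with
# the less likely "a". Beam search with width >= 2 recovers it.
STEP_LOGPROBS = {
    (): {"the": math.log(0.6), "a": math.log(0.4)},
    ("the",): {"cat": math.log(0.5), "dog": math.log(0.5)},
    ("a",): {"cat": math.log(0.9), "dog": math.log(0.1)},
}

def decode(beam_width: int, steps: int = 2):
    beams = [((), 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in STEP_LOGPROBS[seq].items()
        ]
        # keep only the top `beam_width` partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

assert decode(beam_width=1)[0] == "the"          # greedy: best sequence has prob 0.30
assert decode(beam_width=4) == ("a", "cat")      # beam:   finds prob 0.36 sequence
```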
```
python main.py \
  -db postgres \
  -o results.sql70b_old_template.csv \
  -g api \
  -b 4 \
  -f prompts/updated_prompt.md \
  --api_url "your ip address/generate" \
  -p 5
```
After changing to 4 beams, the results improved to 88.5%.
Huh, that's strange. Unfortunately, there's not much we can do to help with that. When running the vllm runner on multiple 4xA100 80GB runpod instances with 4 beams (the default config for the vllm runner), we could confirm 92.5% accuracy. Hope you manage to figure it out!
```
python -W ignore main.py \
  -db postgres \
  -o "results/vllm.csv" \
  -g vllm \
  -f "/path/to/updated_prompt.md" \
  -m defog/sqlcoder-70b-alpha
```
Would you mind sharing your vLLM server command line? I also tried to run on 4xA100.
You can just run a Python process with vLLM using the command I included above. You do not have to create a vLLM server.
I run the model and the postgres database on different machines; that's the reason I created a vLLM server. sqlcoder-70b's performance (86.5%) is close to sqlcoder-7b-v2's (84.5%) at b=1 in my tests. I need to test on other types of databases and see the difference. If I understand correctly, postgres is the best-supported db type, but other dbs should also be okay, right?
It only shows the following query_categories: date_functions, group_by, instruct, order_by, ratio, table_join.
This differs from the categories listed in the blog at https://defog.ai/blog/open-sourcing-sqlcoder-70b/: date | group_by | order_by | ratio | join | where.
Does `instruct` map to `where`?