defog-ai / sql-eval

Evaluate the accuracy of LLM generated outputs
Apache License 2.0

Enable multiple question files #108

Closed by wongjingping 3 months ago

wongjingping commented 3 months ago

Tested on the vllm runner:

Preparing /models/combined/sqlcoder_7b_bf16_r128_ds_002_750_b20/checkpoint-700
2024-04-17 09:37:09,387 INFO worker.py:1724 -- Started a local Ray instance.
INFO 04-17 09:37:10 llm_engine.py:72]
...
Using prompt file prompts/prompt.md
Preparing questions...
Using all question(s) from data/instruct_basic_postgres.csv
Prepared 40 question(s) from data/instruct_basic_postgres.csv
Generating completions
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:47<00:00,  1.19s/it]
Time taken: 47.6s
Correct so far: 30/40 (75.00%): 100%|█████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 53.59it/s]
                                   exact_match  correct
query_category                                         
basic_group_order_limit                  0.875    0.875
basic_join_date_group_order_limit        0.625    0.625
basic_join_distinct                      0.750    0.750
basic_join_group_order_limit             0.500    0.500
basic_left_join                          1.000    1.000
Average tokens generated: 65.7
Saved results to results/sqlcoder_7b_bf16_r128_ds_002_750_b20_c700_basic.csv
Using prompt file prompts/prompt.md
Preparing questions...
Using all question(s) from data/instruct_advanced_postgres.csv
Prepared 64 question(s) from data/instruct_advanced_postgres.csv
Generating completions
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [01:53<00:00,  1.77s/it]
Time taken: 113.7s
Correct so far: 33/64 (51.56%): 100%|█████████████████████████████████████████████████████████████████████████████████| 64/64 [00:01<00:00, 51.08it/s]
                              exact_match  correct
query_category                                    
instructions_cte_join               0.375    0.375
instructions_cte_window             0.250    0.250
instructions_date_join              0.625    0.625
instructions_string_matching        0.875    0.875
keywords_aggregate                  0.625    0.625
keywords_ratio                      0.375    0.375
Average tokens generated: 98.0
Saved results to results/sqlcoder_7b_bf16_r128_ds_002_750_b20_c700_advanced.csv
Using prompt file prompts/prompt.md
Preparing questions...
Using all question(s) from data/questions_gen_postgres.csv
Prepared 200 question(s) from data/questions_gen_postgres.csv
Generating completions
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [03:47<00:00,  1.14s/it]
Time taken: 228.2s
Correct so far: 165/200 (82.50%): 100%|██████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:04<00:00, 43.14it/s]
                exact_match   correct
query_category                       
date_functions     0.640000  0.640000
group_by           0.857143  0.857143
instruct           0.857143  0.857143
order_by           0.857143  0.971429
ratio              0.714286  0.828571
table_join         0.685714  0.742857
Average tokens generated: 50.7
Saved results to results/sqlcoder_7b_bf16_r128_ds_002_750_b20_c700_v1.csv
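
The run above works through three question files in turn and saves each to its own results file. Below is a minimal sketch of how such pairing could be parsed and validated, assuming `-q` and `-o` each accept one or more paths (as in the openai command further down). The long option names and the helpers `build_parser` and `pair_files` are hypothetical illustrations, not sql-eval's actual implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: -q and -o each take one or more file paths."""
    parser = argparse.ArgumentParser()
    # The long option names below are assumptions made for this sketch.
    parser.add_argument("-q", "--questions_file", nargs="+", required=True)
    parser.add_argument("-o", "--output_file", nargs="+", required=True)
    return parser


def pair_files(questions, outputs):
    """Pair each question file with the output file at the same position."""
    if len(questions) != len(outputs):
        raise ValueError(
            f"Got {len(questions)} question file(s) but {len(outputs)} output file(s)"
        )
    return list(zip(questions, outputs))


# Usage: parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(
    [
        "-q", "data/instruct_basic_postgres.csv", "data/instruct_advanced_postgres.csv",
        "-o", "results/basic.csv", "results/advanced.csv",
    ]
)
for questions_file, output_file in pair_files(args.questions_file, args.output_file):
    print(f"{questions_file} -> {output_file}")
```

Failing early on a length mismatch avoids silently dropping a question file, which is what a bare `zip` would do.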

Tested on the openai runner:

$ python3 main.py \
  -db postgres \
  -q data/instruct_basic_postgres.csv data/instruct_advanced_postgres.csv \
  -o results/openai_gpt4_turbo_basic.csv results/openai_gpt4_turbo_advanced.csv \
  -g oa \
  -f prompts/prompt_openai.md \
  -m gpt-4-0125-preview \
  -c 0 \
  -p 20
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_basic_postgres.csv
Correct so far: 38/40 (95.00%): 100%|█████████████████████████████████████████████████████████████████████████████████| 40/40 [00:26<00:00,  1.53it/s]
                      query_category  num_rows  mean_correct  mean_error_db_exec
0            basic_group_order_limit         8          1.00               0.000
1  basic_join_date_group_order_limit         8          1.00               0.000
2                basic_join_distinct         8          1.00               0.000
3       basic_join_group_order_limit         8          0.75               0.125
4                    basic_left_join         8          1.00               0.000
Average correct rate: 0.95
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_advanced_postgres.csv
Correct so far: 44/64 (68.75%): 100%|█████████████████████████████████████████████████████████████████████████████████| 64/64 [00:39<00:00,  1.63it/s]
                 query_category  num_rows  mean_correct  mean_error_db_exec
0         instructions_cte_join        16        0.7500               0.125
1       instructions_cte_window         8        0.3750               0.125
2        instructions_date_join        16        0.6875               0.000
3  instructions_string_matching         8        0.8750               0.000
4            keywords_aggregate         8        1.0000               0.000
5                keywords_ratio         8        0.3750               0.125
Average correct rate: 0.69
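
As a sanity check on the last table, the overall rate can be recovered from the per-category rows by weighting each `mean_correct` by its `num_rows`. The sketch below hard-codes the values from the table above. It is only an arithmetic illustration, not the code the runner uses to produce the summary line.

```python
# (query_category, num_rows, mean_correct) copied from the advanced openai table.
rows = [
    ("instructions_cte_join", 16, 0.7500),
    ("instructions_cte_window", 8, 0.3750),
    ("instructions_date_join", 16, 0.6875),
    ("instructions_string_matching", 8, 0.8750),
    ("keywords_aggregate", 8, 1.0000),
    ("keywords_ratio", 8, 0.3750),
]

total = sum(n for _, n, _ in rows)            # total number of questions
correct = sum(n * m for _, n, m in rows)      # weighted count of correct answers
overall = correct / total

print(f"{correct:.0f}/{total} = {overall:.4f}")  # prints "44/64 = 0.6875"
```

The weighted result matches the 44/64 (68.75%) shown in the progress line and rounds to the reported 0.69. An unweighted mean of the six category rates would differ, because the categories do not all have the same number of questions.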