defog-ai / sql-eval

Evaluate the accuracy of LLM generated outputs
Apache License 2.0

Updated sql-eval questions #121

Closed: wongjingping closed this 2 months ago

wongjingping commented 2 months ago

- Updated questions and answers based on the feedback in here.
- Added the missing newline at the end of prompts/prompt.md.
- Reverted the join columns header to match the training data.

Tested on gpt-3.5-turbo to make sure it runs fine:

$ python3 main.py \
  -db postgres \
  -q data/instruct_basic_postgres.csv data/instruct_advanced_postgres.csv data/questions_gen_postgres.csv \
  -o results/openai_gpt3.5_turbo_basic.csv results/openai_gpt3.5_turbo_advanced.csv results/openai_gpt3.5_turbo_v1.csv \
  -g oa \
  -f prompts/prompt_openai.md \
  -m gpt-3.5-turbo-0125 \
  -c 0 \
  -p 20
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_basic_postgres.csv
Correct so far: 31/40 (77.50%): 100%|████████████████████████████████████████████████████████| 40/40 [00:03<00:00, 11.49it/s]
                      query_category  num_rows  mean_correct  mean_error_db_exec
0            basic_group_order_limit         8         0.750               0.250
1  basic_join_date_group_order_limit         8         0.625               0.250
2                basic_join_distinct         8         0.750               0.000
3       basic_join_group_order_limit         8         0.875               0.125
4                    basic_left_join         8         0.875               0.000
Average correct rate: 0.78
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_advanced_postgres.csv
Correct so far: 35/64 (54.69%): 100%|████████████████████████████████████████████████████████| 64/64 [00:07<00:00,  8.68it/s]
                 query_category  num_rows  mean_correct  mean_error_db_exec
0         instructions_cte_join        16         0.750               0.125
1       instructions_cte_window         8         0.375               0.250
2        instructions_date_join        16         0.375               0.125
3  instructions_string_matching         8         0.875               0.000
4            keywords_aggregate         8         0.625               0.125
5                keywords_ratio         8         0.250               0.125
Average correct rate: 0.55
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/questions_gen_postgres.csv
Correct so far: 148/200 (74.00%): 100%|████████████████████████████████████████████████████| 200/200 [00:11<00:00, 16.86it/s]
   query_category  num_rows  mean_correct  mean_error_db_exec
0  date_functions        25      0.600000            0.120000
1        group_by        35      0.800000            0.028571
2        instruct        35      0.800000            0.057143
3        order_by        35      0.942857            0.000000
4           ratio        35      0.514286            0.171429
5      table_join        35      0.742857            0.085714
Average correct rate: 0.74
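
As a sanity check on the numbers above, each "Average correct rate" should be the mean of `mean_correct` weighted by `num_rows`, which also matches the "Correct so far" counters. The snippet below is a minimal sketch of that arithmetic, not code from the repo: the rows are copied by hand from the three tables, and `RESULTS` and `summarize` are hypothetical names used only for illustration. The combined figure across all three files is derived here and is not something the run itself reports.

```python
# Rows copied by hand from the three tables in the log above:
# (query_category, num_rows, mean_correct)
RESULTS = {
    "data/instruct_basic_postgres.csv": [
        ("basic_group_order_limit", 8, 0.750),
        ("basic_join_date_group_order_limit", 8, 0.625),
        ("basic_join_distinct", 8, 0.750),
        ("basic_join_group_order_limit", 8, 0.875),
        ("basic_left_join", 8, 0.875),
    ],
    "data/instruct_advanced_postgres.csv": [
        ("instructions_cte_join", 16, 0.750),
        ("instructions_cte_window", 8, 0.375),
        ("instructions_date_join", 16, 0.375),
        ("instructions_string_matching", 8, 0.875),
        ("keywords_aggregate", 8, 0.625),
        ("keywords_ratio", 8, 0.250),
    ],
    "data/questions_gen_postgres.csv": [
        ("date_functions", 25, 0.600000),
        ("group_by", 35, 0.800000),
        ("instruct", 35, 0.800000),
        ("order_by", 35, 0.942857),
        ("ratio", 35, 0.514286),
        ("table_join", 35, 0.742857),
    ],
}


def summarize(rows):
    """Return (correct, total, rate) for one table, weighting categories by num_rows."""
    total = sum(num_rows for _, num_rows, _ in rows)
    # num_rows * mean_correct recovers the per-category correct count;
    # round() absorbs the 6-decimal truncation in the printed means.
    correct = sum(round(num_rows * mean) for _, num_rows, mean in rows)
    return correct, total, correct / total


summaries = {name: summarize(rows) for name, rows in RESULTS.items()}
for name, (correct, total, rate) in summaries.items():
    print(f"{name}: {correct}/{total} ({rate:.2%})")

# Combined figure across the three files (derived, not printed by the run).
overall_correct = sum(correct for correct, _, _ in summaries.values())
overall_total = sum(total for _, total, _ in summaries.values())
print(f"overall: {overall_correct}/{overall_total} ({overall_correct / overall_total:.2%})")
```

The per-file lines reproduce 31/40, 35/64, and 148/200 from the progress bars. Because the advanced set is both smaller and harder, the combined rate sits closer to the 200-question file than a plain average of the three rates would suggest.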