defog-ai / sql-eval

Evaluate the accuracy of LLM generated outputs
Apache License 2.0
448 stars 47 forks source link

Add CoT #153

Closed wongjingping closed 1 month ago

wongjingping commented 1 month ago
wongjingping commented 1 month ago

Sample results when running the evals on gpt-4o with and without CoT:

Normal, without CoT instructions:

$ python3 main.py \      
  -db postgres \
  -q data/instruct_basic_postgres.csv data/instruct_advanced_postgres.csv data/questions_gen_postgres.csv \
  -o results/openai_gpt4o_basic.csv results/openai_gpt4o_advanced.csv results/openai_gpt4o_v1.csv \
  -g oa \
  -f prompts/prompt_openai.md \
  -m gpt-4o \
  -c 0 \
  -p 20
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_basic_postgres.csv
Correct so far: 39/40 (97.50%): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:14<00:00,  2.74it/s]
                      query_category  num_rows  mean_correct  mean_error_db_exec
0            basic_group_order_limit         8         1.000                 0.0
1  basic_join_date_group_order_limit         8         0.875                 0.0
2                basic_join_distinct         8         1.000                 0.0
3       basic_join_group_order_limit         8         1.000                 0.0
4                    basic_left_join         8         1.000                 0.0
Average correct rate: 0.97
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_advanced_postgres.csv
Correct so far: 47/64 (73.44%): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:25<00:00,  2.54it/s]
                 query_category  num_rows  mean_correct  mean_error_db_exec
0         instructions_cte_join        16        0.8125               0.000
1       instructions_cte_window         8        0.7500               0.000
2        instructions_date_join        16        0.6250               0.000
3  instructions_string_matching         8        1.0000               0.000
4            keywords_aggregate         8        0.6250               0.000
5                keywords_ratio         8        0.6250               0.125
Average correct rate: 0.73
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/questions_gen_postgres.csv
Correct so far: 110/126 (87.30%):  63%|█████████████████████████████████████████████████████▌                               | 126/200 [00:17<00:07, 10.23it/s]/workspace/sql-eval/eval/eval.py:567: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df_gen.fillna(-99999, inplace=True)
Correct so far: 174/197 (88.32%):  98%|███████████████████████████████████████████████████████████████████████████████████▋ | 197/200 [00:30<00:01,  2.78it/s]/workspace/sql-eval/eval/eval.py:568: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df_gold.fillna(-99999, inplace=True)
Correct so far: 175/200 (87.50%): 100%|█████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:31<00:00,  6.38it/s]
   query_category  num_rows  mean_correct  mean_error_db_exec
0  date_functions        25      0.760000            0.000000
1        group_by        35      0.971429            0.000000
2        instruct        35      0.885714            0.000000
3        order_by        35      0.885714            0.000000
4           ratio        35      0.800000            0.028571
5      table_join        35      0.914286            0.000000
Average correct rate: 0.88

With Cot instructions:

$ python3 main.py \
  -db postgres \
  -q data/instruct_basic_postgres.csv data/instruct_advanced_postgres.csv data/questions_gen_postgres.csv \
  -o results/openai_gpt4o_basic_cot.csv results/openai_gpt4o_advanced_cot.csv results/openai_gpt4o_v1_cot.csv \
  -g oa \
  -f prompts/prompt_openai.md \
  -m gpt-4o \
  -c 0 \
  -p 20 \
  --cot_table_alias
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_basic_postgres.csv
Correct so far: 38/40 (95.00%): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:18<00:00,  2.22it/s]
                      query_category  num_rows  mean_correct  mean_error_db_exec
0            basic_group_order_limit         8         1.000                 0.0
1  basic_join_date_group_order_limit         8         0.875                 0.0
2                basic_join_distinct         8         1.000                 0.0
3       basic_join_group_order_limit         8         0.875                 0.0
4                    basic_left_join         8         1.000                 0.0
Average correct rate: 0.95
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_advanced_postgres.csv
Correct so far: 45/64 (70.31%): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:24<00:00,  2.61it/s]
                 query_category  num_rows  mean_correct  mean_error_db_exec
0         instructions_cte_join        16        0.8125               0.000
1       instructions_cte_window         8        0.7500               0.000
2        instructions_date_join        16        0.5000               0.000
3  instructions_string_matching         8        1.0000               0.000
4            keywords_aggregate         8        0.6250               0.000
5                keywords_ratio         8        0.6250               0.125
Average correct rate: 0.70
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/questions_gen_postgres.csv
Correct so far: 99/113 (87.61%):  56%|████████████████████████████████████████████████▏                                     | 112/200 [00:16<00:09,  8.97it/s]/workspace/sql-eval/eval/eval.py:567: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df_gen.fillna(-99999, inplace=True)
Correct so far: 176/197 (89.34%):  98%|███████████████████████████████████████████████████████████████████████████████████▋ | 197/200 [00:30<00:00,  3.42it/s]/workspace/sql-eval/eval/eval.py:568: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df_gold.fillna(-99999, inplace=True)
Correct so far: 178/200 (89.00%): 100%|█████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:32<00:00,  6.11it/s]
   query_category  num_rows  mean_correct  mean_error_db_exec
0  date_functions        25      0.800000            0.000000
1        group_by        35      1.000000            0.000000
2        instruct        35      0.914286            0.000000
3        order_by        35      0.885714            0.000000
4           ratio        35      0.800000            0.028571
5      table_join        35      0.914286            0.000000
Average correct rate: 0.89

The results are mixed for gpt-4o, which isn't surprising since I didn't really tune the prompt for CoT. However, the twist is that adding the CoT instructions didn't product the commented aliases in the output when I inspected what was generated.