Closed wongjingping closed 1 month ago
Sample results when running the evals on gpt-4o
with and without CoT:
Normal, without CoT instructions:
$ python3 main.py \
-db postgres \
-q data/instruct_basic_postgres.csv data/instruct_advanced_postgres.csv data/questions_gen_postgres.csv \
-o results/openai_gpt4o_basic.csv results/openai_gpt4o_advanced.csv results/openai_gpt4o_v1.csv \
-g oa \
-f prompts/prompt_openai.md \
-m gpt-4o \
-c 0 \
-p 20
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_basic_postgres.csv
Correct so far: 39/40 (97.50%): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:14<00:00, 2.74it/s]
query_category num_rows mean_correct mean_error_db_exec
0 basic_group_order_limit 8 1.000 0.0
1 basic_join_date_group_order_limit 8 0.875 0.0
2 basic_join_distinct 8 1.000 0.0
3 basic_join_group_order_limit 8 1.000 0.0
4 basic_left_join 8 1.000 0.0
Average correct rate: 0.97
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_advanced_postgres.csv
Correct so far: 47/64 (73.44%): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:25<00:00, 2.54it/s]
query_category num_rows mean_correct mean_error_db_exec
0 instructions_cte_join 16 0.8125 0.000
1 instructions_cte_window 8 0.7500 0.000
2 instructions_date_join 16 0.6250 0.000
3 instructions_string_matching 8 1.0000 0.000
4 keywords_aggregate 8 0.6250 0.000
5 keywords_ratio 8 0.6250 0.125
Average correct rate: 0.73
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/questions_gen_postgres.csv
Correct so far: 110/126 (87.30%): 63%|█████████████████████████████████████████████████████▌ | 126/200 [00:17<00:07, 10.23it/s]/workspace/sql-eval/eval/eval.py:567: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_gen.fillna(-99999, inplace=True)
Correct so far: 174/197 (88.32%): 98%|███████████████████████████████████████████████████████████████████████████████████▋ | 197/200 [00:30<00:01, 2.78it/s]/workspace/sql-eval/eval/eval.py:568: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_gold.fillna(-99999, inplace=True)
Correct so far: 175/200 (87.50%): 100%|█████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:31<00:00, 6.38it/s]
query_category num_rows mean_correct mean_error_db_exec
0 date_functions 25 0.760000 0.000000
1 group_by 35 0.971429 0.000000
2 instruct 35 0.885714 0.000000
3 order_by 35 0.885714 0.000000
4 ratio 35 0.800000 0.028571
5 table_join 35 0.914286 0.000000
Average correct rate: 0.88
With Cot instructions:
$ python3 main.py \
-db postgres \
-q data/instruct_basic_postgres.csv data/instruct_advanced_postgres.csv data/questions_gen_postgres.csv \
-o results/openai_gpt4o_basic_cot.csv results/openai_gpt4o_advanced_cot.csv results/openai_gpt4o_v1_cot.csv \
-g oa \
-f prompts/prompt_openai.md \
-m gpt-4o \
-c 0 \
-p 20 \
--cot_table_alias
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_basic_postgres.csv
Correct so far: 38/40 (95.00%): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:18<00:00, 2.22it/s]
query_category num_rows mean_correct mean_error_db_exec
0 basic_group_order_limit 8 1.000 0.0
1 basic_join_date_group_order_limit 8 0.875 0.0
2 basic_join_distinct 8 1.000 0.0
3 basic_join_group_order_limit 8 0.875 0.0
4 basic_left_join 8 1.000 0.0
Average correct rate: 0.95
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/instruct_advanced_postgres.csv
Correct so far: 45/64 (70.31%): 100%|█████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:24<00:00, 2.61it/s]
query_category num_rows mean_correct mean_error_db_exec
0 instructions_cte_join 16 0.8125 0.000
1 instructions_cte_window 8 0.7500 0.000
2 instructions_date_join 16 0.5000 0.000
3 instructions_string_matching 8 1.0000 0.000
4 keywords_aggregate 8 0.6250 0.000
5 keywords_ratio 8 0.6250 0.125
Average correct rate: 0.70
Using prompt file prompts/prompt_openai.md
Preparing questions...
Using all question(s) from data/questions_gen_postgres.csv
Correct so far: 99/113 (87.61%): 56%|████████████████████████████████████████████████▏ | 112/200 [00:16<00:09, 8.97it/s]/workspace/sql-eval/eval/eval.py:567: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_gen.fillna(-99999, inplace=True)
Correct so far: 176/197 (89.34%): 98%|███████████████████████████████████████████████████████████████████████████████████▋ | 197/200 [00:30<00:00, 3.42it/s]/workspace/sql-eval/eval/eval.py:568: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_gold.fillna(-99999, inplace=True)
Correct so far: 178/200 (89.00%): 100%|█████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:32<00:00, 6.11it/s]
query_category num_rows mean_correct mean_error_db_exec
0 date_functions 25 0.800000 0.000000
1 group_by 35 1.000000 0.000000
2 instruct 35 0.914286 0.000000
3 order_by 35 0.885714 0.000000
4 ratio 35 0.800000 0.028571
5 table_join 35 0.914286 0.000000
Average correct rate: 0.89
The results are mixed for gpt-4o
, which isn't surprising since I didn't really tune the prompt for CoT.
However, the twist is that adding the CoT instructions didn't product the commented aliases in the output when I inspected what was generated.
cot_instructions
through each runnerrun_checkpoints_cot.sh
for running checkpoints with the cot prompt. We also parameterize both shell scripts to use different ports and cuda devices to allow parallel evaluations if the host has > 1 gpu.