MohammadrezaPourreza / Few-shot-NL2SQL-with-prompting

MIT License
305 stars · 61 forks

About the Exec Acc in your paper #7

Open BeachWang opened 1 year ago

BeachWang commented 1 year ago

I see that Liu et al. report an Exec Acc of 70.1 in (Liu et al., 2023a), but your paper reports 60.1. Is that a mistake? Did you use the same evaluation code for Exec?

BeachWang commented 1 year ago

Besides, I am confused that DIN-SQL has similar Exact Match accuracies in Table 2 and Table 3 but two significantly different Exec accuracies.

MohammadrezaPourreza commented 1 year ago

Thank you so much for pointing these out. First, the exec acc metric we use to evaluate our model is the official one published here: https://github.com/taoyds/test-suite-sql-eval. This metric, called "Exec acc", actually computes test-suite accuracy, as stated in the repo itself ("This repo contains test suite evaluation metric"). Thus we compared our method with (Liu et al., 2023a) in terms of test-suite accuracy, and their reported test-suite accuracy is 60.1. Second, Table 2 contains the results of our method on the Spider test set, and Table 3 has the results on the Spider dev set.
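To make the distinction concrete for readers following along: plain execution accuracy only checks whether the predicted and gold queries return the same result on the single provided database, while the test-suite metric runs them against many generated database variants. A minimal sketch of the plain version over SQLite (my own illustration with a hypothetical toy schema, not the official script):

```python
import sqlite3

def execution_match(gold_sql, pred_sql, conn):
    """Return True if both queries yield the same multiset of rows."""
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    return sorted(gold_rows) == sorted(pred_rows)

# Tiny in-memory demo database (hypothetical schema, illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 30), ("Bob", 25)])

# Semantically equivalent queries match even when written differently.
print(execution_match("SELECT name FROM singer WHERE age > 26",
                      "SELECT name FROM singer WHERE age >= 27", conn))
```

The test-suite metric strengthens this check by repeating it on many databases with perturbed contents, which catches queries that coincidentally return the right rows on one database.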

BeachWang commented 1 year ago

I used the official metric to evaluate the dev-set results you publish in the GPT4_results file, and the Exec acc results are 85.1 for DIN-SQL and 80.1 for few-shot. Maybe you used different metrics in Table 2 and Table 3, I guess?

MohammadrezaPourreza commented 1 year ago

That's interesting; maybe there is a problem with the script we are using. Thank you so much for letting us know.

amity871028 commented 1 year ago

I also got a different score on GPT4_results. @MohammadrezaPourreza Could I know what your script is? I am using https://github.com/taoyds/test-suite-sql-eval and following its steps. When I run:

python3 evaluation.py --gold ./my_test/gold_example.txt --pred ./my_test/din_sql_pred_sql.txt --db ./database/ --etype exec --plug_value

I get: (screenshot)

And when I run:

python3 evaluation.py --gold ./my_test/gold_example.txt --pred ./my_test/din_sql_pred_sql.txt --db ./database/ --etype exec

I get: (screenshot)

Both 0.863 and 0.828 differ from your paper's results. I'm curious which part I'm running wrongly. Thanks!

MohammadrezaPourreza commented 1 year ago

It's interesting to me as well; many people told me they got different results on the dev set, and even among those the results were not consistent. We are trying to figure out where the problem is.

shuaichenchang commented 1 year ago

Thank you for your great work @MohammadrezaPourreza. I got the same number of 82.8 as @amity871028. I am using the EX accuracy obtained from https://github.com/taoyds/test-suite-sql-eval, which I believe is also the evaluation script used for Spider-test. I am guessing that you were using https://github.com/taoyds/spider/blob/master/evaluation.py for EX accuracy, which always produces a number a bit lower than the one from the test-suite. Not sure if I am right, so ignore this if my guess is wrong.

amity871028 commented 1 year ago

> It's interesting to me as well; many people told me they got different results on the dev set, and even among those the results were not consistent. We are trying to figure out where the problem is.

Thank you for your reply! I will wait for your results.

ShiXiangXiang123 commented 1 year ago

> It's interesting to me as well; many people told me they got different results on the dev set, and even among those the results were not consistent. We are trying to figure out where the problem is.
>
> Thank you for your reply! I will wait for your results.

Could you help me take a look at my problem? (screenshot) After running, it just stays like this and I can't type anything.

linxin6 commented 1 year ago

It's probably a network issue.

ShiXiangXiang123 commented 1 year ago

> It's probably a network issue.

I'm using a VPN, but it still doesn't work. Why?

linxin6 commented 1 year ago

Did you turn on global mode? Or you could try a proxy with a domestic (mainland China) relay.

ShiXiangXiang123 commented 1 year ago

> Did you turn on global mode? Or you could try a proxy with a domestic (mainland China) relay.

Yes, global mode is on.

ShiXiangXiang123 commented 1 year ago

> Thank you for your great work @MohammadrezaPourreza. I got the same number of 82.8 as @amity871028. I am using the EX accuracy obtained from https://github.com/taoyds/test-suite-sql-eval. I think this is also used as the evaluation script for Spider-test. I am guessing that you were using https://github.com/taoyds/spider/blob/master/evaluation.py for EX accuracy, which always generates a number a bit lower than that from the test-suite. Not sure if I am right, so ignore it if my guess is wrong.

Could you add me on WeChat and help me look at my problem? 15523313206. Much appreciated.

BeachWang commented 1 year ago

DIN-SQL uses GPT-4. Do you have a GPT-4 API key?


arian-askari commented 4 months ago

@MohammadrezaPourreza I also got different results when I evaluated https://github.com/MohammadrezaPourreza/Few-shot-NL2SQL-with-prompting/blob/main/GPT4_results/DIN-SQL.csv! Is there any update on this issue? (screenshot)

This is how I formatted the files for evaluation:

din_sql_gold_evalformat.csv din_sql_prediction_evalformat.csv

My command:

test-suite-sql-eval-master\evaluation.py --gold din_sql_gold_evalformat.csv  --pred din_sql_prediction_evalformat.csv --etype exec --db .\database
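For anyone else preparing files for this command: as far as I can tell from the test-suite-sql-eval README, the gold file should have one `SQL<TAB>db_id` pair per line and the prediction file one SQL query per line, in the same order. A quick sketch of producing those files from parallel lists (the function name, example values, and file names are my own, for illustration):

```python
def write_eval_files(examples, gold_path, pred_path):
    """examples: list of (gold_sql, pred_sql, db_id) triples."""
    with open(gold_path, "w") as g, open(pred_path, "w") as p:
        for gold_sql, pred_sql, db_id in examples:
            g.write(f"{gold_sql}\t{db_id}\n")  # gold line: query, tab, db id
            p.write(pred_sql + "\n")           # pred line: query only

examples = [("SELECT count(*) FROM singer",
             "SELECT count(*) FROM singer",
             "concert_singer")]
write_eval_files(examples, "gold_example.txt", "pred_example.txt")
```

Getting this ordering or the tab separator wrong silently misaligns gold and predicted queries, which is one easy way to end up with a different score than the paper reports.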