MohammadrezaPourreza / DTS-SQL

This repository contains all the code for the DTS-SQL paper
Apache License 2.0

Evaluation code needed #4

Open riddiculous opened 6 months ago

riddiculous commented 6 months ago

Dear author, I evaluated your results (results/deepseek_spider_validation_set/Predicted.txt) with my own evaluation code (execution accuracy), but the result I get (82.7) does not match the 85.5 reported in the paper. I wonder if there is an error on my side. Could you please publish your evaluation code?

My predicted execution accuracy:

|           | easy  | medium | hard  | extra | all   |
|-----------|-------|--------|-------|-------|-------|
| count     | 248   | 446    | 174   | 166   | 1034  |
| execution | 0.927 | 0.901  | 0.741 | 0.566 | 0.827 |
Thank you!

MohammadrezaPourreza commented 6 months ago

@riddiculous Hi, thank you so much for your interest in this work. For the evaluation we used the official Spider evaluation script from here. In addition, we included a screenshot of the evaluation performance generated by the script. We also did not use these flags for evaluation: --plug_value, --keep_distinct, --progress_bar_for_each_datapoint. Thanks
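For concreteness, a minimal sketch of how such an evaluation run can be invoked, assuming the test-suite-sql-eval repository is checked out locally; all file paths here are placeholder assumptions, not the paper's exact paths:

```python
# Hedged sketch: run the official Spider test-suite evaluation script via
# subprocess, with the optional flags mentioned above deliberately omitted.
# Paths (Gold.txt, Predicted.txt, ./database, tables.json) are placeholders.
import subprocess

subprocess.run(
    [
        "python", "evaluation.py",
        "--gold", "Gold.txt",       # gold SQL queries, one per line with db_id
        "--pred", "Predicted.txt",  # model predictions, aligned with the gold file
        "--db", "./database",       # directory containing the Spider SQLite databases
        "--etype", "exec",          # report execution accuracy
        "--table", "tables.json",   # Spider schema definitions
    ],
    check=True,
)
```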

starrysky9959 commented 6 months ago

> Dear author, I evaluated your results (results/deepseek_spider_validation_set/Predicted.txt) with my own evaluation code (execution accuracy), but the result I get (82.7) does not match the 85.5 reported in the paper. I wonder if there is an error on my side. Could you please publish your evaluation code?
>
> My predicted execution accuracy:
>
> |           | easy  | medium | hard  | extra | all   |
> |-----------|-------|--------|-------|-------|-------|
> | count     | 248   | 446    | 174   | 166   | 1034  |
> | execution | 0.927 | 0.901  | 0.741 | 0.566 | 0.827 |
>
> Thank you!

[screenshot of evaluation output]

I got the same result with https://github.com/taoyds/test-suite-sql-eval.

riddiculous commented 5 months ago

> @riddiculous Hi, thank you so much for your interest in this work. For the evaluation we used the official Spider evaluation script from here. In addition, we included a screenshot of the evaluation performance generated by the script. We also did not use these flags for evaluation: --plug_value, --keep_distinct, --progress_bar_for_each_datapoint. Thanks

Hi, using the provided script, I still got the same result, just as @starrysky9959 did.

MohammadrezaPourreza commented 5 months ago

@starrysky9959 @riddiculous Thank you for your feedback. We will update the paper and adjust the execution accuracy for the Spider development set.

cometyang commented 5 months ago

@MohammadrezaPourreza, I am having difficulty reproducing the results given in the paper. Could you please give a more detailed description of each step in the README? Thanks in advance.

MohammadrezaPourreza commented 5 months ago

@cometyang Hi, thank you so much for your interest in our work. I have uploaded the submission file of the DTS-SQL paper for the BIRD benchmark, which is easy to use: you just need to install the requirements and run this script. Please make sure to change the dataset paths by updating these two global variables: `BASE_DATASET_DIR = "dev.json"` and `BASE_DABATASES_DIR = "./dev_databases/"`.
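For reference, a minimal sketch of that edit at the top of the script, with the variable names copied verbatim from the comment (including the `BASE_DABATASES_DIR` spelling); the values are the BIRD dev defaults and should be adjusted to your local layout:

```python
# Global paths used by the BIRD submission script; adjust to your setup.
BASE_DATASET_DIR = "dev.json"            # BIRD dev questions file
BASE_DABATASES_DIR = "./dev_databases/"  # directory of per-database SQLite folders
```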

cometyang commented 5 months ago

@MohammadrezaPourreza thanks for providing the evaluation code that connects the two models. I am currently evaluating on Spider-Syn. Table 6 reports the DeepSeek 7B upper bound as 85.5 / 78.1, but I only get 79.8 and 72.5, so I wonder whether I did something wrong during training. For DeepSeek 7B full fine-tuning, I got a similar result of 69.1 and 56.1, which is very close to the 70.4 / 56.6 reported in Table 6. If I understand the paper correctly, if I use filtered_finuting_dataset.csv to fine-tune the DeepSeek model and predict against the validation dataset, I should get the upper-bound results on the Spider-Syn dataset, am I right?

MohammadrezaPourreza commented 5 months ago

Thank you very much, @cometyang, for your interest in our research! I'm curious to know whether you used neftune_noise_alpha, quantization, or perhaps LoRA adapters in your experiments. The findings presented in our paper are based on full fine-tuning without quantization or LoRA adapters. Additionally, it's worth noting that in our analysis, neftune_noise_alpha seemed to detrimentally affect performance.
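A minimal sketch of a training configuration consistent with this description, assuming a standard Hugging Face transformers setup; the hyper-parameter values are illustrative assumptions, not the paper's exact settings:

```python
# Hedged sketch: full fine-tuning configuration per the author's description --
# no 4/8-bit quantization, no LoRA adapters, neftune_noise_alpha left disabled.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./dts-sql-finetune",
    num_train_epochs=1,             # assumption; not stated in the thread
    per_device_train_batch_size=2,  # assumption
    bf16=True,                      # standard mixed precision, not quantization
    neftune_noise_alpha=None,       # explicitly disabled, per the author's observation
)
# No PEFT/LoRA config is attached anywhere, so all model weights are updated.
```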

cometyang commented 5 months ago

Hi @MohammadrezaPourreza, thanks for your reply. The reason the DTS-SQL work looks interesting is that it is currently the highest-ranked 7B model on the BIRD leaderboard (https://bird-bench.github.io/), so I want to dive into the work, understand the gap between the ideal situation and the trained model, and perhaps find ways to improve it.

For the purpose of reproducing the results, I tried to follow the settings in your notebook exactly. Are you suggesting that the code used for the paper is different from the shared notebook? If so, could you please also share the full fine-tuning code (I can change to fp16 and try other hyper-parameters)? I would appreciate it if you could share the settings needed to reproduce the work, so that I can reduce CO2 emissions and have less frustration. :-)

Thanks again for sharing this research work. I find it interesting that using two models can yield this performance improvement; it is a bit like an agent framework.

I modified the code you shared for BIRD and adapted it to Spider-Syn. Compared to the numbers reported in the paper, this is what I obtained (below). As you can see, there are noticeable differences, so I wonder where I made a mistake.

| DeepSeek              | Paper (Tab. 6) | My experiment | Diff |
|-----------------------|----------------|---------------|------|
| Full fine-tuning (EX) | 70.4           | 69.1          | -1.3 |
| Full fine-tuning (EM) | 56.6           | 56.1          | -0.5 |
| DTS-SQL (EX)          | 76.2           | 70.2          | -6.0 |
| DTS-SQL (EM)          | 68.9           | 62.0          | -6.9 |
| Upper bound (EX)      | 85.5           | 79.8          | -5.7 |
| Upper bound (EM)      | 78.1           | 72.5          | -5.6 |

Evaluation command: `python evaluation.py --gold Gold.txt --pred Pred.txt --db $database_folder$ --etype all --table $dataset$/tables.json`

kanseaveg commented 4 months ago

@cometyang May I ask whether the results in Table 3 of the paper match the results you reproduced?