FlyingFeather / DEA-SQL

[ACL Findings 2024] Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm
Apache License 2.0

Regarding the issue of score mismatch in `MAC-SQL` #1

Closed kanseaveg closed 6 months ago

kanseaveg commented 6 months ago

While reading the paper, I found that MAC-SQL reports a score of 86.75 on the Spider dev set, but in your paper I only see a score of 78.6. Is this a replication result? Why is there such a big difference?

FlyingFeather commented 6 months ago

We used the code provided by the authors of MAC-SQL and re-ran it because of differences in the model (API key) version. There are only two differences in our setup compared to the authors' reported results: first, our model version is gpt-4-0613 (as stated in the paper), and second, our evaluation uses Spider's "Execution with Values" setting, while the MAC-SQL authors may have used "Execution without Values". Since the latter ignores the impact of specific values, its results are biased upward. If we add the `--plug_value` parameter, our result on Spider-dev is 88.2:

|                    | easy  | medium | hard  | extra | all   |
|--------------------|-------|--------|-------|-------|-------|
| count              | 248   | 446    | 174   | 166   | 1034  |
| execution accuracy | 0.927 | 0.915  | 0.879 | 0.729 | 0.882 |
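
For intuition, here is a minimal, self-contained toy example of why "Execution with Values" is stricter than evaluation with `--plug_value`. The schema and data are assumed (not Spider's), and a naive string replacement stands in for the evaluator's actual value plugging:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 25), ("Bob", 32), ("Cho", 41)])

gold = "SELECT name FROM singer WHERE age > 30"
pred = "SELECT name FROM singer WHERE age > 20"  # right structure, wrong value

def rows(sql):
    # Execute and return the sorted result set for order-insensitive comparison.
    return sorted(conn.execute(sql).fetchall())

# Execution with Values: the literal matters, so this prediction counts as wrong.
print(rows(pred) == rows(gold))  # False

# With --plug_value, the evaluator substitutes the gold query's values into the
# prediction before executing; a naive string replace stands in for that here.
plugged = pred.replace("20", "30")
print(rows(plugged) == rows(gold))  # True
```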

In the MAC-SQL GitHub issues, other users have also reported significant differences between the paper's reported results and their own re-runs. The relevant link: https://github.com/wbbeyourself/MAC-SQL/issues/7

Below is the evaluation command from their code repository (note that it includes `--plug_value`; dropping that flag gives the stricter Execution-with-Values setting we report):


```bash
python ./evaluation/evaluation_spider.py \
    --gold "./data/spider/dev_gold.sql" \
    --db "./data/spider/database" \
    --table "./data/spider/tables.json" \
    --pred "./outputs/spider/pred_dev.sql" \
    --etype "all" \
    --plug_value \
    --keep_distinct \
    --progress_bar_for_each_datapoint
```

kanseaveg commented 6 months ago

Thank you very much for your answer. I indeed found in the paper that they used the gpt4-32k version to get their scores, and that they used the `--plug_value` option, i.e., without predicting the corresponding values.

Additionally, I would like to ask: most of the code and methods in DEA-SQL seem to come from DIN-SQL, and the self-consistency looks like an improved version borrowed from C3SQL's self-consistency. Apart from the final active-learning step and the addition of some hint bias, I don't see any particularly outstanding highlights. Could you tell me where you think your paper differs from or improves on DIN-SQL?

FlyingFeather commented 6 months ago

Correcting a small mistake: the fourth module is self-correction, not self-consistency, so it is not related to C3. Many module names are the same, but their implementations and approaches are not. The method section of the paper describes the differences between our modules and DIN-SQL's.

1) Information determination takes a different approach from DIN-SQL and other schemes. The main focus is to reduce the amount of information and concentrate the model's attention. We adopt a two-stage approach, which is why our results exceed the other baselines even on the more complex Spider-Realistic dataset; the second stage, word-selection exploration, uses a zero-shot approach.
2) Our hints are completely different from C3's hints: we provide specific hints and linkages for the different question classifications. See the relevant sections of the paper and Figure 2 for details.
3) SQL generation is a common module name, but ours did not originate from DIN-SQL; we propose a new prompt structure in this module.
4) Self-correction builds on DIN-SQL, but we found that DIN-SQL's version does not start from concrete problems. Our starting point is that the common error types of an LLM should be similar, so we provide correction suggestions for specific error types, whereas DIN-SQL applies a generic self-correction.
5) The active-learning module is, to our knowledge, new to the text-to-SQL field.
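
To make the pipeline concrete, here is a rough, self-contained skeleton of how the five stages chain together. Every function name and stub body below is invented for illustration and is not DEA-SQL's actual API; the real stages are LLM-prompted as described in the paper.

```python
# Hypothetical skeleton of the five-stage DEA-SQL workflow; all names and
# stub bodies are invented for illustration, not the project's real code.

def filter_schema(question, schema):
    # 1) Information determination, stage one: keep only tables that look
    # relevant to the question (the real system prompts an LLM; a second,
    # zero-shot stage would further select words/columns).
    return {t: cols for t, cols in schema.items() if t in question.lower()}

def classify_question(question):
    # 2) Question classification; a toy heuristic standing in for the
    # LLM-based classifier.
    return "nested" if " not " in question.lower() else "simple"

HINTS = {  # category-specific hints attached to the prompt
    "simple": "A single SELECT should suffice.",
    "nested": "Consider a subquery with NOT IN or EXCEPT.",
}

def generate_sql(question, schema, hint):
    # 3) SQL generation with the paper's prompt structure (stubbed).
    return f"SELECT * /* from prompt({question!r}, hint={hint!r}) */"

def self_correct(sql):
    # 4) Targeted self-correction for known LLM error types (stubbed).
    return sql

def dea_sql_workflow(question, schema):
    schema = filter_schema(question, schema)    # 1) information determination
    hint = HINTS[classify_question(question)]   # 2) classification + hints
    sql = generate_sql(question, schema, hint)  # 3) SQL generation
    sql = self_correct(sql)                     # 4) self-correction
    # 5) Active learning would log hard cases here to refine future prompts.
    return sql

schema = {"singer": ["name", "age"], "concert": ["year", "venue"]}
print(dea_sql_workflow("List singers not in any concert", schema))
```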

Quoting our paper content:

> Our contributions can be summarized as: 1) Propose a workflow paradigm solution to boost the attention of LLMs for complex problems as an example for text-to-SQL tasks; 2) Design a two-stage information filtering module to curtail irrelevant information to enhance the attention of LLMs, while adapting realistic questions with different questioning styles, which performs better on datasets that are closer to realistic questioning styles; 3) Propose a new prompt structure for text-to-SQL tasks. Categorize the problems and use different prompt patterns for different types of problems, presenting the key information to the model in a more explicit way to better improve the performance of the model; 4) The integration of LLMs for self-correction and active learning further improves the model.

Thank you for your interest in our work. We recommend that you take a little time to read our paper, as most of the content is explained and summarized in the text.