AlibabaResearch / DAMO-ConvAI

DAMO-ConvAI: The official repository which contains the codebase for Alibaba DAMO Conversational AI.
MIT License
1.25k stars 188 forks source link

Panel of BIRD Annotation Issues. #39

Open huybery opened 1 year ago

huybery commented 1 year ago

Hi all,

Although BIRD has incurred significant annotation costs, we still cannot guarantee that all the data is accurately labeled. We hope that the community can assist us in building BIRD together! You can continuously report any errors you find under this issue, and we will perform a dataset update at a designated time.

Thanks a lot!

Best, Binyuan

tshu-w commented 1 year ago

Should the db_id movies_4 in train/train.json be movie_4 as there is not movies_4 database in train_databases?

tshu-w commented 1 year ago

There is a typo of this sql in train.json (REL -> REAL)

SELECT CAST(SUM(CASE WHEN T2.DESCRIPTION = 'Allergy to grass pollen' THEN 1 ELSE 0 END) AS REL) * 100 / COUNT(T1.patient) FROM patients AS T1 INNER JOIN allergies AS T2 ON T1.patient = T2.PATIENT WHERE T1.birthplace = 'Pembroke MA US'
tshu-w commented 1 year ago

These two queries also cannot be parsed by sqlglot, but I am not sure if it is the wrong SQL statement or the problem of sqlglot.

SELECT T1.first_name, T1.last_name FROM current AS T1 INNER JOIN `current-terms` AS T2 ON T1.bioguide_id = T2.bioguide WHERE T2.party = 'Republican' AND T1.gender_bio = 'F' AND T2.END > DATE() GROUP BY T1.bioguide_id
SELECT max_temperature_f, date FROM weather WHERE max_temperature_f = (SELECT MAX(max_temperature_f) FROM weather WHERE max_temperature_f IS NOT NULL AND max_temperature_f IS NOT '')
ronch99 commented 1 year ago

Incorrect SQLite database for retail_world. It doesn't match the annotated SQL queries, e.g.:

22842219 commented 1 year ago

also I have found there exists corrupted database, for example, coinmarketcap.sqlite

wbbeyourself commented 1 year ago

Incorrect evidence in dev set.

在 california_schools/frpm.csv 中,eligible free rate = Free Meal Count / Enrollment,同时 column 【Percent (%) Eligible Free (K-12)】确实是 【Free Meal Count (K-12)】 / 【Enrollment (K-12)】。

但是在 dev set中第一条测试数据就存在错误:

原始数据:

db_id: california_schools question: What is the highest eligible free rate for K-12 students in the schools in Alameda County? evidence: Eligible free rate for K-12 = FRPM Count (K-12) / Enrollment (K-12) Gold SQL: SELECT FRPM Count (K-12) / Enrollment (K-12) FROM frpm WHERE County Name = 'Alameda' ORDER BY (CAST(FRPM Count (K-12) AS REAL) / Enrollment (K-12)) DESC LIMIT 1

两处错误: (1) evidence 错误,正确的应该为 : Eligible free rate for K-12 = Free Meal Count (K-12) / Enrollment (K-12) (2) Gold SQL 写得复杂了,除法是多此一举。直接写成如下的即可,无需除法。因为 【Percent (%) Eligible Free (K-12)】就代表了 Eligible free rate for K-12

SELECT MAX(Percent (%) Eligible Free (K-12)) FROM frpm WHERE County Name = 'Alameda';

或者

SELECT Free Meal Count (K-12) / Enrollment (K-12) FROM frpm WHERE County Name = 'Alameda' ORDER BY (CAST(Free Meal Count (K-12) AS REAL) / Enrollment (K-12)) DESC LIMIT 1

类似的问题可能还有不少。

accpatrick commented 1 year ago

@tshu-w @wbbeyourself @ronch99 @22842219 Thank you for your interest in our work! We are pleased to announce the release of an updated and cleaner version of the development set. To access the new dev dataset, you can either download it from our website or use the following direct link: https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip.

The latest update includes the following improvements:

  1. Thoroughly reviewed and corrected all gold SQL errors.
  2. Revised Question/SQLs with null results for better clarity and less false positives.
  3. Fixed errors in some columns of the database description files.
  4. Completed the contents of two previously blank tables.

As a result of these enhancements, the ChatGPT (gpt-3.5-turbo) and GPT4 (gpt-4-32k) EX scores have improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. We will update the leaderboard with all new results shortly.

Please note that the test datasets remain unchanged at this time since we have already devoted much more considerable effort in examining errors, and human evaluation during the construction of test datasets. This can ensure its quality.

BTW, our paper has been accepted by NeurIPS 2023 as a Spotlight. Thanks for your support and suggestions! We will keep optimizing our work, thanks!

josem7 commented 1 year ago

In train datast I think this example is wrong

{
    "db_id": "movie_platform",
    "question": "What is the name of the longest movie title? When was it released?",
    "evidence": "longest movie title refers to MAX(LENGTH(movie_title)); when it was released refers to movie_release_year;",
    "SQL": "SELECT movie_title, movie_release_year FROM movies ORDER BY LENGTH(movie_popularity) DESC LIMIT 1"
},

Should be

"SQL": "SELECT movie_title, movie_release_year FROM movies ORDER BY LENGTH(movie_title) DESC LIMIT 1"

josem7 commented 1 year ago
{
    "db_id": "movie_platform",
    "question": "Name the movie with the most ratings.",
    "evidence": "movie with the most rating refers to MAX(SUM(rating_score));",
    "SQL": "SELECT movie_title FROM movies GROUP BY movie_title ORDER BY COUNT(movie_title) DESC LIMIT 1"
},

This example also seems wrong as it should be counting the amount of ratings in the ratings table joined with the movies table (to extract the movie title)
josem7 commented 1 year ago

question 24 in dev dataset might be missing the evidence

{
    "question_id": 24,
    "db_id": "california_schools",
    "question": "Give the names of the schools with the percent eligible for free meals in K-12 is more than 0.1 and test takers whose test score is greater than or equal to 1500?",
    "evidence": "",
    "SQL": "SELECT T2.`School Name` FROM satscores AS T1 INNER JOIN frpm AS T2 ON T1.cds = T2.CDSCode WHERE CAST(T2.`Free Meal Count (K-12)` AS REAL) / T2.`Enrollment (K-12)` > 0.1 AND T1.NumGE1500 > 0",
    "difficulty": "moderate"
},
tshu-w commented 9 months ago

https://arxiv.org/abs/2402.12243

We found that noise in questions and gold queries are preva- lent in the dataset, with varying amounts across domains, and with an uneven distribution be- tween noise types. The presence of incorrect gold SQL queries, which then generate incor- rect gold answers, has a significant impact on the benchmark’s reliability. Surprisingly, when evaluating models on corrected SQL queries, zero-shot baselines surpassed the performance of state-of-the-art prompting methods. We con- clude that informative noise labels and reliable benchmarks are crucial to developing new Text- to-SQL methods that can handle varying types of noise.

accpatrick commented 9 months ago

@tshu-w @wbbeyourself @josem7 @ronch99 Thanks for your kind reminder. We are currently starting the third round of reviews for the development set. This time, we also pay more attention to questions, evidence sentences and database description files instead of just SQLs.

Of course, we will thoroughly consider your findings and make the necessary corrections during this review.

After this, we will also review the training set when our expert students and professors are available. We sincerely appreciate your support and suggestions throughout this process.

It is indeed a challenging and time-consuming task to meticulously go through all examples and ensure their quality especially our data is more complex. We aim to release this version around late March or early April. Thanks for your understanding.

accpatrick commented 9 months ago

@tshu-w, thank you for sharing this insightful paper. We highly value such research that judges the quality of benchmarks. A similar paper from EMNLP (https://arxiv.org/pdf/2310.18538.pdf) also provides a comprehensive examination of Spider, BIRD, and Spider-DK. These papers are crucial for advancing our field.

The paper (https://arxiv.org/abs/2402.12243) presents an important research problem on how to develop robust models to deal with noise in the benchmark. However, some details are unclear, making it difficult for us to fully understand. For instance, 1) the authors mention random sampling of examples and a 49% error rate in the california_schools set. Upon our careful review, we just found 11 out of 89 errors, which we plan to refine. We believe that multiple random samplings, as suggested in this paper (https://arxiv.org/pdf/2305.14210.pdf), would provide more reliable insights due to the large standard deviation in random sampling. Otherwise there may be a biased sampling.

2) The paper doesn't clarify under what conditions questions are ambiguous, i.e., with or without knowledge evidence. Normally, kg evidence originates from question annotators and is collected by experts to reflect user requirements. For example, in Table 4, example 2, the evidence clearly states "A11 refers to average salary", which is already included in the database description file. One feature of our work is to create a more realistic environment for database agent development, since humans also need column descriptions and data value samples to generate precise SQLs. Therefore, with certain knowledge and files, most questions are answerable. We deem that most humans cannot find out the correct columns about average salary among anonymous columns names (A1, A2, ..., A11...). They are usually provided with database files.

3) We mentioned in our paper that synonyms are one form of external knowledge that models should map them to identify corresponding schema items. This is a significant challenge in real text-to-SQL models. The ability to access database values and output valid values is also crucial. Actually, Our goal is to shift the focus of text-to-SQL models from only semantic parsing to understanding values. One easy way is to inspect value samples of databases as (https://arxiv.org/pdf/2204.00498.pdf).

4) The conclusions about GPT 3.5 outperforming other advanced methods after "fixing" via their rules are not very illustrative to us, because GPT models are constantly evolving. It's unclear whether the new GPT 3.5 has memorized our data or if additional reasoning is hard for weaker models 3.5 instead of GPT-4 to understand. We found that GPT-3.5 and Codellama-34B-instruct perform well with clear and less complex prompts in our recent work. However, their performance drops significantly with even additional explanations such as COT or synonyms. Therefore, more models should be involved in experiments. Also, MAC-SQL has the relative improvements on Spider, which contains very few or even no noise in questions. Why its performance is dropped in BIRD after they "fix" these noise.

5) We are not clear about that the distribution of errors are uneven or even, because most models in the leaderboard maintain the same order across dev and test. And Mac-SQL and DIN-SQL has improvements in both datasets.

6) Finally, 100 % clear questions without reference to the database environment including docs do not exist in the world, and can be considered a new method to deal with the realistic questions. Real DB AGIs are expected to search for the related information by observation among dirty values, complex shemas, database files and even external tools like web search by themselves to make the user question intents more clear in the real scenarios. That's why RAG and Tool-augmented research is important in AGI.

Even these, we sincerely appreciate their highlighting of the join conditions issue as Example 3 in which we did pay less attention on this (FK-->PK) during checking. We will pay more attention to in this review round.

In conclusion, we have learned a lot from these benchmark judgement papers and understand what we need to focus on next. We welcome more such papers or feedbacks to promote this field together. Also, we appreciate your support and feedback and will continue to refine our work until the real AGIs in DB are born. Thanks.

nnarodytska commented 8 months ago

Updated (one more inconsistency was found): "question_id": 99 has a possible incorrect ground truth SQL query :

Q: Among the accounts who have loan validity more than 12 months, list out the accounts that have the highest approved amount and have account opening date in 1993."

A (SQL): SELECT T1.account_id FROM loan AS T1 INNER JOIN disp AS T2 ON T1.account_id = T2.account_id WHERE STRFTIME('%Y', T1.date) = '1993' AND T1.duration = 12 ORDER BY T1.amount DESC LIMIT 1

I think the join should be on the "account" table rather than  "disp" and T1.duration > 12 Alos, the question asks to filter on "have account opening date". However, SQL query filters on load date: "STRFTIME('%Y', T1.date) = '1993'".

Suggested correction: SELECT T1.account_id FROM loan AS T1 INNER JOIN account AS T2 ON T1.account_id = T2.account_id WHERE STRFTIME('%Y', T2.date) = '1993' AND T1.duration > 12 ORDER BY T1.amount DESC LIMIT 1

Thanks!

nina


From dev.json

  "question_id": 99,         "db_id": "financial",         "question": "Among the accounts who have loan validity more than 12 months, list out the accounts that have the highest approved amount and have account opening date in 1993.",         "evidence": "Loan validity more than 12 months refers to duration > 12",         "SQL": "SELECT T1.account_id FROM loan AS T1 INNER JOIN disp AS T2 ON T1.account_id = T2.account_id WHERE STRFTIME('%Y', T1.date) = '1993' AND T1.duration = 12 ORDER BY T1.amount DESC LIMIT 1",         "difficulty": "moderate"

nnarodytska commented 8 months ago

"question_id": 95 has an incorrect ground truth SQL query:

Q: "List out the account numbers of clients who are youngest and have highest average salary?",

A (SQL): "SELECT T1.account_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T2.district_id = ( SELECT district_id FROM client ORDER BY birth_date DESC LIMIT 1 ) ORDER BY T2.A11 DESC LIMIT 1"

Multiple accounts are linked with the district that has the highest A11 However, SQL query limits it to 1 (ORDER BY T2.A11 DESC LIMIT 1) which is incorrect.

Suggested correction to output all accounts with the highest average salary:

SELECT T1.account_id, MAX(T2.A11) FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T2.district_id = ( SELECT district_id FROM client ORDER BY birth_date DESC LIMIT 1 ) GROUP BY T2.A11, T1.account_id


From dev.json

{ "question_id": 95, "db_id": "financial", "question": "List out the account numbers of clients who are youngest and have highest average salary?", "evidence": "If the person A's birthdate < B's birthdate, it means that person B is younger than person A; A11 refers to average salary", "SQL": "SELECT T1.account_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T2.district_id = ( SELECT district_id FROM client ORDER BY birth_date DESC LIMIT 1 ) ORDER BY T2.A11 DESC LIMIT 1", "difficulty": "moderate" },

Thanks!

nina

nnarodytska commented 8 months ago

Incorrect description of database "financial". Issue: A4-A7 are specified as text but their types are integers (as A8's type). The type makes a difference when we order by any of these columns as text and integer columns are sorted differently. For example, in "question_id": 115, "db_id": "financial", there is an ordering over A4: "SELECT CAST(SUM(T1.gender = 'M') AS REAL) * 100 / COUNT(T1.client_id) FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T2.A3 = 'south Bohemia' GROUP BY T2.A4 ORDER BY T2.A4 DESC LIMIT 1",

File: llm/data/dev_databases/financial/database_description/district.csv A4,number of inhabitants ,,text, A5,no. of municipalities with inhabitants < 499,municipality < district < region,text, A6,no. of municipalities with inhabitants 500-1999,municipality < district < region,text, A7,no. of municipalities with inhabitants 2000-9999,municipality < district < region,text, A8,no. of municipalities with inhabitants > 10000,municipality < district < region,integer,

Suggested corrections:

A4,number of inhabitants ,,integer, A5,no. of municipalities with inhabitants < 499,municipality < district < region,integer, A6,no. of municipalities with inhabitants 500-1999,municipality < district < region,integer, A7,no. of municipalities with inhabitants 2000-9999,municipality < district < region,integer, A8,no. of municipalities with inhabitants > 10000,municipality < district < region,integer,

nnarodytska commented 8 months ago

_"questionid": 118 has an incorrect ground truth SQL query:

Q: "For loan amount less than USD100,000, what is the percentage of accounts that is still running with no issue.",

E (evidence): "Status = 'C' stands for running contract, ok so far; Percentage of accounts by condition = [(total(amount) & condition) / (total amount)] * 100%.",

A (SQL): 'SELECT CAST(SUM(status = 'C') AS REAL) * 100 / COUNT(amount) FROM loan WHERE amount < 100000',

There are a few issues with SQL:

According to the evidence, we need to divide the total amount of running contract accounts by the total amount (" [(total(amount) & condition) / (total amount)] * 100%."). However, SQL incorrectly

Suggested correction (discussion): It's not clear what the best way to proceed is. The simplest fix would be to correct the SQL. However, I think the actual issue is that the evidence doesn't really correspond to the question. The question asks about the percentage of accounts, but the evidence suggests that we will need to find the percentage of account amounts.


From dev.json

{
    "question_id": 118,
    "db_id": "financial",
    "question": "For loan amount less than USD100,000, what is the percentage of accounts that is still running with no issue.",
    "evidence": "Status = 'C' stands for running contract, ok so far; Percentage of accounts by condition = [(total(amount) & condition) / (total amount)] * 100%.",
    "SQL": "SELECT CAST(SUM(status = 'C') AS REAL) * 100 / COUNT(amount) FROM loan WHERE amount < 100000",
    "difficulty": "moderate"
},
nnarodytska commented 8 months ago

_"questionid": 141 has an incorrect ground truth SQL query:

Q: "Which districts have transactions greater than USS$10,000 in 1997?", A (SQL): "SELECT T1.district_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T1.account_id = T3.account_id WHERE STRFTIME('%Y', T3.date) = '1997' GROUP BY T1.district_id HAVING SUM(T3.amount) > 10000",

An issue with SQL:

The question asks about all transactions that satisfy conditions. However, SQL computes the total amount of transactions per district (GROUP BY T1.district_id HAVING SUM(T3.amount) > 10000) in the WHERE clause. Namely, "HAVING SUM(T3.amount)" computes the sum of transaction amounts per district.

Suggested correction :

"SELECT T1.district_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T1.account_id = T3.account_id WHERE STRFTIME('%Y', T3.date) = '1997' and T3.amount > 10000 GROUP BY T1.district_id"


From dev.json

{
    "question_id": 141,
    "db_id": "financial",
    "question": "Which districts have transactions greater than USS$10,000 in 1997?",
    "evidence": "",
    "SQL": "SELECT T1.district_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T1.account_id = T3.account_id WHERE STRFTIME('%Y', T3.date) = '1997' GROUP BY T1.district_id HAVING SUM(T3.amount) > 10000",
    "difficulty": "simple"
},
nnarodytska commented 8 months ago

_"questionid": 143 has an incorrect ground truth SQL query:

Q: "What are the accounts that have both gold and junior credit cards?", A (SQL): "SELECT T2.account_id FROM card AS T1 INNER JOIN disp AS T2 ON T1.disp_id = T2.disp_id WHERE T1.type IN ('gold', 'junior')",

Issues with SQL:

The question asks about accounts that have both gold and junior credit cards. However, SQL returns accounts that have at least one of them ( IN ('gold', 'junior')) .

Suggested correction (one of many):

"SELECT T2.account_id FROM disp AS T2 
INNER JOIN card AS T1 ON T1.disp_id = T2.disp_id 
INNER JOIN card AS T3 ON T3.disp_id = T2.disp_id 
WHERE T1.card_id = T3.card_id and T1.type = 'gold' and T3.type = 'junior'"

From dev.json { "question_id": 143, "db_id": "financial", "question": "What are the accounts that have both gold and junior credit cards?", "evidence": "", "SQL": "SELECT T2.account_id FROM card AS T1 INNER JOIN disp AS T2 ON T1.disp_id = T2.disp_id WHERE T1.type IN ('gold', 'junior')", "difficulty": "simple" },

nnarodytska commented 8 months ago

_"questionid": 145 has an incorrect ground truth SQL query:

Q: "Who are the account holder identification numbers whose spent per month on the credit card is less than the average, in 1998?" A (SQL): "SELECT T1.Account_id FROM Trans AS T1 INNER JOIN Account AS T2 ON T1.Account_id = T2.Account_id WHERE DATE_FORMAT(T1.date, '%Y') = '1998' AND T1.operation = 'VYBER KARTOU' AND T1.amount< (SELECT AVG(amount) FROM Trans WHERE DATE_FORMAT(date, '%Y') = '1998')",

Issues with SQL:

The question asks about accounts whose spent per month on the credit card is less than the average in 1998 However, SQL returns accounts with credit cards that have transactions with an amount less than the average in 1998

Suggested correction discussion: We need to compute "spent per month on the credit card" to answer this question. Can you please confirm that it was intended to return "spent per month" on the credit card? Thanks!


From dev.json

{
    "question_id": 145,
    "db_id": "financial",
    "question": "Who are the account holder identification numbers whose spent per month on the credit card is less than the average, in 1998?",
    "evidence": "Operation = 'VYBER KARTOU' refers to credit card withdrawal",
    "SQL": "SELECT T1.account_id FROM trans AS T1 INNER JOIN account AS T2 ON T1.account_id = T2.account_id WHERE STRFTIME('%Y', T1.date) = '1998' AND T1.operation = 'VYBER KARTOU' AND T1.amount < (SELECT AVG(amount) FROM trans WHERE STRFTIME('%Y', date) = '1998')",
    "difficulty": "moderate"
},
nnarodytska commented 8 months ago

_"questionid": 162 has an incorrect ground truth SQL query:

Q: "What is the region of the client with the id 3541 from?" A (SQL): "SELECT T2.district_id, T1.A3 FROM district AS T1 INNER JOIN client AS T2 ON T1.district_id = T2.district_id WHERE T2.client_id = 3541",

Issues with SQL:

The question asks about regions only. SQL provides information about region and district id (which is not part of the question)

Suggested correction : "SELECT T1.A3 FROM district AS T1 INNER JOIN client AS T2 ON T1.district_id = T2.district_id WHERE T2.client_id = 3541",


From dev.json

{
    "question_id": 162,
    "db_id": "financial",
    "question": "What is the region of the client with the id 3541 from?",
    "evidence": "A3 refers to region",
    "SQL": "SELECT T2.district_id, T1.A3 FROM district AS T1 INNER JOIN client AS T2 ON T1.district_id = T2.district_id WHERE T2.client_id = 3541",
    "difficulty": "simple"
},
nnarodytska commented 8 months ago

_"questionid": 163 has a possible incorrect ground truth SQL query:

Q: "Which district has the most accounts with loan contracts finished with no problems?" A (SQL): "SELECT T1.district_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN loan AS T3 ON T2.account_id = T3.account_id WHERE T3.status = 'A' GROUP BY T1.district_id ORDER BY COUNT(T2.account_id) DESC LIMIT 1",

Issues with SQL:

The question asks about a district that satisfies a condition ("Which district"). District table has an A2 column which is a district_name. It is logical to output the district name when we asked "Which district..." However, SQL outputs _districtid.

Suggested correction : "SELECT T1.A2 FROM District AS T1 INNER JOIN Account AS T2 ON T1.District_id = T2.District_id INNER JOIN Loan AS T3 ON T2.Account_id = T3.Account_id WHERE T3.status = 'A' GROUP BY T1.District_id ORDER BY COUNT(T2.Account_id) DESC LIMIT 1"


From dev.json { "question_id": 163, "db_id": "financial", "question": "Which district has the most accounts with loan contracts finished with no problems?", "evidence": "status = 'A' refers to loan contracts finished with no problems", "SQL": "SELECT T1.district_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN loan AS T3 ON T2.account_id = T3.account_id WHERE T3.status = 'A' GROUP BY T1.district_id ORDER BY COUNT(T2.account_id) DESC LIMIT 1", "difficulty": "moderate" },

nnarodytska commented 8 months ago

_"questionid": 180 has an incorrect ground truth SQL query:

Q: "Please provide a list of clients who were born between 1983 and 1987 and whose account branch is in East Bohemia, along with their IDs." A (SQL): _"SELECT T1.client_id, T3.account_id FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN account AS T3 ON T2.district_id = T3.district_id WHERE T2.A3 = 'east Bohemia' AND STRFTIME('%Y', T1.birthdate) BETWEEN '1983' AND '1987'",

Issues with SQL: The question asks about clients who have accounts with conditions. However, SQL joins a) client with district on district_id to obtain client-to-district mapping b) with account on district_id. However, it includes accounts that might belong to a different client!

If one runs the ground truth SQL against the database you will get the following pairs (clint_id, account_id) for client 379, for example:

"(379, 56), (379, 237), (379, 311), (379, 351), (379, 446), (379, 562), (379, 590), (379, 757), (379, 1002), (379, 1158), (379, 1199), (379, 1296), (379, 1406), (379, 1437), (379, 1441), (379, 1501), (379, 1528), (379, 1641), (379, 1645), (379, 1810), (379, 1859), (379, 1902), (379, 2162), (379, 2266), (379, 2334), (379, 2584), (379, 2674), (379, 2926), (379, 3175), (379, 3562), (379, 3616), (379, 3661), (379, 3691), (379, 3699), (379, 3835), (379, 3865), (379, 3890), (379, 3945), (379, 4127), (379, 4240), (379, 4503), (379, 5952), (379, 6461), (379, 7713), (379, 8225), (379, 10857), (379, 10940), (379, 11317)"

However, client 379 has only 1 account "311"., so only one pair is correct and others are incorrect.

Suggested correction :

We need to ensure that only valid client_id, account_id pairs are joined. To do so, we need to make sure that we check that a client with _clientid does have an account with _accountid in each output. Here is a suggested correction.

"""SELECT T1.client_id, T3.account_id FROM client AS T1 \ INNER JOIN district AS T2 ON T1.district_id = T2.district_id \ INNER JOIN disp AS T4 ON T1.client_id = T4.client_id \ INNER JOIN account AS T3 ON T2.district_id = T3.district_id and T4.account_id = T3.account_id \ WHERE T2.A3 = 'east Bohemia' AND STRFTIME('%Y', T1.birth_date) BETWEEN '1983' AND '1987'"""


From dev.json

{
    "question_id": 180,
    "db_id": "financial",
    "question": "Please provide a list of clients who were born between 1983 and 1987 and whose account branch is in East Bohemia, along with their IDs.",
    "evidence": "",
    "SQL": "SELECT T1.client_id, T3.account_id FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN account AS T3 ON T2.district_id = T3.district_id WHERE T2.A3 = 'east Bohemia' AND STRFTIME('%Y', T1.birth_date) BETWEEN '1983' AND '1987'",
    "difficulty": "moderate"
},
nnarodytska commented 8 months ago

_"questionid": 181 has an incorrect ground truth SQL query:

Q: "Please provide the IDs of the 3 female clients with the largest loans." A (SQL): "SQL": "SELECT T1.client_id FROM client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN loan AS T3 ON T2.account_id = T3.account_id WHERE T1.gender = 'F' ORDER BY T3.amount DESC LIMIT 3"

Issues with SQL:

Conceptually, the same issue as in 180. First, SQL query finds clients that have accounts using "client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id" However, joining these two tables (client and account) on district_id only is incorrect. _Namely, a client who has an account with district_id = X will be associated with all accounts with districtid = X (including accounts of other clients)

Suggested correction :

The same as in 180. Using the table disp to find valid client account associations.


From dev.json

    {
        "question_id": 181,
        "db_id": "financial",
        "question": "Please provide the IDs of the 3 female clients with the largest loans.",
        "evidence": "Female refers to gender = 'F'",
        "SQL": "SELECT T1.client_id FROM client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN loan AS T3 ON T2.account_id = T3.account_id WHERE T1.gender = 'F' ORDER BY T3.amount DESC LIMIT 3",
        "difficulty": "simple"
    },
nnarodytska commented 8 months ago

"question_id": 186 has an incorrect ground truth SQL query:

Q: "What percentage of male clients request for weekly statements to be issued?", A (SQL): "SELECT CAST(SUM(T1.gender = 'M') AS REAL) * 100 / COUNT(T1.client_id) FROM client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id WHERE T2.frequency = 'POPLATEK TYDNE'",

Issues with SQL: The same issue as 180, 181: joining client and account in districtid only "client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.districtid"


From dev.json

    {
        "question_id": 186,
        "db_id": "financial",
        "question": "What percentage of male clients request for weekly statements to be issued?",
        "evidence": "Percentage of male clients = [count(male clients who requested weekly statements / count(clients who requested weekly statements)] * 100%; Male means gender = 'M'; 'POPLATEK TYDNE' stands for weekly issuance",
        "SQL": "SELECT CAST(SUM(T1.gender = 'M') AS REAL) * 100 / COUNT(T1.client_id) FROM client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id WHERE T2.frequency = 'POPLATEK TYDNE'",
        "difficulty": "moderate"
    },
nnarodytska commented 8 months ago

_"questionid": 189 has an incorrect ground truth SQL query: Q: "Name the account numbers of female clients who are oldest and have lowest average salary?", A (SQL): "SELECT T3.account_id FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN account AS T3 ON T2.district_id = T3.district_id WHERE T1.gender = 'F' ORDER BY T1.birth_date ASC, T2.A11 ASC LIMIT 1",

Issues with SQL: The same issue as above. joining client and account in district_id only


From dev.json

{
    "question_id": 189,
    "db_id": "financial",
    "question": "Name the account numbers of female clients who are oldest and have lowest average salary?",
    "evidence": "Female refers to 'F' in the gender; A11 contains information about average salary",
    "SQL": "SELECT T3.account_id FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN account AS T3 ON T2.district_id = T3.district_id WHERE T1.gender = 'F' ORDER BY T1.birth_date ASC, T2.A11 ASC LIMIT 1",
    "difficulty": "moderate"
},
nnarodytska commented 8 months ago

_"questionid": 192 has an incorrect ground truth SQL query:

Q: "What is the average amount of loan which are still on running contract with statement issuance after each transaction?" A (SQL): "SELECT AVG(T2.payments) FROM account AS T1 INNER JOIN loan AS T2 ON T1.account_id = T2.account_id WHERE T2.status IN ('C', 'D') AND T1.frequency = 'POPLATEK PO OBRATU'"

Issues with SQL: The question asks about the average amount of loan. SQL outputs average payments

Suggested correction :

"SELECT AVG(T2.amount) FROM account AS T1 INNER JOIN loan AS T2 ON T1.account_id = T2.account_id WHERE T2.status IN ('C', 'D') AND T1.frequency = 'POPLATEK PO OBRATU'"


From dev.json

    {
        "question_id": 192,
        "db_id": "financial",
        "question": "What is the average amount of loan which are still on running contract with statement issuance after each transaction?",
        "evidence": "status = 'C' stands for running contract, OK so far; status = 'D' stands for running contract, client in debt. 'POPLATEK PO OBRATU' stands for issuance after transaction",
        "SQL": "SELECT AVG(T2.payments) FROM account AS T1 INNER JOIN loan AS T2 ON T1.account_id = T2.account_id WHERE T2.status IN ('C', 'D') AND T1.frequency = 'POPLATEK PO OBRATU'",
        "difficulty": "moderate"
    },
nnarodytska commented 8 months ago

_"questionid": 129 has an incorrect ground truth SQL query:

"question": "Which are the top ten withdrawals (non-credit card) by district names for the month of January 1996?", "SQL": "SELECT T1.district_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T2.account_id = T3.account_id WHERE T3.type = 'VYDAJ' AND T2.date LIKE '1996-01%' ORDER BY A2 ASC LIMIT 10",

Issues with SQL:

Issue 1 SQL (T2.date LIKE '1996-01%') filters on accounts that were opened in Jan, 1996 Question asks about transations for the month of January 1996 (should be filtered on T3.date)

Issue 2 SQL (SELECT T1.district_id) output district_ids with repetitions. Question asks about output ordered district names without repetitions.

Suggested correction : "SQL": "SELECT DISTINCT T1.A2 FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T2.account_id = T3.account_id WHERE T3.type = 'VYDAJ' AND T3.date LIKE '1996-01%' ORDER BY A2 ASC LIMIT 10",


From dev.json

{
    "question_id": 129,
    "db_id": "financial",
    "question": "Which are the top ten withdrawals (non-credit card) by district names for the month of January 1996?",
    "evidence": "Non-credit card withdraws refers to type = 'VYDAJ'; January 1996 can be found by date LIKE '1996-01%' in the database; A2 means district names",
    "SQL": "SELECT T1.district_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T2.account_id = T3.account_id WHERE T3.type = 'VYDAJ' AND T2.date LIKE '1996-01%' ORDER BY A2 ASC LIMIT 10",
    "difficulty": "moderate"
},
nnarodytska commented 8 months ago

Hi @huybery and @accpatrick,

Thank you for developing and maintaining BIRD!

I went over many of the question/SQL pairs for the financialdataset. Can you please have a look at a few issues I have reported with ground truth SQLs? Most of them would be easy to fix if you agree with them and fixes. However, one issue occurs in many question/ground-truth-SQL pairs and would require more attention. This is a logical issue related to the structure of the database relationships.

There is a M:M relationship between clientsand accountsin the database. It is modeled using the 'disp' table, which has columns client_idand account_idthat connect clients and accounts that are related.

In many questions, we need to find information about clients and their accounts. To answer these questions correctly, we have to join clientsand accounts with 'disp' table to find valid client-account pairs. Many ground truth SQLs answer these questions by joining clientsand accounts with the'district' table on the column 'district_id'. However, this logic is incorrect as it returns records containing unrelated clients and accounts. Please see the example above (for "question_id": 180 ). I have reported a few more questions above; the issue is also occurring in questions 94, 95, 113, 132, 176, and 177 as far as I can see (There might be more questions with this issue).

Please let me know what you think.

Thanks!

nina

freiz commented 7 months ago

Hi BIRD team,

I just wanted to reach out and say a huge thank you for all the amazing work you've put into creating and maintaining the standard dataset.

So, I recently built a solution and decided to sample 30 questions from the dev set to evaluate it. As I was going through the failing questions, I noticed that there were 6 incorrect SQL queries that I'm pretty confident about. That's a bit higher than I expected, to be honest (around 20% or more, However, the significance is limited by the small sample size.).

Id | DB | Question | Evidence | Golden | Comment -- | -- | -- | -- | -- | -- 70 | california_schools | How many active and closed District Community Day Schools are there in the county of Alpine? |   | SELECT COUNT(School) FROM schools WHERE (StatusType = 'Closed' OR StatusType = 'Active') AND County = 'Alpine' | Missed filtering option: 'District Community Day School' 484 | card_games | Please list the Italian names of the cards in the set Coldsnap with the highest converted mana cost. | card set Coldsnap refers to name = 'Coldsnap'; Italian refers to language = 'Italian' | SELECT T2.name FROM foreign_data AS T1 INNER JOIN cards AS T2 ON T2.uuid = T1.uuid INNER JOIN sets AS T3 ON T3.code = T2.setCode WHERE T3.name = 'Coldsnap' AND T1.language = 'Italian' ORDER BY T2.convertedManaCost DESC LIMIT 1 | 1. the selected name is not italian name (not from the foreign_data table)2. should not only limit 1 592 | codebase_community | How many users are awarded with more than 5 badges? | more than 5 badges refers to Count (Name) > 5; user refers to UserId | SELECT UserId FROM ( SELECT UserId, COUNT(Name) AS num FROM badges GROUP BY UserId ) T WHERE T.num > 5 | Not answering the question, should return COUNT 927 | formula_1 | Which driver created the fastest lap speed in a Formula_1 race? Please give both his forename and surname. |   | SELECT T1.forename, T1.surname FROM drivers AS T1 INNER JOIN results AS T2 ON T2.driverId = T1.driverId WHERE T2.fastestLapTime IS NOT NULL ORDER BY T2.fastestLapSpeed DESC LIMIT 1 | fastestLapSpeed is string instead of number, sorting without casting to number is incorrectMinor: filtering fastestLapTime IS NOT NULL does not make a lot sense, however there is no missing data anyway 1029 | european_football_2 | What are the speed in which attacks are put together of the top 4 teams with the highest build Up Play Speed? | speed in which attacks are put together refers to buildUpPlaySpeed;highest build up play speed refers to MAX(buildUpPlaySpeed) | SELECT t1.buildUpPlaySpeed FROM Team_Attributes AS t1 INNER JOIN Team AS t2 ON t1.team_api_id = t2.team_api_id ORDER BY t1.buildUpPlayDribbling ASC LIMIT 4 | Should sort by buildUpPlaySpeed instead of buildUpPlayDribbling based on the Evidence 1259 | thrombosis_prediction | Please list the diseases of the patients born after 1995-1-1 and have a normal Rhuematoid Factor. | diseases refers to Diagnosis; born after 1995/1/1 refers to YEAR(Birthday) > = 1995; normal Rhuematoid Factor refers to RA IN('-', '+-'); | SELECT T1.Diagnosis FROM Patient AS T1 INNER JOIN Laboratory AS T2 ON T1.ID = T2.ID WHERE (T2.RA = '-' OR T2.RA = '+-') AND T1.Birthday > 1995-01-01 | date 1995-01-01 is not quoted, so it compares a date with number 1993 Just to be sure, I double-checked that I'm using the latest data and compared the SQL in `dev.json` and `dev.gold.sql` to make sure everything lines up. But hey, if I've made any silly mistakes or wrong assumptions while working with the data, please give me a heads up! @huybery , @accpatrick
accpatrick commented 6 months ago

@nnarodytska @freiz Thank you sincerely for your insightful feedbacks! We are processing this and your feedbacks are very valuable, hope the more high-quality version will be presented soon. Thanks and stay tuned!

JimXiongGM commented 5 months ago

In the training data example:

{
    "db_id": "address",
    "question": "How many states are in the central time zone? Write their full names.",
    "evidence": "central time zone refers to time_zone = 'Central'",
    "SQL": "SELECT SUM(CASE WHEN T1.time_zone = 'Central' THEN 1 ELSE 0 END) AS count FROM zip_data AS T1 INNER JOIN state AS T2 ON T2.abbreviation = T1.state WHERE T1.time_zone = 'Central'",
}

Using SELECT SUM(CASE WHEN ... is clearly incorrect. The annotated SQL is counting the total number of records that match the condition where time_zone is 'Central', but not answering the question "How many states". For example, a single state may have many different zip_data records.

I think the correct SQL should be:

SELECT COUNT(DISTINCT T2.name)
FROM zip_data AS T1
INNER JOIN state AS T2 ON T1.state = T2.abbreviation
WHERE T1.time_zone = 'Central'

There is a lot of use of CASE WHEN in the training data for filtering. Why not use the WHERE clause instead?

NL2SQL-Empirical commented 4 months ago

Hi, BIRD Team. Thank you for developing and maintaining BIRD!

While thoroughly reviewing the dev set of BIRD Benchmark, we have identified a Foreign Key Integrity problem within the database, which involves records that do not satisfy the Foreign Key constraints. In the california_schools database, satscores.cds references schools.CDSCode. However, we have found that there are entries in satscores.cds that do not exist in schools.CDSCode, which clearly violates the Foreign Key constraint. This issue can lead to mis-judgement, as exemplified by the following scenario in the dev set, question 22:

SELECT sname FROM satscores WHERE cname = 'Contra Costa' AND sname IS NOT NULL ORDER BY NumTstTakr DESC LIMIT 1  -- Gold
SELECT T2.sname FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds WHERE T1.County = 'Contra Costa' ORDER BY T2.NumTstTakr DESC LIMIT 1  -- Pred

Given that satscores.cds references schools.CDSCode, a natural JOIN operation should not result in any loss of records from satscores. However, the row selected by the Gold query does not exist in the following modified query from the Pred:

SELECT cds FROM satscores WHERE cname = 'Contra Costa' AND sname IS NOT NULL ORDER BY NumTstTakr DESC LIMIT 1  -- Modified from Gold
SELECT T1.CDSCode FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.`CDSCode` = T2.`cds`  -- Modified from Pred
-- Gold's cds not in Pred's T1.CDSCode!

Consequently, the Pred query cannot yield the same result as the Gold query. Under normal circumstances, where Foreign Key Integrity is satisfied, the semantics of these two SQL queries should be the same.

Foreign Key Integrity problem can be easily identified by running PRAGMA foreign_key_check; in SQLite3.

Although SQLite3 disables foreign key constraints by default, MySQL and other database systems automatically enable them. We also want to know if there are errors related to this when migrating to these database systems (like your recent work 'mini-dev'), and how this problem can be solved.

Thank you.

hansonw commented 3 months ago

Hi BIRD team - big thanks for working on this benchmark!

Our team recently did a detailed audit of the minidev dataset (500 questions) and found 91 errors (~18.27%) in the gold SQL as well as a further 70 questions (~14%) where we felt that some additional corrections were necessary to avoid ambiguities or inconsistencies between the question and SQL. (Also: question IDs 137 and 138 both appear twice, so there are only 498 unique questions in the minidev dataset).

We are happy to share our corrected version of the dataset with the community in this spreadsheet:

https://docs.google.com/spreadsheets/d/1IGm9Otruey60ujUnl8AOkepY3qgWHdFJHnX7hQGUeCw/edit

As well as a bird_minidev_corrections.json file with updated question, evidence, and SQL fields:

https://drive.google.com/file/d/1iWlYVknwK5wGli5lnwg4stvNzMogjhwj/view

We have also annotated each correction with both a detailed reason for the correction as well as a more fine-grained classification of the correction type and SQL error type. The overall summary stats are available in the spreadsheet, but also attaching a screenshot here:

CleanShot 2024-08-21 at 16 00 09@2x

In total, we made corrections to 161/498 (32.33%) of the questions. In general, we tried to make the smallest modification to the question + SQL to make the result unambiguous and correct, although some questions had to be redesigned slightly to have meaningful results with the data provided (for example, some questions refer to dates that do not exist in the DB.)

The most common classes of errors were the following:

  1. Failure to COUNT DISTINCT (or similar) when calculating counts or ratios after a one-to-many join. For example: asking for the percentage of players that match a certain attribute if players can have multiple attributes. (~5.62% of overall questions)
  2. Discrepancies between the output columns of the query vs. what was requested by the question. Most of these queries are otherwise semantically "correct", but the official EX accuracy score does an exact column match which requires very precise answers and questions (4.62% + 3.82% ~= 8% of overall questions)

As a reference, our 'Arcwise AI' agent submission (still awaiting final results on the test set) is able to achieve about 87% accuracy on the corrected minidev set, vs. only ~65% on the original (uncorrected) minidev set.

We hope this is helpful for others trying to evaluate results on the dev data - we found that the high rate of errors/ambiguities made it very difficult to accurately assess the performance of AI models. Please feel free to contact us directly or comment in the spreadsheet for any clarifications.

erikskalnes commented 2 months ago

Hi BIRD team, Thanks for maintaining and updating BIRD. We are finding it an invaluable resource at Seek for research. We think we may have found a slight error in a gold, and wanted to let you know.

- question_id:
413
- question:
How many cards with print rarity have ruling text printed on 10/01/2009?
- gold_sql:
SELECT COUNT(DISTINCT T1.id)
FROM cards AS T1 INNER
JOIN rulings AS T2
ON T1.uuid = T2.uuid
WHERE T1.rarity = 'rare' AND T2.date = '2009-01-10'
- correct_sql:
SELECT COUNT(DISTINCT T1.id)
FROM cards AS T1 INNER
JOIN rulings AS T2
ON T1.uuid = T2.uuid
WHERE T1.rarity = 'rare' AND T2.date = '2009-10-01'
- reason:
date column is in 'YYYY-MM-DD
gold_execution: [(0,)]
correct_execution: [(564,)]
bird-bench commented 1 month ago

Hi BIRD team - big thanks for working on this benchmark!

Our team recently did a detailed audit of the minidev dataset (500 questions) and found 91 errors (~18.27%) in the gold SQL as well as a further 70 questions (~14%) where we felt that some additional corrections were necessary to avoid ambiguities or inconsistencies between the question and SQL. (Also: question IDs 137 and 138 both appear twice, so there are only 498 unique questions in the minidev dataset).

We are happy to share our corrected version of the dataset with the community in this spreadsheet:

https://docs.google.com/spreadsheets/d/1IGm9Otruey60ujUnl8AOkepY3qgWHdFJHnX7hQGUeCw/edit

As well as a bird_minidev_corrections.json file with updated question, evidence, and SQL fields:

https://drive.google.com/file/d/1iWlYVknwK5wGli5lnwg4stvNzMogjhwj/view

We have also annotated each correction with both a detailed reason for the correction as well as a more fine-grained classification of the correction type and SQL error type. The overall summary stats are available in the spreadsheet, but also attaching a screenshot here:

CleanShot 2024-08-21 at 16 00 09@2x

In total, we made corrections to 161/498 (32.33%) of the questions. In general, we tried to make the smallest modification to the question + SQL to make the result unambiguous and correct, although some questions had to be redesigned slightly to have meaningful results with the data provided (for example, some questions refer to dates that do not exist in the DB.)

The most common classes of errors were the following:

  1. Failure to COUNT DISTINCT (or similar) when calculating counts or ratios after a one-to-many join. For example: asking for the percentage of players that match a certain attribute if players can have multiple attributes. (~5.62% of overall questions)
  2. Discrepancies between the output columns of the query vs. what was requested by the question. Most of these queries are otherwise semantically "correct", but the official EX accuracy score does an exact column match which requires very precise answers and questions (4.62% + 3.82% ~= 8% of overall questions)

As a reference, our 'Arcwise AI' agent submission (still awaiting final results on the test set) is able to achieve about 87% accuracy on the corrected minidev set, vs. only ~65% on the original (uncorrected) minidev set.

We hope this is helpful for others trying to evaluate results on the dev data - we found that the high rate of errors/ambiguities made it very difficult to accurately assess the performance of AI models. Please feel free to contact us directly or comment in the spreadsheet for any clarifications.

Thanks sincerely for your careful check and feedback. Your suggestions are well-received and analyzed by us. We will consider these and involve important components into our next generation of benchmark. Your contributions will be no doubt acknowledged. Thanks again!

NL2SQL-Empirical commented 1 week ago

Hi, BIRD Team. Thank you for developing and maintaining BIRD!

While thoroughly reviewing the dev set of BIRD Benchmark, we have identified several annotation issues. We record all question_id and why it is incorrect in this spreadsheet: https://docs.google.com/spreadsheets/d/1jjE3ph7TdmecAweTc5X0PdMisMgQkEDGJDJ1N4FE0yM/edit?usp=sharing

Hope this is helpful for maintaining and improving the quality of BIRD!

bird-bench commented 1 week ago

Hi, BIRD Team. Thank you for developing and maintaining BIRD!

While thoroughly reviewing the dev set of BIRD Benchmark, we have identified several annotation issues. We record all question_id and why it is incorrect in this spreadsheet: https://docs.google.com/spreadsheets/d/1jjE3ph7TdmecAweTc5X0PdMisMgQkEDGJDJ1N4FE0yM/edit?usp=sharing

Hope this is helpful for maintaining and improving the quality of BIRD!

THanks a million, it's our pleasure to receive such invaluable judgement! We will deeply analyze it and enhance our evaluation. If we choose to include these as part of our new benchmarks, we will acknowledge you. Thanks!