bgunlp / qpl

Code and dataset for the paper "Semantic Decomposition of Question and SQL for Text-to-SQL Parsing" EMNLP Findings 2023 - Ben Eyal et al BGU CS NLP Group
https://www.cs.bgu.ac.il/~elhadad/nlpproj/
2 stars 0 forks source link

Original SQL and CTE SQL mismatch in Spider-QPL dataset #6

Closed sarvam31 closed 7 months ago

sarvam31 commented 7 months ago

Hi,

I recently read your research paper and appreciate the work you and your team have put into the study. The paper is insightful and makes valuable contributions to the field.

I had some observations regarding the Spider-QPL dataset. While exploring train and dev set, I found that for some queries Original SQL and CTE SQL provides different results.

Examples:

  1. { "id":"ce556364c4075bbd0e189663fb5fc3c253cf59225fdff5fb4085fa167fbbdbbd", "db_id":"candidate_poll", "question":"Find the name of the candidates whose oppose percentage is the lowest for each sex.", "cte":"WITH Scan_1 AS ( SELECT People_ID, Oppose_rate FROM candidate_poll.candidate ), Scan_2 AS ( SELECT Sex, People_ID, Name FROM candidate_poll.people ), Join_3 AS ( SELECT Scan_2.Sex, Scan_2.Name, Scan_1.Oppose_rate FROM Scan_1 JOIN Scan_2 ON Scan_1.People_ID = Scan_2.People_ID ), Aggregate_4 AS ( SELECT Sex, Name, MIN(Oppose_rate) AS Min_Oppose_rate FROM Join_3 GROUP BY Sex , Name, Name, Sex ) SELECT * FROM Aggregate_4", }

    Here, as per question group by should have been on sex but in cte query, group by doesn't seem to be consistent with that understanding.

  2. { "id":"e0f60ed999370f8735907bbe27e1dc40ae9fe4e929fb31a12cc67e9f84d28816", "db_id":"world_1", "question":"Return the codes of countries for which Spanish is the predominantly spoken language.", "query":"SELECT CountryCode , max ( Percentage ) AS Max_Percentage FROM world_1.countrylanguage WITH (FORCESCAN) WHERE LANGUAGE = 'Spanish' GROUP BY CountryCode OPTION (HASH JOIN, ORDER GROUP)", "cte":"WITH Scan_1 AS ( SELECT CountryCode, Percentage, Language FROM world_1.countrylanguage WHERE Language = 'spanish' ), Aggregate_2 AS ( SELECT CountryCode, MAX(Percentage) AS Max_Percentage FROM Scan_1 GROUP BY CountryCode ) SELECT * FROM Aggregate_2" }

    In this example, query contains LANGUAGE = 'Spanish' but cte has Language = 'spanish'. Due to case mismatch cte output will be different from original sql's output.

As per my understanding, output of CTE SQL query should always match corresponding original SQL query provided in Spider dataset (from caption of Figure 3 in the paper). Kindly let me know if this is expected. If so, why ?

Additional Data:

Additionally, similar to the Spider-QPL dataset, have you prepared a QPL dataset for BIRD-SQL, and would you be willing to share it with me ?

melhadad commented 7 months ago

Hello Sarvam,

Thanks for your interest in our project.

  1. Regarding your note on the discrepancy between the original Spider SQL and the one in the dataset for id e0f60ed999370f8735907bbe27e1dc40ae9fe4e929fb31a12cc67e9f84d28816

The difference between the original Spider MySQL query and the standard strict SQL (as implemented in MSSQL) is according to the restrictions on the GROUP BY that exist in standard SQL. See for example https://learn.microsoft.com/en-us/sql/t-sql/queries/select-group-by-transact-sql?view=sql-server-ver16

The difference is that when you have a group by aggregate, the select list can only include columns that participate in the group by list or aggregate functions.

In this example, the column NAME cannot appear in the select list if it is not also included in the group by list:

SELECT t1.name , t1.sex , min ( oppose_rate ) AS Min_oppose_rate FROM candidate_poll.people AS t1 JOIN candidate_poll.candidate AS t2 ON t1.people_id = t2.people_id GROUP BY t1.name , t1.sex

In MySQL - this was implemented as:

SELECT t1.name , t1.sex , min ( oppose_rate ) AS Min_oppose_rate FROM candidate_poll.people AS t1 JOIN candidate_poll.candidate AS t2 ON t1.people_id = t2.people_id GROUP BY t1.sex

The MySQL non-strict version is semantically ambiguous - what should the query return in case there are different combinations of name and sex - should they all be grouped according to the sex value and then an arbitrary name among the multiple names would be returned, or in contrast a different row should be returned for each combination of name and sex - but then should the min range over each separate name,sex value or over all the various names with the same sex. Since there is no way to decide this syntax is forbidden in standard SQL.

In this specific schema, we know (based on the semantics of the schema) that for a given name, there should be a single sex value - which means that we can group by (sex, name) - and obtain the intended results.

In general - we did verify that the resultsets returned by the QPL queries converted to CTE on MSSQL match those returned by the original Spider MySQL queries using the same_rs (see validate_qpl.py).

  1. About your second point on the case of "Spanish" vs. "spanish". We decided when converting from MySQL to MSSQL to use a case insensitive sort for the database collation. Therefore, values will match regardless of case variations.

  2. We are in the process of converting BIRD-SQL to QPL - let me know if you want to assist in the effort.

Best regards Michael