eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License
174 stars 3 forks source link

duckdb, parser fails if the column names are part of sql keyword space #112

Closed cristian-marisescu closed 1 week ago

cristian-marisescu commented 1 week ago

Seems like parser is picking up the "end" part and sees it as invalid SQL


from sqlframe.duckdb import DuckDBSession

session = DuckDBSession()

df = session.createDataFrame(
    [("2024-01-01", "2024-05-05"), ("2024-05-21", "2024-12-05")],
    ["start", "end"],
)

print(df.sql())
print(df.show())

result

SELECT
  CAST("a1"."start" AS TEXT) AS "start",
  CAST("a1"."end" AS TEXT) AS "end"
FROM (VALUES
  ('2024-01-01', '2024-05-05'),
  ('2024-05-21', '2024-12-05')) AS "a1"("start", "end")
Traceback (most recent call last):
  File "/workspaces/playground.py", line 18, in <module>
    print(df.show())
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/dataframe.py", line 1576, in show
    result = self.session._fetch_rows(sql)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 458, in _fetch_rows
    self._execute(sql, quote_identifiers=quote_identifiers)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 430, in _execute
    self._cur.execute(self._to_sql(sql, quote_identifiers=quote_identifiers))
duckdb.duckdb.ParserException: Parser Error: syntax error at or near "end

Since the exception is coming from duckdb, I thought that generated sql won't work directly in duckdb, but it does.

duckdb

import duckdb

con = duckdb.connect()

query = """
SELECT
  CAST("a1"."start" AS TEXT) AS "start",
  CAST("a1"."end" AS TEXT) AS "end"
FROM (VALUES
  ('2024-01-01', '2024-05-05'),
  ('2024-05-21', '2024-12-05')) AS "a1"("start", "end");
"""

result = con.execute(query)
print(result.df())

Same thing happens with other SQL specific keywords(that I tested), like "create", "in", "is", "join" and "like"