eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License

F.to_date() fails when any format is specified #89

Closed · cristian-marisescu closed 2 weeks ago

cristian-marisescu commented 2 weeks ago

I am using the exact samples from the documentation: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.to_date.html

The first run, without a format argument, succeeds.

But once a format argument is added, it fails:

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

spark = DuckDBSession()

df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
# passing a Spark-style format string is what triggers the failure below
df = df.select(F.to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date'))
print(df.sql())
print(df.show())

It fails with:

SELECT
  CAST(CAST("a1"."t" AS TEXT) AS DATE) AS "date"
FROM (VALUES
  ('1997-02-28 10:30:00')) AS "a1"("t")
Traceback (most recent call last):
  File "/workspaces/playground.py", line 10, in <module>
    print(df.show())
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/dataframe.py", line 1555, in show
    result = self.session._fetch_rows(sql)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 455, in _fetch_rows
    self._execute(sql, quote_identifiers=quote_identifiers)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 427, in _execute
    self._cur.execute(self._to_sql(sql, quote_identifiers=quote_identifiers))
duckdb.duckdb.InvalidInputException: Invalid Input Error: Could not parse string "1997-02-28 10:30:00" according to format specifier "yyyy-MM-dd HH:mm:ss"
1997-02-28 10:30:00
 ^
Error: Literal does not match, expected yyyy-MM-dd HH:mm:ss
eakmanrq commented 2 weeks ago

The format needs to be in DuckDB's format: https://duckdb.org/docs/sql/functions/dateformat.html#format-specifiers
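
For example, your snippet should parse if the format string uses DuckDB's strptime-style specifiers. This is an untested sketch, assuming the format is passed straight through to DuckDB:

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

spark = DuckDBSession()

df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
# '%Y-%m-%d %H:%M:%S' is the DuckDB equivalent of Spark's 'yyyy-MM-dd HH:mm:ss'
df = df.select(F.to_date(df.t, '%Y-%m-%d %H:%M:%S').alias('date'))
df.show()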

If you search that page for to_date, you will find a note about this: https://sqlframe.readthedocs.io/en/stable/duckdb/#functions

See this issue about allowing users to use Spark format on other engines: https://github.com/eakmanrq/sqlframe/issues/56

I wasn't sure whether users would find it more intuitive to use their engine's format or Spark's format. It seems like Spark format might be the preference. Please leave any feedback you have on that issue.

cristian-marisescu commented 2 weeks ago

Thanks for the links. Spark format would indeed be more practical for my situation, since what I'm doing is pretty straightforward:

My current project is on PySpark, and I'm just switching the import statements from pyspark to sqlframe.duckdb to run the existing Spark code on DuckDB; that's about it. It's more or less quick feedback as I try this in between breaks.
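
Roughly, the per-file change is just an import swap along these lines (illustrative, not exact project code):

# before: PySpark
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F

# after: SQLFrame running on DuckDB
from sqlframe.duckdb import DuckDBSession as SparkSession
from sqlframe.duckdb import functions as F

spark = SparkSession()  # DuckDBSession stands in for the real SparkSession; the rest of the code stays unchanged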

Next time I will double-check against the docs and make sure I'm not getting lost between the different functions, as was the case with .size.

Thank you for the work you're doing.

eakmanrq commented 2 weeks ago

Yeah, based on the feedback I'm getting, it seems like using Spark format is more intuitive for users. I will make this change soon. I think I will change the default to Spark format and then later consider allowing users to define formats in their engine's native syntax.
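
The gist would be translating Spark's datetime pattern letters into DuckDB's strptime specifiers, roughly along these lines (a hypothetical sketch, not the actual implementation, and only covering the common tokens):

import re

# Partial map of Spark datetime pattern letters to DuckDB strptime specifiers
_SPARK_TO_DUCKDB = {
    "yyyy": "%Y",
    "MM": "%m",
    "dd": "%d",
    "HH": "%H",
    "mm": "%M",
    "ss": "%S",
}

def spark_format_to_duckdb(fmt: str) -> str:
    # try longer tokens first so "yyyy" is matched before any shorter prefix
    pattern = re.compile("|".join(sorted(_SPARK_TO_DUCKDB, key=len, reverse=True)))
    return pattern.sub(lambda m: _SPARK_TO_DUCKDB[m.group(0)], fmt)

# spark_format_to_duckdb("yyyy-MM-dd HH:mm:ss") -> '%Y-%m-%d %H:%M:%S'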