eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License

Implicit automatic conversion in pyspark will throw an error in duckdb #98

Closed cristian-marisescu closed 1 week ago

cristian-marisescu commented 1 week ago

Hi,

Not sure of the direction here, but I'm just dropping this to see where the discussion goes. PySpark has a lot of automatic conversions between types, which in turn can lead developers to write code without explicitly casting the type of a column. It's more of an edge case, really, than generalized behavior.

For the following code:

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

spark = DuckDBSession()

df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
df = df.withColumn(
    "combination_of_number_and_string",
    F.when(F.col("age") < 4, F.lit("under 4")).otherwise(F.col("age")),
)

print(df.sql())
print(df.show())

will throw

SELECT
  CAST("a1"."age" AS BIGINT) AS "age",
  CAST("a1"."name" AS TEXT) AS "name",
  CASE
    WHEN CAST("a1"."age" AS BIGINT) < 4
    THEN 'under 4'
    ELSE CAST("a1"."age" AS BIGINT)
  END AS "combination_of_number_and_string"
FROM (VALUES
  (2, 'Alice'),
  (5, 'Bob')) AS "a1"("age", "name")
Traceback (most recent call last):
  File "/workspaces/playground.py", line 19, in <module>
    print(df.show())
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/dataframe.py", line 1576, in show
    result = self.session._fetch_rows(sql)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 455, in _fetch_rows
    self._execute(sql, quote_identifiers=quote_identifiers)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 427, in _execute
    self._cur.execute(self._to_sql(sql, quote_identifiers=quote_identifiers))
duckdb.duckdb.ConversionException: Conversion Error: Could not convert string 'under 4' to INT64
LINE 1: ...LECT age, name, CASE WHEN age < 4 THEN 'under 4' ELSE age END AS combination_o...

This would work in PySpark: "under 4" never fails to be cast to int, because 5 is cast to string instead.

+---+-----+--------------------------------+                                    
|age| name|combination_of_number_and_string|
+---+-----+--------------------------------+
|  2|Alice|                         under 4|
|  5|  Bob|                               5|
+---+-----+--------------------------------+
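The difference between the two engines can be mimicked in plain Python. This is a simplified model, not either engine's actual type resolver: DuckDB (roughly) resolves the CASE to the ELSE branch's BIGINT type, so every output must parse as an integer, while Spark widens the integer branch to string.

```python
def duckdb_style(ages):
    # DuckDB-like resolution (simplified): the result type comes from the
    # non-string branch, so every output must convert to an integer.
    out = []
    for age in ages:
        value = "under 4" if age < 4 else age
        out.append(int(value))  # raises ValueError on 'under 4'
    return out


def spark_style(ages):
    # Spark-like resolution (simplified): the int branch widens to string,
    # so mixed outputs are all strings.
    return ["under 4" if age < 4 else str(age) for age in ages]


print(spark_style([2, 5]))  # ['under 4', '5']
try:
    duckdb_style([2, 5])
except ValueError as exc:
    print("conversion error:", exc)
```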

Of course, being explicit about the types, by specifying

.otherwise(F.col("age").cast("str")),

works with no issue; the SQL casts the final ELSE to TEXT:

SELECT
  CAST("a1"."age" AS BIGINT) AS "age",
  CAST("a1"."name" AS TEXT) AS "name",
  CASE
    WHEN CAST("a1"."age" AS BIGINT) < 4
    THEN 'under 4'
    ELSE CAST(CAST("a1"."age" AS BIGINT) AS TEXT)
  END AS "combination_of_number_and_string"
FROM (VALUES
  (2, 'Alice'),
  (5, 'Bob')) AS "a1"("age", "name")
+-----+-------+----------------------------------+
| age |  name | combination_of_number_and_string |
+-----+-------+----------------------------------+
|  2  | Alice |             under 4              |
|  5  |  Bob  |                5                 |
+-----+-------+----------------------------------+

What do you think? Part of me wants full 1:1 compatibility with PySpark behavior, since that makes it so easy to switch between engines. The other part says "just be explicit about the types".

eakmanrq commented 1 week ago

Yeah, so PySpark implicitly casts mixed types to a type that all the candidate types can be coerced to (if one exists). You can't coerce a string to an int, but you can coerce an int to a string, so it falls back to that.

I think to handle this I would need, for each possible output of a CASE statement, logic that checks the types of the other potential outcomes and casts to the common coercible type if a mix is found. I actually confirmed that PySpark does this even if not all outcomes can occur: in your example, if you change the age check to < 0 (which never matches), the column is still cast to string.
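That type-resolution step could be sketched roughly like this. All names here are hypothetical, not SQLFrame's actual code, and the widening chain is a simplification of Spark's real coercion rules:

```python
from __future__ import annotations

# Simplified implicit-widening chain: each type may widen to the next.
# (Assumption for illustration; Spark's real coercion lattice is larger.)
WIDENING = {"int": "bigint", "bigint": "double", "double": "string", "string": None}
ORDER = ["int", "bigint", "double", "string"]


def widenings(t: str) -> list[str]:
    """All types t can implicitly widen to, including itself."""
    chain = [t]
    while WIDENING.get(chain[-1]):
        chain.append(WIDENING[chain[-1]])
    return chain


def common_type(branch_types: list[str]) -> str | None:
    """Narrowest type every CASE branch can widen to, or None if incompatible."""
    candidates = set(widenings(branch_types[0]))
    for t in branch_types[1:]:
        candidates &= set(widenings(t))
    if not candidates:
        return None
    return min(candidates, key=ORDER.index)


# A mixed string/int CASE coerces to string, matching PySpark's behavior:
print(common_type(["string", "bigint"]))  # string
print(common_type(["int", "bigint"]))     # bigint
```

The key point is that the check runs over every branch's static type, not just the branches a given row actually hits, which matches the `< 0` observation above.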

I'm leaning towards that complexity not being worth the benefit, since the fix, as you pointed out, is very simple, and you could argue that making this implicit behavior explicit is a good thing. I do see this eventually being fixed, though, as SQLFrame generally becomes more type-aware.

cristian-marisescu commented 1 week ago

Agree. Given the reward vs. complexity, the easy fix, and the best-practice angle, we can close the topic for the moment and revisit it another time. Thank you very much for all of your work!