eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License
174 stars 3 forks source link

duckdb engine, .withColumn after a join doesn't replace the key column, but creates a new one #102

Closed cristian-marisescu closed 1 week ago

cristian-marisescu commented 1 week ago

Hi, Writing the following

import sqlframe.duckdb.functions as F
from sqlframe.duckdb import DuckDBSession

spark = DuckDBSession()

df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
height = spark.createDataFrame([(170, "Alice"), (180, "Bob")], ["height", "name"])

df = df.join(height, how="left", on="name").withColumn("name", F.upper("name"))

print(df.sql())
print(df.show())

creates a third column, instead of replacing the existing one:

SELECT
  CAST("a1"."name" AS TEXT) AS "name",
  CAST("a1"."age" AS BIGINT) AS "age",
  CAST("a2"."height" AS BIGINT) AS "height",
  UPPER(CAST("a1"."name" AS TEXT)) AS "name"
FROM (VALUES
  (2, 'Alice'),
  (5, 'Bob')) AS "a1"("age", "name")
LEFT JOIN (VALUES
  (170, 'Alice'),
  (180, 'Bob')) AS "a2"("height", "name")
  ON CAST("a1"."name" AS TEXT) = CAST("a2"."name" AS TEXT)
+-------+-----+--------+--------+
|  name | age | height | name_3 |
+-------+-----+--------+--------+
| Alice |  2  |  170   | Alice  |
|  Bob  |  5  |  180   |  Bob   |
+-------+-----+--------+--------+

I would've expected same behavior as in pyspark:

| name|age|height|
+-----+---+------+
|ALICE|  2|   170|
|  BOB|  5|   180|
+-----+---+------+

Seems that generating sql is building columns as they come.

Wouldn't switching the algorithm to build CTE be preferable (maybe starting with just joins) ? In my mind, the model would be actually each pyspark dataframe action = one CTE

I see resolving a lot of edge cases + generating cleaner sql in different scenarios, but this is just a first instinct.

Something like

WITH joined as(
SELECT
  CAST("a1"."name" AS TEXT) AS "name",
  CAST("a1"."age" AS BIGINT) AS "age",
  CAST("a2"."height" AS BIGINT) AS "height",
FROM (VALUES
  (2, 'Alice'),
  (5, 'Bob')) AS "a1"("age", "name")
LEFT JOIN (VALUES
  (170, 'Alice'),
  (180, 'Bob')) AS "a2"("height", "name")
  ON CAST("a1"."name" AS TEXT) = CAST("a2"."name" AS TEXT))
SELECT UPPER(CAST("name" AS TEXT)) AS "name", 'rest_of_columns_that_werent_changed' FROM joined