eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License
174 stars 3 forks source link

Multiple joins on same key throws ambiguous error #103

Closed cristian-marisescu closed 1 week ago

cristian-marisescu commented 1 week ago

Hi,

When doing join chains, on the same key, the key becomes ambiguous.


from sqlframe.duckdb import DuckDBSession

spark = DuckDBSession()

df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
height = spark.createDataFrame([(170, "Alice"), (180, "Bob")], ["height", "name"])
location = spark.createDataFrame(
    [("USA", "Alice"), ("Europe", "Bob")], ["location", "name"]
)

df = df.join(height, how="left", on="name").join(location, how="left", on="name")

print(df.sql())
print(df.show())

throws

File "/workspaces/playground.py", line 17, in <module>
    df = df.join(height, how="left", on="name").join(location, how="left", on="name")
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/operations.py", line 48, in wrapper
    df: t.Union[_BaseDataFrame, _BaseGroupedData] = func(self, *args, **kwargs)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/dataframe.py", line 875, in join
    raise ValueError(
ValueError: Column name is ambiguous. Please specify the table name.

This behavior is working in pyspark, producing

+-----+---+------+--------+                                                     
| name|age|height|location|
+-----+---+------+--------+
|Alice|  2|   170|     USA|
|  Bob|  5|   180|  Europe|
+-----+---+------+--------+