eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License

Conversion between python string and pyspark/duckdb column #99

Closed: cristian-marisescu closed this issue 1 week ago

cristian-marisescu commented 1 week ago

Hi,

Plain Python strings, which PySpark would automatically convert to lit() columns, are mistaken for actual column names by the DuckDB engine.

Take the following code:

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

spark = DuckDBSession()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
df = df.withColumn(
    "python_string_to_pyspark_string",
    F.when(F.col("name") == "Alice", "Something").otherwise("Default"),
)

print(df.sql())
df.show()

This fails with:

 File "/workspaces/playground.py", line 17, in <module>
    print(df.sql())
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/dataframe.py", line 588, in sql
    exp.Select, self.session._optimize(select_expression, dialect=dialect)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 422, in _optimize
    return optimize(expression, dialect=dialect, schema=self.catalog._schema)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlglot/optimizer/optimizer.py", line 92, in optimize
    optimized = rule(optimized, **rule_kwargs)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlglot/optimizer/qualify.py", line 90, in qualify
    validate_qualify_columns_func(expression)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlglot/optimizer/qualify_columns.py", line 102, in validate_qualify_columns
    raise OptimizeError(f"Column '{column}' could not be resolved{for_table}")
sqlglot.errors.OptimizeError: Column '"alice"' could not be resolved

That's because the == "Alice" is interpreted as a column name rather than as a literal (lit) column.

PySpark automatically converts == "Alice" to == F.lit("Alice"), and writing the F.lit explicitly works with DuckDB as well.
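
For reference, here is a minimal sketch of that workaround applied to the example above, wrapping every plain string in F.lit so nothing is parsed as a column reference. It only uses the public sqlframe.duckdb API already shown in the repro:

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

spark = DuckDBSession()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
df = df.withColumn(
    "python_string_to_pyspark_string",
    # Wrapping each plain Python string in F.lit makes the literal explicit,
    # so the DuckDB engine no longer tries to resolve it as a column.
    F.when(F.col("name") == F.lit("Alice"), F.lit("Something")).otherwise(
        F.lit("Default")
    ),
)

print(df.sql())
df.show()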

What do you think?

eakmanrq commented 1 week ago

Thanks for sharing this. I didn't realize this is how the DataFrame API handles this case. The fix will be in the next release.
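
For anyone curious what this kind of fix usually looks like, here is an illustrative sketch only, not the actual patch: arguments to comparisons and when/otherwise can be coerced so that Column objects pass through unchanged and any other Python value is wrapped in lit(). The Column import path and helper name below are assumptions, not sqlframe internals.

from sqlframe.base.column import Column  # assumed location of the Column class
from sqlframe.duckdb import functions as F


def ensure_column(value):
    # Hypothetical helper: Column expressions pass through untouched, while
    # plain Python values (str, int, ...) become literals instead of being
    # parsed as column references.
    if isinstance(value, Column):
        return value
    return F.lit(value)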