eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License
174 stars 3 forks

F.size() fails on duckdb - Binder Error #87

Closed cristian-marisescu closed 2 weeks ago

cristian-marisescu commented 2 weeks ago

Hi,

F.size is failing because the wrong SQL function is selected when generating DuckDB SQL.

Code

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

spark = DuckDBSession()

initial = spark.createDataFrame(
    [
        ("data1",),
        ("data2",),
    ],
    ["data_column"],
)

size_test = initial.select("*", F.size("data_column").alias("size_of_data_columns"))
size_test.show()

Error

Traceback (most recent call last):
  File "/workspaces/playground.py", line 18, in <module>
    size_test.show()
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/dataframe.py", line 1555, in show
    result = self.session._fetch_rows(sql)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 455, in _fetch_rows
    self._execute(sql, quote_identifiers=quote_identifiers)
  File "/workspaces/.venv/lib/python3.10/site-packages/sqlframe/base/session.py", line 427, in _execute
    self._cur.execute(self._to_sql(sql, quote_identifiers=quote_identifiers))
duckdb.duckdb.BinderException: Binder Error: No function matches the given name and argument types 'array_length(VARCHAR)'. You might need to add explicit type casts.
        Candidate functions:
        array_length(ANY[]) -> BIGINT
        array_length(ANY[], BIGINT) -> BIGINT

LINE 1: ...data_column)), t22909785 AS (SELECT *, ARRAY_LENGTH(data_column) AS size_of_da...

Generated SQL from size_test.sql():

SELECT
  CAST("a1"."data_column" AS TEXT) AS "data_column",
  ARRAY_LENGTH(CAST("a1"."data_column" AS TEXT)) AS "size_of_data_columns"
FROM (VALUES
  ('data1'),
  ('data2')) AS "a1"("data_column")

It seems the wrong SQL is generated. I think it should be LENGTH, not ARRAY_LENGTH; running the SQL with LENGTH works in DuckDB.

eakmanrq commented 2 weeks ago

size is for array length: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.size.html

The function you want is length: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.length.html