eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License

Missing .transform function and .withColumns #45

Closed cristian-marisescu closed 1 month ago

cristian-marisescu commented 1 month ago

Hi, first of all, nice project. I'm really rooting for it as I'm facing the same issues you mentioned.

I started testing it on my codebase, but I quickly ran into missing functions.

I use a lot of .transform: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.transform.html

and .withColumns: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumns.html

And this brings me to my next question: What's the plan in keeping up with updates/functions from either spark or other related engines?

Thanks in advance and again, great work!

eakmanrq commented 1 month ago

Thanks for the feedback!

SQLFrame does support the transform function, although it doesn't work on many engines because they don't support that function themselves: https://github.com/eakmanrq/sqlframe/blob/main/sqlframe/base/functions.py#L1590-L1598

What engine are you running against?

Looks like withColumns is currently missing. Will add that tonight!

> What's the plan in keeping up with updates/functions from either spark or other related engines?

Similar approach to what other projects like SQLGlot do: add initial support for the most common operations, then add more functions as they are requested. The PySpark API is very big, so 100% coverage isn't realistic at first, but I will add features as requested and quickly cover the most common operations. Once it is close to 100%, I can run tests to identify gaps automatically as new versions are released, but it is a bit early in the product's development to achieve that today. The same thinking applies to other engines.
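As an illustration of what an automated gap check could look like, here is a minimal sketch that compares the public attribute names of two classes. It is not SQLFrame's actual test setup: `ReferenceDataFrame` and `CandidateDataFrame` are hypothetical stand-ins for the real classes (for example `pyspark.sql.DataFrame` and an engine-specific SQLFrame DataFrame), and `public_names` and `missing_api` are helper names invented for this example.

```python
def public_names(obj):
    """Return the set of public attribute names (no leading underscore)."""
    return {name for name in dir(obj) if not name.startswith("_")}


def missing_api(reference, candidate):
    """Return the public names that the reference has and the candidate lacks."""
    return sorted(public_names(reference) - public_names(candidate))


# Hypothetical stand-ins; a real check would pass the actual classes instead.
class ReferenceDataFrame:
    def select(self): ...
    def transform(self): ...
    def withColumn(self): ...
    def withColumns(self): ...


class CandidateDataFrame:
    def select(self): ...
    def withColumn(self): ...


gaps = missing_api(ReferenceDataFrame, CandidateDataFrame)
print(gaps)  # ['transform', 'withColumns']
```

A name-level comparison like this only shows that a method exists, not that it behaves the same way, so it would complement behavioral tests rather than replace them.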

cristian-marisescu commented 1 month ago

Thank you for the fast and clear response.

I was running it with DuckDB, with something along these lines:

from sqlframe.duckdb import DuckDBDataFrame
from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

def generic_transformer(df):
    # some actions that produce transformed_df
    return transformed_df

my_initial_df.transform(generic_transformer)

This raises:

TypeError: 'Column' object is not callable

I get the same TypeError when calling .withColumns.

eakmanrq commented 1 month ago

Oh, it looks like you are using the DataFrame transform method: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.transform.html

(not the transform function itself)

I will look at adding that tonight too!
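To make the distinction concrete, here is a rough sketch of what the two DataFrame methods do, written against a toy class rather than a real engine. `ToyFrame` and `add_doubled` are hypothetical names for this example only, and plain Python lists stand in for the Column expressions that PySpark's `withColumns` expects. In PySpark, `DataFrame.transform` hands the whole DataFrame to a user function and returns the result (newer PySpark versions also forward extra arguments), while `functions.transform` applies a function to each element of an array column.

```python
class ToyFrame:
    """Hypothetical DataFrame stand-in: maps column names to lists of values."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def transform(self, func, *args, **kwargs):
        # Method form: pass the whole frame to func and return whatever it returns,
        # which lets custom steps be chained like built-in methods.
        return func(self, *args, **kwargs)

    def withColumns(self, cols_map):
        # Add or replace several columns at once from a name -> values mapping,
        # returning a new frame and leaving the original untouched.
        return ToyFrame({**self.columns, **cols_map})


def add_doubled(df, source, target):
    """Example transformer: add a column holding twice the source values."""
    doubled = [value * 2 for value in df.columns[source]]
    return df.withColumns({target: doubled})


df = ToyFrame({"x": [1, 2, 3]})
result = df.transform(add_doubled, "x", target="x2")
print(result.columns)  # {'x': [1, 2, 3], 'x2': [2, 4, 6]}
```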

In terms of the error, SQLFrame assumes that if you access df.<whatever> and <whatever> is not found, then you must be referencing a column. Since df.transform and df.withColumns are not currently supported, each of them resolves to a Column, and calling that Column gives you that strange error. I will think about how to improve that since it could be a common issue.
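The mechanism can be reproduced in a few lines of plain Python. This is a minimal sketch under the assumption that the fallback is implemented with `__getattr__`. `Column` and `FallbackFrame` here are simplified, hypothetical classes, not SQLFrame's actual code.

```python
class Column:
    """Hypothetical column reference; it deliberately defines no __call__."""

    def __init__(self, name):
        self.name = name


class FallbackFrame:
    """Hypothetical frame that treats any unknown attribute as a column name."""

    def __getattr__(self, name):
        # Python only calls __getattr__ when normal attribute lookup fails,
        # so every unsupported method name ends up here.
        return Column(name)


df = FallbackFrame()
col = df.transform  # no such method, so this silently becomes a Column

try:
    df.transform(lambda frame: frame)  # calling the Column raises TypeError
except TypeError as exc:
    message = str(exc)

print(message)  # 'Column' object is not callable
```

One possible mitigation, not necessarily the one SQLFrame adopted, would be to raise an AttributeError with a clearer message when the requested name is not a known column.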

cristian-marisescu commented 1 month ago

You're right, I just checked now and saw I pasted the wrong thing.

Thank you for all the help, and indeed, +1 to improving the error handling.

eakmanrq commented 1 month ago

Your feedback has been addressed with 1.6.0: https://github.com/eakmanrq/sqlframe/releases/tag/v1.6.0

Please open a new issue for any other problems you run into!