eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License

add support for toArrow() #54

Closed. djouallah closed this issue 5 days ago.

djouallah commented 1 month ago

Finally, it was added to Spark:

https://github.com/apache/spark/pull/45481
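
For reference, the new API looks roughly like this (a minimal sketch against the Spark 4.0.0 API added by that PR, where toArrow() collects the result as a pyarrow.Table):

```python
# Minimal sketch of the Spark 4.0.0 API added by apache/spark#45481:
# DataFrame.toArrow() collects the query result as a pyarrow.Table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

arrow_table = df.toArrow()  # pyarrow.Table with columns "id" and "label"
print(arrow_table.schema)
```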

eakmanrq commented 1 month ago

Nice. Looks like this will be added in 4.0.0. I will start planning for how SQLFrame will support this.

djouallah commented 1 month ago

Gentle ping, as I see you already fixed the schema issue.

eakmanrq commented 1 month ago

Yeah, I was planning on starting to support the 4.0.0 API once it is actually released. Can you provide an example of how you want to use toArrow with SQLFrame? I'm assuming it's for DuckDB, right?

djouallah commented 1 month ago

Copy data to a Delta table using the Python writer, which accepts an Arrow table as input.
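
Something like this (a sketch assuming the Python writer here is the deltalake package's write_deltalake(), which accepts a pyarrow.Table as input):

```python
# A sketch assuming the "Python writer" is the deltalake package's
# write_deltalake(), which accepts a pyarrow.Table as input.
import pyarrow as pa
from deltalake import write_deltalake

# In the envisioned flow this table would come from toArrow().
arrow_table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Append the Arrow table to a Delta table at a local path (hypothetical).
write_deltalake("./my_delta_table", arrow_table, mode="append")
```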

eakmanrq commented 1 month ago

Would you be doing this using DuckDB? I'm not sure what you mean by the Python writer.

djouallah commented 1 month ago

Yes, DuckDB.

eakmanrq commented 1 month ago

Thanks for the details @djouallah. I do intend to do this, but I'm just not sure about the timeline right now. I'm still trying to close more gaps in the current DataFrame API coverage first.

djouallah commented 1 month ago

Here is my pitch: let early adopters have this pattern working, raw data --- sqlframe --- arrow --- delta/iceberg, and you can close the remaining gaps in the fullness of time :)
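
Concretely, the pattern would look roughly like this (a sketch only: it assumes SQLFrame's DuckDB session plus a toArrow() method mirroring the Spark 4.0 API, which is exactly what this issue is asking for):

```python
# Sketch of the raw data -> sqlframe -> arrow -> delta pattern, assuming
# SQLFrame's DuckDB session and a toArrow() method mirroring the Spark 4.0
# API (the method requested in this issue; it does not exist yet).
from sqlframe.duckdb import DuckDBSession
from deltalake import write_deltalake

session = DuckDBSession()

# Raw data: let DuckDB read a local CSV file (hypothetical path).
df = session.sql("SELECT * FROM read_csv_auto('raw_events.csv')")

# SQLFrame -> Arrow: the method this issue asks for.
arrow_table = df.toArrow()

# Arrow -> Delta via the Python writer.
write_deltalake("./events_delta", arrow_table, mode="append")
```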

eakmanrq commented 1 month ago

Yeah, I see your point. I'm good with allocating some time to this and seeing what the complexity looks like. I see the value in implementing this pattern. Going to fix some remaining issues and then start exploring this.