datafusion-contrib / datafusion-python

Python binding for DataFusion
https://arrow.apache.org/datafusion/python/index.html
Apache License 2.0
59 stars, 12 forks

Question - Can `datafusion-python` be used without pyarrow? #22

Open matthewmturner opened 2 years ago

matthewmturner commented 2 years ago

I feel odd even asking this, but is it possible to make enhancements so that datafusion-python can be used without pyarrow? pyarrow is fantastic and I already use it, but it is fairly large, which makes it somewhat painful to deploy for some serverless use cases (such as AWS Lambda). If I can do everything I need in datafusion, is there still a need for pyarrow? I confess I'm not very familiar with the interface between Rust / datafusion and Python / arrow, so hopefully this isn't too stupid of a question.

thx!

wjones127 commented 2 years ago

I think it might be possible; a good portion of the module doesn't require PyArrow. The only things that do are UDFs, UDAFs, and the parts of the DataFrame API that return PyArrow data structures (like `collect()` and `schema()`). Does a datafusion-python without those features sound appealing?

matthewmturner commented 2 years ago

Cool, that was what it looked like to me as well from my scan of the code. IMHO, in the medium term it would be nice to have pyarrow as an optional feature. I think datafusion should get some improvements on the IO front before enabling this, though (I'm looking into / working on writing capabilities: https://github.com/apache/arrow-datafusion/issues/1777). Right now I think pyarrow has more functionality there, which is useful.
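
For illustration, here is a rough sketch of what "pyarrow as an optional feature" could look like on the Python side. This is not how datafusion-python is implemented; `has_module`, `require_module`, and `to_arrow_table` are hypothetical names made up for this example. The idea is just that pyarrow gets imported lazily, only when a feature that returns PyArrow data is actually called:

```python
import importlib
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module is installed, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_module(name: str, feature: str):
    """Import `name` on demand, or raise an error naming the feature that needs it."""
    if not has_module(name):
        raise ImportError(
            f"{feature} requires the optional dependency '{name}'; "
            "install it to use this feature."
        )
    return importlib.import_module(name)


def to_arrow_table(columns: dict):
    """Hypothetical stand-in for an API that returns PyArrow data structures.

    pyarrow is only imported when this function is called, so code paths
    that never touch it would work without pyarrow installed.
    """
    pa = require_module("pyarrow", "to_arrow_table()")
    return pa.table(columns)
```

With something along these lines, a deployment that never calls the pyarrow-backed functions could leave pyarrow out of the package, and anyone who does call them without it installed gets a clear error instead of a failure at import time.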