datafusion-contrib / datafusion-python

Python binding for DataFusion
https://arrow.apache.org/datafusion/python/index.html
Apache License 2.0

Draft PyArrow Dataset reader impl #21

Closed: wjones127 closed this 2 years ago

wjones127 commented 2 years ago

Work in progress. Working toward being able to stream record batches from a PyArrow dataset.

Fixes #10.

wjones127 commented 2 years ago

FYI I am going to set this aside for now, since I think this really needs the Arrow C Stream Interface to be reliable. Right now it just wraps a Python iterator and holds onto the GIL while waiting for the PyArrow scanner to generate each batch. I'm running into some GIL deadlocks, so it would be nice to eliminate the GIL stuff from record batch streaming.

kylebrooks-8451 commented 2 years ago

I believe I have a working solution for this that I developed for the company I work for. I will get a PR out there soon. Is there still a need for this?

wjones127 commented 2 years ago

> I believe I have a working solution for this that I developed for the company I work for. I will get a PR out there soon. Is there still a need for this?

This was mostly an experimental curiosity, but a PR would be cool if you are willing :)

I probably won't get around to finishing this for a while.

kylebrooks-8451 commented 2 years ago

@wjones127 I've created a PR, #59

I couldn't add you as a reviewer after I made the PR but I'd love to have your feedback on it.