datafusion-contrib / datafusion-python

Python binding for DataFusion
https://arrow.apache.org/datafusion/python/index.html
Apache License 2.0

Should PyDataFrame.collect() return a Table? #23

Open wjones127 opened 2 years ago

wjones127 commented 2 years ago

Right now it returns List[pa.RecordBatch], but it might be more natural to return a pa.Table. For one thing, a Table has a better repr provided by PyArrow.

matthewmturner commented 2 years ago

Aside from repr, do you see any other advantages?

houqp commented 2 years ago

The current return type keeps the signature in sync with what we have in the Rust core. Perhaps it would be better to add a new method that returns a pa.Table instead.

wjones127 commented 2 years ago

> Aside from repr, do you see any other advantages?

Mostly I was just surprised, coming from PyArrow, but it sounds like Rust usually just represents results as a sequence of record batches.

> Perhaps it would be better to add a new method to return a pa.Table instead.

Yeah, perhaps that's a better path. A to_table() method is common in PyArrow. If we eventually get the Arrow C stream interface implemented in arrow-rs, we could also provide a to_reader().