apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Consolidate FileIO #310

Closed kevinjqliu closed 2 weeks ago

kevinjqliu commented 7 months ago

Feature Request / Improvement

Can we consolidate and standardize FileIO to the PyArrow implementation?

There are currently two different FileIO implementations, ARROW_FILE_IO and FSSPEC_FILE_IO. ARROW_FILE_IO uses Apache Arrow's Filesystem Interface while FSSPEC_FILE_IO uses the fsspec library.

Here are a few reasons for consolidating:

  1. PyArrow is already preferred over FsSpec for various FS implementations. https://github.com/apache/iceberg-python/blob/cd7fb502900a717d6b902a398b267eb10e4faa9b/pyiceberg/io/__init__.py#L273-L282

  2. PyIceberg is becoming more tightly coupled with PyArrow: to_arrow() and pa.Table are widely used for reading and writing, including in the new feature #305

  3. Consolidating removes the need to keep the two FileIO implementations' behavior in sync. For example, FsSpec defaults a path with no scheme (/tmp/warehouse) to the file scheme, but PyArrow does not. See #301

  4. The two FileIO implementations are not that different from one another. FsSpec can use its underlying FS implementations, including LocalFileSystem, S3FileSystem, GCSFileSystem, and AzureBlobFileSystem, while PyArrow uses its own FS implementations, including LocalFileSystem, S3FileSystem, HadoopFileSystem, and GcsFileSystem. PyArrow is currently missing the HadoopFileSystem implementation but it has support for HDFS.

  5. FsSpec and PyArrow can be used in both directions: PyArrow can use an fsspec-based filesystem, and FsSpec can wrap a PyArrow filesystem.
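
To make point 1 concrete, here is a minimal sketch of a scheme-to-implementation preference lookup. It is not copied from the linked PyIceberg code: the labels `ARROW` and `FSSPEC`, the `SCHEME_PREFERENCE` table, and the `candidates_for` helper are all hypothetical names, and the table contents are illustrative only.

```python
from urllib.parse import urlparse

# Hypothetical labels standing in for the two FileIO implementations.
ARROW = "arrow"
FSSPEC = "fsspec"

# Illustrative preference table: for each URI scheme, implementations are
# listed in the order they would be tried. Not PyIceberg's actual mapping.
SCHEME_PREFERENCE = {
    "s3": [ARROW, FSSPEC],
    "gs": [ARROW, FSSPEC],
    "file": [ARROW, FSSPEC],
    "hdfs": [ARROW],
    "abfs": [FSSPEC],
}


def candidates_for(location: str, available: set) -> list:
    """Return the installed implementations for a location, in preference order."""
    scheme = urlparse(location).scheme
    preferred = SCHEME_PREFERENCE.get(scheme, [])
    return [impl for impl in preferred if impl in available]
```

With both libraries installed, the first entry wins; if only one is installed, the lookup falls through to it. Unknown schemes (and, in this sketch, paths with no scheme at all) yield an empty list.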

Fokko commented 7 months ago

What would be your proposal? The FileIO is an abstraction layer to use different implementations for your needs. For example, fsspec is lightweight compared to Arrow and might be preferred if you are inside of a lambda/cloud function or in an orchestration engine like Apache Airflow. As you mentioned, Arrow is more equipped to read tables. Next to that, PyIceberg is designed to be used as a library as part of a query engine. If that query engine prefers a different implementation to fetch the data from an object store, the FileIO abstraction layer allows for that.
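
A rough sketch of what such an abstraction layer buys you, using only the standard library. The `FileIO` interface below is a deliberately simplified, hypothetical one (PyIceberg's real FileIO has a different shape), and `LocalFileIO`, `InMemoryFileIO`, and `copy_file` are made-up names for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class FileIO(ABC):
    """Hypothetical, simplified interface; not PyIceberg's actual FileIO."""

    @abstractmethod
    def read_bytes(self, location: str) -> bytes:
        ...

    @abstractmethod
    def write_bytes(self, location: str, data: bytes) -> None:
        ...


class LocalFileIO(FileIO):
    """Backed by the local filesystem."""

    def read_bytes(self, location: str) -> bytes:
        return Path(location).read_bytes()

    def write_bytes(self, location: str, data: bytes) -> None:
        path = Path(location)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)


class InMemoryFileIO(FileIO):
    """Backed by a dict; stands in for any alternative backend."""

    def __init__(self) -> None:
        self._blobs = {}

    def read_bytes(self, location: str) -> bytes:
        return self._blobs[location]

    def write_bytes(self, location: str, data: bytes) -> None:
        self._blobs[location] = data


def copy_file(io: FileIO, src: str, dst: str) -> int:
    """Caller code depends only on the interface, not on the backend."""
    data = io.read_bytes(src)
    io.write_bytes(dst, data)
    return len(data)
```

The point is that `copy_file` never changes when the backend does, which is the property a query engine relies on when it supplies its own implementation.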

kevinjqliu commented 6 months ago

I see. I was under the assumption that PyArrow could completely replace fsspec. But it seems like there are a few use cases where we would prefer fsspec.

"fsspec is lightweight compared to Arrow"

Looks like this is right; fsspec is a fraction of the size. https://pypi.org/project/fsspec/#files https://pypi.org/project/pyarrow/#files

Going forward, I think we can address (3) above and refactor fsspec and pyarrow to have the same specs and behaviors. And maybe also address (5) so that we can interchange fsspec and pyarrow easily.
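
For (3), one possible way to get the same behavior from both implementations is to normalize locations before either one sees them. This is only a sketch under assumptions: `normalize_location` is a hypothetical helper, not an existing PyIceberg function, and it assumes POSIX-style absolute paths.

```python
from urllib.parse import urlparse


def normalize_location(location: str, default_scheme: str = "file") -> str:
    """Give scheme-less absolute paths an explicit scheme; leave URIs untouched."""
    if urlparse(location).scheme:
        # Already has a scheme (s3://, gs://, file://, ...), so keep it as-is.
        return location
    if not location.startswith("/"):
        # Relative paths are ambiguous here, so this sketch refuses them.
        raise ValueError(f"expected an absolute path or a URI, got {location!r}")
    return f"{default_scheme}://{location}"
```

Calling `normalize_location("/tmp/warehouse")` gives `file:///tmp/warehouse`, so both FileIO implementations would receive an explicit file scheme instead of each deciding on its own.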

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 2 weeks ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'