apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

[Bug] Cannot use PyIceberg with multiple FS #1041

Open kevinjqliu opened 3 months ago

kevinjqliu commented 3 months ago

Apache Iceberg version

main (development)

Please describe the bug 🐞

PyIceberg assumes the same FS implementation is used for reading both metadata and data. However, I want to use a catalog with local FS as the warehouse while referencing S3 files as data.

See this example Jupyter notebook to reproduce

Problem

The fs implementation is determined by the metadata location and then passed down to the function that reads the data files. https://github.com/apache/iceberg-python/blob/d8b5c17cadbc99e53d08ade6109283ee73f0d83e/pyiceberg/io/pyarrow.py#L1428-L1430

Possible solution

Determine the fs implementation from the path of the file currently being read or written.
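
A minimal sketch of that idea, assuming fsspec as the backing library; `resolve_fs_for_path` is a hypothetical helper, not existing PyIceberg code:

```python
# Hypothetical helper: pick the filesystem from the path of the file being
# accessed instead of inheriting the one resolved from the metadata location.
from urllib.parse import urlparse

import fsspec


def resolve_fs_for_path(path: str):
    """Return an fsspec filesystem inferred from the path's scheme."""
    scheme = urlparse(path).scheme or "file"
    return fsspec.filesystem(scheme)  # "s3" requires s3fs to be installed


metadata_fs = resolve_fs_for_path("file:///tmp/warehouse/db/tbl/metadata/v1.metadata.json")
data_fs = resolve_fs_for_path("s3://bucket/db/tbl/data/part-0.parquet")
```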

Fokko commented 3 months ago

This is a good point; I've heard that folks store their metadata on HDFS and the data itself on S3.

I don't think the example with add-files is the best one; it would be better to support the write.data.path and write.metadata.path configuration properties.
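
For reference, a sketch of what that configuration could look like once PyIceberg honours these properties. The catalog name and schema below are made up, and today these properties do not yet switch the FileIO per path:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

catalog = load_catalog("default")  # assumes a configured catalog named "default"
table = catalog.create_table(
    identifier="db.events",
    schema=Schema(NestedField(field_id=1, name="id", field_type=StringType(), required=True)),
    properties={
        # metadata stays next to the local warehouse, data goes to object storage
        "write.metadata.path": "file:///tmp/warehouse/db/events/metadata",
        "write.data.path": "s3://my-bucket/db/events/data",
    },
)
```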

kevinjqliu commented 2 months ago

Oh interesting, thanks! Here's the config definition for write.data.path and write.metadata.path https://iceberg.apache.org/docs/latest/configuration/#write-properties

I also found an old issue referencing this #161

Another example is using both the local FS and S3 (MinIO), which might be easier to set up and test against.
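
A sketch of that local setup with SqlCatalog, assuming a MinIO container on localhost:9000; the credentials are placeholders and the s3.* keys follow PyIceberg's FileIO configuration:

```python
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "local",
    **{
        "uri": "sqlite:////tmp/warehouse/catalog.db",  # catalog database on the local FS
        "warehouse": "file:///tmp/warehouse",          # table metadata on the local FS
        "s3.endpoint": "http://localhost:9000",        # MinIO standing in for S3
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)
```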

TiansuYu commented 2 months ago

I will have a look at this issue.

TiansuYu commented 2 months ago

A general question: why are we implementing a custom https://github.com/apache/iceberg-python/blob/dc6d2429aafbffc626cba53aaac3f6198fc37eb3/pyiceberg/io/__init__.py#L320-L329

https://github.com/apache/iceberg-python/blob/dc6d2429aafbffc626cba53aaac3f6198fc37eb3/pyiceberg/io/__init__.py#L290-L300

instead of using fsspec.core.url_to_fs(file_path)[0] directly?

(On a side note: this looks a bit confusing to me; why is FSSPEC_FILE_IO not added as a viable io method for gs, given that we have added gcsfs to the extras?)

(Also, I am not sure why pyarrow does not use fsspec as its io layer but implements things on its own.)

EDIT: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.url_to_fs is a slightly better choice than fsspec.core.get_fs_token_paths
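
For illustration, what url_to_fs returns (a local path is shown; with an s3:// URL and s3fs installed it returns an S3FileSystem instead):

```python
from fsspec.core import url_to_fs

fs, path = url_to_fs("file:///tmp/warehouse/metadata/v1.metadata.json")
print(type(fs).__name__)  # LocalFileSystem
print(path)               # /tmp/warehouse/metadata/v1.metadata.json
```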

TiansuYu commented 2 months ago

Read my comment here for the cause of the issue.

I don't think fixing SqlCatalog alone is the proper answer to this bug. The io layer seems poorly designed to me and has to be fixed somewhere at a higher level (e.g. FsspecInputFile or InputFile).

Let me know what you think; then we can come up with a way to properly address this. 😄

kevinjqliu commented 2 months ago

Thanks for taking a look at this @TiansuYu

why are we implementing a custom resolution instead of using fsspec.core.url_to_fs directly?

I think custom scheme parsing avoids picking one library over another (fsspec vs pyarrow). fsspec.core.get_fs_token_paths seems like an interesting replacement when using fsspec.
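
Rough sketch of that idea: parse the scheme once and map it to FileIO implementation names, so neither fsspec nor pyarrow has to be imported (or chosen) up front. The mapping below is illustrative, not PyIceberg's actual table:

```python
from urllib.parse import urlparse

SCHEME_TO_FILE_IO = {
    "file": ["pyiceberg.io.pyarrow.PyArrowFileIO", "pyiceberg.io.fsspec.FsspecFileIO"],
    "s3": ["pyiceberg.io.pyarrow.PyArrowFileIO", "pyiceberg.io.fsspec.FsspecFileIO"],
    "gs": ["pyiceberg.io.pyarrow.PyArrowFileIO"],
}


def candidate_file_ios(path: str) -> list[str]:
    """Return FileIO candidates for a path without importing any of them."""
    scheme = urlparse(path).scheme or "file"
    return SCHEME_TO_FILE_IO.get(scheme, [])
```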

(On a side note: this looks a bit confusing to me; why is FSSPEC_FILE_IO not added as a viable io method for gs, given that we have added gcsfs to the extras?)

Good catch, fsspec should be included since GCS is supported https://github.com/apache/iceberg-python/blob/dc6d2429aafbffc626cba53aaac3f6198fc37eb3/pyiceberg/io/fsspec.py#L161-L176

(Also, I am not sure why pyarrow does not use fsspec as its io layer but implements things on its own.)

pyarrow uses its own native fs implementations https://github.com/apache/iceberg-python/blob/dc6d2429aafbffc626cba53aaac3f6198fc37eb3/pyiceberg/io/pyarrow.py#L348-L403
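
For comparison, pyarrow's native resolution looks roughly like this; FileSystem.from_uri is the pyarrow analogue of fsspec's url_to_fs:

```python
from pyarrow import fs

filesystem, path = fs.FileSystem.from_uri("file:///tmp/warehouse/data/part-0.parquet")
print(type(filesystem).__name__)  # LocalFileSystem (pyarrow's native implementation)
print(path)                       # /tmp/warehouse/data/part-0.parquet
# An s3:// URI resolves to pyarrow's own S3FileSystem, independent of s3fs.
```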

kevinjqliu commented 2 months ago

I don't think fixing SqlCatalog alone is the proper answer to this bug. The io layer seems poorly designed to me and has to be fixed somewhere at a higher level (e.g. FsspecInputFile or InputFile).

Yeah, the main issue is the assumption that the same io (and fs implementation) is used for reading both data and metadata files. The example you pointed to passes in the io parameter

https://github.com/apache/iceberg-python/blob/dc6d2429aafbffc626cba53aaac3f6198fc37eb3/pyiceberg/table/__init__.py#L655-L657

Instead, we would want to recreate io/fs based on the file currently being processed.

Here's another example of passing in the io parameter on the write path https://github.com/apache/iceberg-python/blob/dc6d2429aafbffc626cba53aaac3f6198fc37eb3/pyiceberg/table/__init__.py#L530-L532

kevinjqliu commented 2 months ago

Generally, this problem should go away if we re-evaluate the fs and io each time a file is read or written. In other words, we should stop passing the io parameter around.
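
A hypothetical shape of that fix: an input file that resolves its own filesystem from its own location, instead of receiving a table-level io. Illustrative only; this is not PyIceberg's InputFile API:

```python
from urllib.parse import urlparse

import fsspec


class PerFileInputFile:
    """Illustrative sketch: the fs is derived lazily from this file's location."""

    def __init__(self, location: str):
        self.location = location
        self._fs = None

    @property
    def fs(self):
        if self._fs is None:
            scheme = urlparse(self.location).scheme or "file"
            self._fs = fsspec.filesystem(scheme)  # "s3" requires s3fs
        return self._fs

    def open(self):
        return self.fs.open(self.location, "rb")
```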

TiansuYu commented 2 months ago

@kevinjqliu I think resolving the fs at the file level should make the API cleaner. For example, if no file_system is given to FsspecInputFile (or similarly PyarrowInputFile), we could resolve it at the file level.

One benefit of setting the fs at the table level is that the fs instance can be reused for a performance boost. If we want to keep this, I would say we need two io configs, one for metadata and one for data, at the MetastoreCatalog or Catalog level.
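
If reuse matters, resolving per file does not have to mean constructing a new filesystem per file; a small cache keyed by scheme and bucket/host keeps one instance per target. Illustrative sketch only, and note that fsspec already caches filesystem instances internally:

```python
from functools import lru_cache
from urllib.parse import urlparse

import fsspec


@lru_cache(maxsize=None)
def _cached_fs(scheme: str, netloc: str):
    # netloc is part of the key so different buckets/hosts get separate entries
    return fsspec.filesystem(scheme)


def fs_for(path: str):
    parsed = urlparse(path)
    return _cached_fs(parsed.scheme or "file", parsed.netloc)
```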

kevinjqliu commented 2 months ago

My preference is resolving fs at the file level. It's more flexible and the performance difference should be negligible. Another reason is to be able to write data across clouds. Technically, I can write to multiple clouds, across AWS, Azure, and GCP.

TiansuYu commented 2 months ago

I will make a PR according to this:

My preference is resolving fs at the file level. It's more flexible and the performance difference should be negligible. Another reason is to be able to write data across clouds. Technically, I can write to multiple clouds, across AWS, Azure, and GCP.

TiansuYu commented 2 months ago

Also, reading here: https://arrow.apache.org/docs/python/filesystems.html#using-arrow-filesystems-with-fsspec

There might be an opportunity to simplify the split between the arrow and fsspec file systems.
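
The bridge those docs describe, roughly: wrap any fsspec filesystem so it can be passed wherever pyarrow expects a pyarrow.fs.FileSystem:

```python
import fsspec
from pyarrow.fs import FSSpecHandler, PyFileSystem

local = fsspec.filesystem("file")
arrow_compatible = PyFileSystem(FSSpecHandler(local))
print(arrow_compatible.get_file_info("/tmp"))  # a pyarrow FileInfo for the path
```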

kevinjqliu commented 2 months ago

yep! There are definitely opportunities to consolidate the two. I opened #310 with some details.

TiansuYu commented 2 months ago

Reading the table spec, I just realised that there is a location field in https://iceberg.apache.org/spec/#table-metadata-fields that specifies the base location of the table. Does Iceberg Java actually allow the split between metadata and data locations?

kevinjqliu commented 2 months ago

It's configurable via the write properties. See this comment: https://github.com/apache/iceberg-python/issues/1041#issuecomment-2323380629