Consider Using object_store as IO Abstraction

tustvold commented 7 months ago

I have debated filing this ticket for a while, but largely held off as I wasn't sure how well it would be received, especially as I am acutely aware that this crate currently makes use of OpenDAL and @Xuanwo is an active contributor to both repositories. However, I feel it is important to have these discussions, and part of my role as a maintainer of object_store is to engage with others in the community and hear about how its offering could be made more compelling.

That all being said, I think object_store provides some quite compelling functionality that might be of particular interest to this project:

First-party integration with arrow-rs, parquet, DataFusion and polars, including sophisticated vectored and streaming IO
Support for conditional writes, which would allow iceberg-rs to support multiple concurrent writers directly against object storage, without needing an external catalog
A flexible configuration system developed in partnership with, and used by both the polars and delta-rs communities
Extensive support for the various cloud provider credential sources, with extension points for users to further customise this
APIs that mirror those of object stores and not filesystems, which helps in understanding what IO is being performed and how, and allows support for object store specific functionality like tags, partial range requests, and more...
Battle-tested in multiple production systems, and with a substantial and growing user-base
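
To make the conditional-write point concrete, here is a minimal sketch of how a "create only if absent" primitive lets two writers race for the same table version without an external catalog. It is a toy under stated assumptions: `ToyStore`, `put_if_absent`, `try_commit` and the `metadata/v{N}.metadata.json` key layout are hypothetical names for illustration, not the object_store API, and a real store would enforce the condition atomically on the server side.

```rust
use std::collections::HashMap;

/// Error returned when a conditional create finds the key already present.
#[derive(Debug, PartialEq)]
pub struct AlreadyExists;

/// Toy in-memory stand-in for an object store (hypothetical, not a real API).
#[derive(Default)]
pub struct ToyStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ToyStore {
    /// Write `data` at `key` only if no object exists there yet.
    pub fn put_if_absent(&mut self, key: &str, data: &[u8]) -> Result<(), AlreadyExists> {
        if self.objects.contains_key(key) {
            return Err(AlreadyExists);
        }
        self.objects.insert(key.to_string(), data.to_vec());
        Ok(())
    }

    /// Read back the object stored at `key`, if any.
    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.objects.get(key).map(|v| v.as_slice())
    }
}

/// Try to commit table version `version` (hypothetical key layout).
/// A writer that loses the race gets `AlreadyExists` and must re-read
/// the latest state, then retry at the next version.
pub fn try_commit(store: &mut ToyStore, version: u64, metadata: &[u8]) -> Result<(), AlreadyExists> {
    let key = format!("metadata/v{version}.metadata.json");
    store.put_if_absent(&key, metadata)
}
```

With this shape, the first writer to commit version 1 succeeds, a second writer targeting version 1 is rejected, and it can then succeed at version 2.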

The major area where object_store is limited, somewhat intentionally, is the number of first-party implementations: it supports only S3-compatible stores, Google Cloud Storage, Azure Blob Storage, in-memory and local filesystems. However, the object-safe design does allow for third-party implementations, for things like HDFS.

I look forward to hearing your thoughts, but also fully understand if this is not a discussion you would like to engage with at this time.

alamb commented 7 months ago

cc @liurenjie1024

liurenjie1024 commented 7 months ago

Hi, @tustvold @alamb Thanks for this proposal and write up, object_store looks great to me!

In iceberg's design, all file IO is hidden behind the FileIO interface, and the backends, i.e. OpenDAL or object_store, are not directly exposed to users, so I think we can integrate it without any breaking changes.
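
A rough sketch of what "hidden behind an interface" can look like, assuming a simplified synchronous contract. The names `Backend`, `MemoryBackend` and `FileIo` are hypothetical and are not the actual iceberg-rust types; a real implementation would be async and considerably richer.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical backend contract; OpenDAL or object_store would sit behind it.
pub trait Backend {
    fn read(&self, path: &str) -> io::Result<Vec<u8>>;
    fn write(&mut self, path: &str, data: &[u8]) -> io::Result<()>;
}

/// In-memory backend standing in for any concrete storage library.
#[derive(Default)]
pub struct MemoryBackend {
    files: HashMap<String, Vec<u8>>,
}

impl Backend for MemoryBackend {
    fn read(&self, path: &str) -> io::Result<Vec<u8>> {
        self.files
            .get(path)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, path.to_string()))
    }

    fn write(&mut self, path: &str, data: &[u8]) -> io::Result<()> {
        self.files.insert(path.to_string(), data.to_vec());
        Ok(())
    }
}

/// Public facade: callers see only this type, never the backend behind it,
/// so the backend can be swapped without changing the public API.
pub struct FileIo {
    backend: Box<dyn Backend>,
}

impl FileIo {
    pub fn new(backend: Box<dyn Backend>) -> Self {
        FileIo { backend }
    }

    pub fn read(&self, path: &str) -> io::Result<Vec<u8>> {
        self.backend.read(path)
    }

    pub fn write(&mut self, path: &str, data: &[u8]) -> io::Result<()> {
        self.backend.write(path, data)
    }
}
```

Because the backend is a boxed trait object, replacing `MemoryBackend` with another implementation changes only the constructor call.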

Currently OpenDAL works well for us and we are focusing on implementing more features for iceberg-rust, so it may take a while for us to evaluate object_store and integrate it into this crate.

First-party integration with arrow-rs, parquet, DataFusion and polars, including sophisticated vectored and streaming IO

I'm quite interested in this since we are about to add support for file reader/writer, which will heavily depend on arrow-rs, parquet, etc, so I think object_store is quite promising.

liurenjie1024 commented 7 months ago

cc @Xuanwo

Xuanwo commented 7 months ago

Hi @tustvold, thank you for initiating this discussion! I will do my best to offer a multifaceted response, wearing different hats.

Put iceberg-rust developers hat on

As @liurenjie1024 mentioned, iceberg-rust features its own FileIO interface to abstract IO operations. OpenDAL and object_store are merely implementation details with no current plans for external exposure.

It's fine to integrate with object_store, as that is precisely what we created FileIO for. However, it's important to note that we are in the initial stages of this project: currently focusing on the first release and implementing read/write capabilities.

Here are some remarks regarding the object_store feature set:

A flexible configuration system developed in partnership with, and used by both the polars and delta-rs communities

iceberg-rust is aligned with Iceberg and PyIceberg, sharing the same configuration logic; therefore, the object_store's configuration system is redundant for our purposes.

Support for conditional writes, which would allow iceberg-rs to support multiple concurrent writers directly against object storage, without needing an external catalog

While the conditional put feature offers certain advantages, it may not be as crucial for our current use cases in iceberg-rust, where integration with a catalog like Hive or REST is more common.

As an iceberg-rust developer, I am eager to unlock more potential within the project.

Put OpenDAL maintainer hat on

Firstly, opendal and object_store are not competitors. (And remember, I'm also a contributor to object_store!) Rather than discussing replacements, I'd prefer to explore how we can coexist to offer our users more choices and possibilities.

I believe opendal integrates seamlessly with object_store, which is why our community created object_store_opendal, enabling users to utilize opendal as an implementation of object_store.

Here are a few reasons why OpenDAL is beneficial for iceberg-rust.

OpenDAL offers native support for OSS, B2, HDFS, and WebHDFS in addition to the existing S3, GCS, and AzBlob. All have passed identical behavior test suites, simplifying integration for users without fear of unexpected breakage.
OpenDAL offers a comprehensive API that supports range retrieval and conditional fetching through the robust read_with() function.
OpenDAL enables users to freely utilize its API. For instance, they can directly use Writer without needing to understand MultipartUpload.
OpenDAL features powerful layers such as retry, logging, tracing, metrics, prometheus, timeout, and more that can significantly reduce the workload typically associated with managing these aspects manually.
OpenDAL features object_store_opendal integration, enabling seamless connection to existing object_store-based systems.
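
To illustrate the layering idea from the list above, here is a minimal sketch of a retry layer written as a wrapper around a storage trait. This is not OpenDAL's layer API: `Store`, `RetryLayer` and `Flaky` are hypothetical names, and a production retry layer would normally add backoff and distinguish retryable from permanent errors.

```rust
use std::cell::Cell;

/// Hypothetical minimal read interface.
pub trait Store {
    fn get(&self, key: &str) -> Result<Vec<u8>, String>;
}

/// Layer that retries a failing inner store up to `max_attempts` times.
/// No backoff is applied in this sketch.
pub struct RetryLayer<S> {
    pub inner: S,
    pub max_attempts: u32,
}

impl<S: Store> Store for RetryLayer<S> {
    fn get(&self, key: &str) -> Result<Vec<u8>, String> {
        let mut last_err = String::from("max_attempts was zero");
        for _ in 0..self.max_attempts {
            match self.inner.get(key) {
                Ok(data) => return Ok(data),
                Err(e) => last_err = e,
            }
        }
        Err(last_err)
    }
}

/// Test double that fails a fixed number of times before succeeding.
pub struct Flaky {
    pub failures_left: Cell<u32>,
}

impl Store for Flaky {
    fn get(&self, _key: &str) -> Result<Vec<u8>, String> {
        let left = self.failures_left.get();
        if left > 0 {
            self.failures_left.set(left - 1);
            return Err("transient error".to_string());
        }
        Ok(b"payload".to_vec())
    }
}
```

Since `RetryLayer<S>` itself implements `Store`, layers of this kind can be stacked, and callers do not need to know which ones are present.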

I also found some places that OpenDAL can improve (Thanks @tustvold!):

As an OpenDAL maintainer, I believe OpenDAL offers features that could be beneficial for iceberg-rust, potentially simplifying some aspects of storage management. And I will be happy to collaborate with object_store to ensure the success of iceberg-rust.

tustvold commented 7 months ago

Thank you both for the responses.

In iceberg's design, all file IO is hidden behind the FileIO interface, and the backends, i.e. OpenDAL or object_store, are not directly exposed to users, so I think we can integrate it without any breaking changes.

Glad to hear efforts are being made to keep the IO primitives abstracted and pluggable 👍. I would just observe that FileIO appears to mirror filesystem APIs, and that this has historically been a pain point in systems that chose this path. For example, Spark has had a very hard time getting a performant S3 integration, with proper vectored IO only being added to OSS Spark very recently. By contrast, the object_store APIs mirror those of the actual stores, and are designed to work well with the APIs in arrow-rs, avoiding all the complexities of prefetching heuristics and similar.

discussing replacements

I entirely agree, I guess I was more suggesting that the IO abstraction mirror object_store as this is what both the upstream crates use and expect, and what the underlying stores provide. If people then wanted additional backend support they could plug OpenDAL into this interface?

I'm quite interested in this since we are about to add support for file reader/writer

I'd be happy to help out with this if you're open to contributions; both my employer and I are very interested in native iceberg support for the Rust ecosystem.

alamb commented 7 months ago

Thank you all -- this is a great conversation.

I entirely agree, I guess I was more suggesting that the IO abstraction mirror object_store as this is what both the upstream crates use and expect, and what the underlying stores provide. If people then wanted additional backend support they could plug OpenDAL into this interface?

I took a look at the FileIO interface that @liurenjie1024 and @Xuanwo pointed out. Ultimately it seems to provide something that implements AsyncRead and AsyncWrite.

While it is true that the AsyncRead and AsyncWrite interfaces (seek, random IO, etc) can be used in a way that would perform very poorly against remote object storage, I think that if users are judicious, provide sufficient hints, and buffer the reads, the performance difference will be negligible.

The "benefit" that one might get from using object_store is that its API is more opinionated and makes it very awkward to use poorly.

In my opinion, the use of OpenDAL to connect to storage systems other than object stores is pretty compelling.

Perhaps as you proceed with integrating iceberg-rust with arrow-rs/parquet/datafusion we will learn more about how these various systems can be integrated and whether any adjustments need to be made, either in OpenDAL or object_store or downstream in some other crates.

liurenjie1024 commented 7 months ago

Thanks everyone for this very nice discussion.

I'd be happy to help out with this if you're open to contributions; both my employer and I are very interested in native iceberg support for the Rust ecosystem.

Of course we are open to contributions from everyone; that's the key spirit of an open source project. Please note that this is an Apache project, and everyone is welcome to contribute.

As for the FileIO interface, it's inspired by Iceberg's Java/Python implementations. I have to admit that I don't have much experience working with object stores such as S3, and I don't know much about how they differ from file systems such as HDFS. I believe the whole Iceberg community welcomes ideas and designs as long as they are reasonable and benefit performance.

tustvold commented 7 months ago

I think that if users are judicious, provide sufficient hints, and buffer the reads, the performance difference will be negligible.

If primarily performing sequential IO I would tend to agree: the AsyncRead abstraction will be less efficient than a streaming request, but if pre-fetching is configured appropriately the end-to-end latency should be similar. However, it is with "random" IO, such as occurs when reading structured file formats like parquet, that this difference becomes more stark.

Fortunately the fix is extremely simple: add InputFile::get_ranges, which can be called by AsyncFileReader. This can then call through to vectorised IO primitives where supported.
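
As a rough illustration of why a multi-range call helps, here is a sketch of range coalescing: nearby byte ranges are merged so that fewer requests reach the store, and each caller-requested range is then sliced out of the merged fetches. The free functions `coalesce` and `get_ranges`, the `max_gap` parameter and the `fetch` closure are assumptions made for this example; this is not the InputFile contract nor the coalescing logic that object_store itself ships.

```rust
use std::ops::Range;

/// Merge ranges whose gap is at most `max_gap` bytes.
pub fn coalesce(ranges: &[Range<usize>], max_gap: usize) -> Vec<Range<usize>> {
    let mut sorted: Vec<Range<usize>> = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<usize>> = Vec::new();
    for r in sorted {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end.saturating_add(max_gap) {
                // Close enough to the previous range: extend it.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}

/// Serve every requested range from the coalesced fetches.
/// `fetch` stands in for one request to the store and must return
/// exactly the bytes of the range it is given.
pub fn get_ranges<F>(ranges: &[Range<usize>], max_gap: usize, mut fetch: F) -> Vec<Vec<u8>>
where
    F: FnMut(Range<usize>) -> Vec<u8>,
{
    let merged = coalesce(ranges, max_gap);
    let chunks: Vec<Vec<u8>> = merged.iter().map(|r| fetch(r.clone())).collect();
    ranges
        .iter()
        .map(|r| {
            // Find the merged range that fully contains this request.
            let idx = merged
                .iter()
                .position(|m| m.start <= r.start && r.end <= m.end)
                .expect("every requested range is covered by a merged range");
            let offset = r.start - merged[idx].start;
            chunks[idx][offset..offset + (r.end - r.start)].to_vec()
        })
        .collect()
}
```

For example, requesting `0..4`, `6..10` and `50..52` with `max_gap = 4` issues two fetches (`0..10` and `50..52`) instead of three, at the cost of reading two unneeded bytes.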

Of course we are open to contributions from everyone

In iceberg's design, all file IO is hidden behind the FileIO interface

Would you be open to a PR to allow using either OpenDAL or object_store, along with corresponding feature flags, or would you prefer to not complicate matters at this time? I think this could be achieved in a fairly unobtrusive manner.

Fokko commented 7 months ago

Thanks @tustvold for raising this and please don't hesitate to open an issue or PR.

For example Spark has had a very hard time getting a performant S3 integration, with proper vectored IO only being added to OSS Spark very recently (https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800).

This is why the Iceberg Java implementation ships with its own vectorized parquet reader :)

It looks to me that object_store and FileIO aim to solve the same problem. Iceberg is designed to work on object stores from the start, and not on filesystems. Similar to object_store, the FileIO concept is very opinionated. Since many people are still on HDFS, it is also supported, which works because filesystems offer stronger guarantees than object stores. If you want to learn more about the FileIO concept, this is a good primer on the concept.

tustvold commented 7 months ago

It looks to me that object_store and FileIO aim to solve the same problem

That's awesome, thank you for the link. That is exactly what object_store is, an opinionated abstraction that ensures workloads are not overly reliant on filesystem-specific APIs and behaviour. Really cool that the iceberg community chose to take this approach, I agree with it wholeheartedly :+1:

FWIW I notice that the InputFile contract is not vectorised itself, but I guess if you have a custom parquet reader you could lift the range coalescing into it.

liurenjie1024 commented 7 months ago

Would you be open to a PR to allow using either OpenDAL or object_store, along with corresponding feature flags, or would you prefer to not complicate matters at this time? I think this could be achieved in a fairly unobtrusive manner.

Hi @tustvold, you're welcome to open a PR for this.

About the timing, my suggestion is to wait a little. This crate currently has a finished REST catalog, serialization/deserialization of metadata, and basic file-based table scan planning. We expect to implement two things next: a parquet file writer that writes arrow record batches, and a reader that turns parquet files into an arrow record batch stream. These two features depend heavily on FileIO and would provide solid, concrete use cases for our new IO interface, so that we can better understand and discuss the benefits of these changes. What do you think?

apache / iceberg-rust

Consider Using object_store as IO Abstraction #172
