apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

DISCUSS: Should we integrate Apache OpenDAL support in Parquet? #5427

Open · Xuanwo opened this issue 8 months ago

Xuanwo commented 8 months ago

Hello community,

I'm interested in integrating Apache OpenDAL support into parquet (and potentially other crates) so that more users can benefit from first-class support. This integration would give users of both parquet and opendal a better experience.

Do you think it would be a valuable addition?

Background

OpenDAL is a unified data access layer designed to simplify interactions with various storage backends, ranging from AWS S3 to Google Drive and more. Integrating it into parquet would let opendal users work with parquet more easily and let parquet users reach more storage backends.
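For a rough idea of the API shape, here is a minimal sketch of reading an object through an OpenDAL `Operator`, assuming a 0.4x-era opendal where `read` returns `Vec<u8>`; builder details differ between versions and the bucket/path are placeholders:

```rust
use opendal::{services::S3, Operator};

async fn read_with_opendal() -> opendal::Result<Vec<u8>> {
    // Hypothetical bucket; builder methods vary between opendal versions.
    let mut builder = S3::default();
    builder.bucket("my-bucket");

    // One Operator fronts many services (S3, GCS, Google Drive, ...);
    // the same read/write calls work regardless of the backend.
    let op = Operator::new(builder)?.finish();
    op.read("data/example.parquet").await
}
```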

Although it's possible to use opendal::Reader as AsyncRead + AsyncSeek in the parquet ParquetRecordBatchStream, its performance isn't optimal compared to ParquetObjectReader, which directly utilizes the object_store API.
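For comparison, a rough sketch of the ParquetObjectReader path mentioned above, which drives ranged reads through the object_store API instead of emulating seeks over a byte stream. Exact signatures and required feature flags (parquet's async/object_store features, object_store's aws feature) differ between releases, and the bucket/path are placeholders:

```rust
use std::sync::Arc;

use futures::TryStreamExt;
use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};
use parquet::arrow::async_reader::{ParquetObjectReader, ParquetRecordBatchStreamBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder bucket/object; credentials are taken from the environment here.
    let store: Arc<dyn ObjectStore> =
        Arc::new(AmazonS3Builder::from_env().with_bucket_name("my-bucket").build()?);
    let path = Path::from("data/example.parquet");
    let meta = store.head(&path).await?;

    // ParquetObjectReader fetches exactly the byte ranges the decoder asks for.
    let reader = ParquetObjectReader::new(store, meta);
    let stream = ParquetRecordBatchStreamBuilder::new(reader).await?.build()?;
    let batches: Vec<_> = stream.try_collect().await?;
    println!("read {} record batches", batches.len());
    Ok(())
}
```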

Benefits

OpenDAL, a graduated Apache project with robust community support, boasts 22 committers and 182 contributors. This integration could potentially bring additional committers to the Arrow community.

This integration enhances the user experience for both parquet and opendal. Users of parquet gain easy access to additional storage services, while opendal users can seamlessly integrate with parquet.

Plan

For implementation

Thanks to the well-designed parquet and object_store APIs, we can:

For maintenance

Many opendal committers heavily utilize parquet. If this proposal is accepted, the opendal community (like me) will actively maintain the OpenDAL component, including its API, documentation, tests, and CI. Additionally, the opendal community will be responsible for managing bug reports and feature requests related to OpenDAL.

Alternatives

Why not through object_store_opendal?

While it's feasible for opendal to function as an object_store::ObjectStore via object_store_opendal, this approach requires extra effort from the user and limits optimization opportunities for OpenDAL, such as selecting read methods based on whether the service has native seek support.
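For reference, that route looks roughly like the sketch below, assuming the object_store_opendal crate's OpendalStore wrapper; the function name and bucket are illustrative only:

```rust
use std::sync::Arc;

use object_store::ObjectStore;
use object_store_opendal::OpendalStore;
use opendal::{services::S3, Operator};

fn opendal_as_object_store() -> opendal::Result<Arc<dyn ObjectStore>> {
    // Hypothetical bucket; builder details vary by opendal version.
    let mut builder = S3::default();
    builder.bucket("my-bucket");
    let op = Operator::new(builder)?.finish();

    // The wrapper exposes OpenDAL through the object_store::ObjectStore trait,
    // but the generic trait hides backend-specific capabilities such as native seek.
    Ok(Arc::new(OpendalStore::new(op)))
}
```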

Why not implement externally?

Although implementing this feature externally is possible, providing native support simplifies discovery and adoption for users.

For example, some users have had to implement their own parquet reading logic:

By contributing ParquetOpendalReader upstream, we can unite community efforts.

tustvold commented 8 months ago

I wonder if you've given thought to an opendal_parquet crate? This could then be maintained alongside opendal itself perhaps? This would allow development of the two to happen together in the same repository, something that has benefited the design of object_store and parquet, which have been explicitly designed to complement each other.

In general I'm relatively lukewarm about adding third-party dependencies where not strictly necessary

sundy-li commented 8 months ago

There are some features that OpenDAL may need to support in order to replace object_store:

https://github.com/apache/opendal/issues/3675

Xuanwo commented 8 months ago

replace object_store

Hi, this issue is not about replacing object_store.

Xuanwo commented 8 months ago

I wonder if you've given thought to an opendal_parquet crate?

Thanks for your suggestion! We do have an alternative plan to implement a parquet_opendal_io crate instead.


I initiated this discussion to understand upstream's perspective on this matter. I'm eager to contribute this upstream if the community thinks it's a good fit. However, I'm also fine with keeping it external if the community would rather not include it.

The most important thing is that I wanted the community to be the first to know.

tisonkun commented 8 months ago

I wonder if you've given thought to an opendal_parquet crate?

I suppose the related concerns are covered by the "Why not implement externally?" section. Do you have further questions @tustvold?

tustvold commented 8 months ago

I think I would prefer this to be implemented externally. The parquet crate is designed to work well with the IO abstractions of object_store, and introducing a second different IO abstraction, along with a much broader set of storage systems, will present us with conflicting design constraints.

Ultimately filesystem style abstractions don't work well for performing vectored IO against files in object storage, as is needed by a parquet reader. https://github.com/apache/opendal/issues/3675 is a manifestation of this. https://github.com/apache/arrow-rs/issues/1473 contains some of the background behind the design of the parquet IO abstractions and was informed by the experiences of the Spark ecosystem discussed in https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800.

By keeping object_store as the recommended first-party IO abstraction we can ensure users get the performance they expect out of the box.
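To make the vectored-IO point concrete, here is a simplified sketch of the shape of parquet's AsyncFileReader abstraction (signatures paraphrased and subject to change between releases): the reader is handed explicit byte ranges, which an object-store implementation can coalesce or fetch concurrently, whereas a seek-based stream has to serialize those accesses.

```rust
use std::{ops::Range, sync::Arc};

use bytes::Bytes;
use futures::future::BoxFuture;
use parquet::{errors::Result, file::metadata::ParquetMetaData};

/// Simplified sketch of the async IO abstraction the parquet reader is built on.
pub trait AsyncFileReader: Send {
    /// Fetch one byte range, e.g. a single ranged GET against object storage.
    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>>;

    /// Fetch many ranges at once (vectored IO); implementations can coalesce
    /// nearby ranges or issue the requests in parallel.
    fn get_byte_ranges(&mut self, ranges: Vec<Range<usize>>) -> BoxFuture<'_, Result<Vec<Bytes>>>;

    /// Fetch and decode the Parquet footer metadata.
    fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>>;
}
```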

alamb commented 8 months ago

In my opinion it is strange to have object_store support but not OpenDAL support in the parquet crate.

What do we think about moving the object_store support out of the parquet crate (and into an external parquet_objectstore for example)?

This would perhaps make the boundaries between crates clearer, ensuring that the core parquet reading functionality is decoupled from any particular IO API / implementation?

If we choose not to put the OpenDAL support in the parquet crate, I think we could at least add some links to it in the parquet documentation to help with discoverability.

tustvold commented 8 months ago

In my opinion it is strange to have object_store support but not OpenDAL support in the parquet crate.

What do we think about moving the object_store support out of the parquet crate (and into an external parquet_objectstore for example)?

As the first-party IO abstraction for the arrow Rust ecosystem, and the one that parquet was designed to interoperate with, IMO it would be strange not to include it. I would be strongly against such a change

alamb commented 8 months ago

As the first-party IO abstraction for the arrow Rust ecosystem, and the one that parquet was designed to interoperate with, IMO it would be strange not to include it.

I am not sure what is meant by "first-party IO abstraction" - just because object_store has the same maintainers doesn't mean it necessarily has to be bundled into the same crate.

The parquet crate existed prior to the object_store crate and I don't think there is any technical reason parquet needs to depend on it (e.g. it is an optional dependency).
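As a sketch of that point (feature names are real, version numbers indicative only), the object_store integration already sits behind an opt-in feature flag:

```toml
[dependencies]
# Pulling in the object_store-backed reader requires opting into these features.
parquet = { version = "50", features = ["async", "arrow", "object_store"] }

# Without the "object_store" feature, parquet builds with no object_store dependency:
# parquet = { version = "50", default-features = false, features = ["arrow"] }
```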

I would be strongly against such a change

What is your concern?

Potential concerns I can think of are:

  1. Software release management overhead (new crates to make/manage/release)?
  2. Development overhead (e.g. if changes are required to APIs in both crates simultaneously)?
  3. API deviation (some APIs in parquet may be changed in such a way as to not work as well with object_store)?
  4. Something else?

tustvold commented 8 months ago

What is your concern?

All three of the above, but mainly that encouraging people to use object_store ensures they get the performance and behaviour they expect, with the common maintainer base allowing for efficient triage when people inevitably run into issues. It is already a significant undertaking supporting what we do currently, and I'm less than enthusiastic about adding new storage backends and IO abstractions with very different behaviours, performance characteristics and feature sets.

alamb commented 8 months ago

but mainly that encouraging people to use object_store ensures they get the performance and behaviour they expect, with the common maintainer base allowing for efficient triage when people inevitably run into issues.

I am not sure that everyone has the same performance expectations or that object_store is the best interface for all uses of reading parquet.

I would hope that @Xuanwo and the rest of the OpenDAL team would be the ones to triage issues related to reading/writing parquet using OpenDAL, and thus I do think having it in a separate crate would be good.

It is already a significant undertaking supporting what we do currently, and I'm less than enthusiastic about adding new storage backends and IO abstractions with very different behaviours and performance characteristics.

Right, this is why I was thinking it would actually help maintenance if we separated the IO parts (object_store) from the parquet encoding/decoding parts. That way it would be clear where the responsibilities lie.

tustvold commented 8 months ago

Right, this is why I was thinking it would actually help maintenance if we separated the IO parts (object_store) from the parquet encoding/decoding parts. That way it would be clear where the responsibilities lie.

I think it is important that a parquet implementation provides first-party support for object stores, and this is ultimately the use-case object_store was developed for. Much like arrow-cpp has ArrowFilesystem, and Hadoop has HadoopFilesystem, almost all data analytics ecosystems provide some first-party IO abstraction they're designed to work well with.

That's not to say we shouldn't provide extension points for people to do different things, but it makes sense, at least to me, that a parquet crate provides a first-party way to read parquet data stored in object storage given how common this pattern is.