Open Xuanwo opened 8 months ago
I wonder if you've given thought to an opendal_parquet crate? This could then be maintained alongside opendal itself perhaps? This would allow development of the two to happen together in the same repository, something that has benefited the design of object_store and parquet, which have been explicitly designed to complement each other.
In general I'm relatively lukewarm about adding third-party dependencies where not strictly necessary
There are some features that OpenDAL may need to support to replace object_store
Hi, this issue is not for discussing replacing object_store.
I wonder if you've given thought to an opendal_parquet crate?
Thanks for your suggestion! We do have an alternative plan to implement a parquet_opendal_io instead.
I initiated this discussion to understand upstream's perspective on this matter. I'm eager to contribute this upstream if the community thinks it's a good fit. However, I'm also okay with staying out if the community would rather not include it.
The most important thing is that I wanted the community to be the first to know.
I wonder if you've given thought to an opendal_parquet crate?
I suppose the related concerns are covered by the "Why not implement externally?" section. Do you have further questions @tustvold?
I think I would prefer this to be implemented externally. The parquet crate is designed to work well with the IO abstractions of object_store, and introducing a second different IO abstraction, along with a much broader set of storage systems, will present us with conflicting design constraints.
Ultimately filesystem style abstractions don't work well for performing vectored IO against files in object storage, as is needed by a parquet reader. https://github.com/apache/opendal/issues/3675 is a manifestation of this. https://github.com/apache/arrow-rs/issues/1473 contains some of the background behind the design of the parquet IO abstractions and was informed by the experiences of the Spark ecosystem discussed in https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800.
By keeping object_store as the recommended first-party IO abstraction we can ensure users get the performance they expect out of the box.
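To make the contrast concrete, here is a minimal, self-contained Rust sketch of the two access patterns. It does not use the actual parquet or object_store APIs; the trait and function names are illustrative. A vectored interface lets the reader hand over every byte range it will need, so nearby column chunks can be coalesced into fewer ranged GETs, whereas a seek-based interface forces one round trip per read.

```rust
use std::ops::Range;

/// Illustrative seek-style interface (roughly what AsyncRead + AsyncSeek
/// amounts to): the source sees one read at a time and cannot plan ahead.
#[allow(dead_code)]
trait SeekRead {
    fn seek_read(&mut self, offset: u64, len: u64) -> Vec<u8>;
}

/// Illustrative vectored interface (the shape a parquet reader wants): the
/// caller supplies every range it will need, e.g. all projected column
/// chunks, so the implementation can coalesce and parallelise ranged GETs.
#[allow(dead_code)]
trait RangedRead {
    fn read_ranges(&mut self, ranges: &[Range<u64>]) -> Vec<Vec<u8>>;
}

/// Coalesce ranges whose gap is below `max_gap`, so several column chunks
/// can be served by a single ranged request. Real implementations also bound
/// request sizes and issue the requests concurrently.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            Some(last) if r.start <= last.end + max_gap => last.end = last.end.max(r.end),
            _ => out.push(r),
        }
    }
    out
}

fn main() {
    // Byte ranges of three column chunks in one row group.
    let chunks = vec![0..4_096, 4_200..10_000, 1_000_000..1_100_000];
    // A seek-based reader issues three sequential round trips; a vectored
    // reader can collapse the first two into a single ranged GET.
    println!("{:?}", coalesce(chunks, 1_024));
}
```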
In my opinion it is strange to have object_store support but not OpenDAL support in the parquet crate.
What do we think about moving the object_store support out of the parquet crate (and into an external parquet_objectstore, for example)?
This would make the boundaries between crates more clear perhaps, ensuring that the core parquet reading functionality was decoupled from any particular IO API / implementation?
If we chose not to put the OpenDAL support in the parquet crate, I think we could at least add some links to it in the parquet documentation to help with discoverability
In my opinion it is strange to have object_store support but not OpenDAL support in the parquet crate.
What do we think about moving the object_store support out of the parquet crate (and into an external parquet_objectstore for example)?
As the first-party IO abstraction for the arrow Rust ecosystem, and the one that parquet was designed to interoperate with, IMO it would be strange not to include it. I would be strongly against such a change
As the first-party IO abstraction for the arrow Rust ecosystem, and the one that parquet was designed to interoperate with, IMO it would be strange not to include it.
I am not sure what is meant by "first-party IO abstraction" - just because object_store has the same maintainers doesn't mean it necessarily has to be bundled into the same crate.
The parquet crate existed prior to the object_store crate and I don't think there is any technical reason parquet needs to depend on it (e.g. it is an optional dependency).
I would be strongly against such a change
What is your concern?
Potential concerns I can think of are:
parquet may be changed in such a way as to not work as well with object_store
What is your concern?
All three of the above, but mainly that encouraging people to use object_store ensures they get the performance and behaviour they expect, with the common maintainer base allowing for efficient triage when people inevitably run into issues. It is already a significant undertaking supporting what we do currently, I'm less than enthusiastic about adding new storage backends and IO abstractions with very different behaviours, performance characteristics and feature sets.
but mainly that encouraging people to use object_store ensures they get the performance and behaviour they expect, with the common maintainer base allowing for efficient triage when people inevitably run into issues.
I am not sure that everyone has the same performance expectations or that object_store is the best interface for all uses of reading parquet.
I would expect that @Xuanwo and the rest of the OpenDAL team would be the ones to triage issues related to reading/writing parquet using OpenDAL, and thus I do think having it in a separate crate would be good.
It is already a significant undertaking supporting what we do currently, I'm less than enthusiastic about adding new storage backends and IO abstractions with very different behaviours and performance characteristics.
Right, this is why I was thinking it would actually help maintenance if we separated the IO parts (object_store) from the parquet encoding/decoding parts. That way it would be clear where the responsibilities lay
Right, this is why I was thinking it would actually help maintenance if we separated the IO parts (object_store) from the parquet encoding/decoding parts. That way it would be clear where the responsibilities lay
I think it is important that a parquet implementation provides first party support for object stores, and this is ultimately the use-case object_store was developed for. Much like arrow-cpp has ArrowFilesystem, and Hadoop has HadoopFilesystem, almost all data analytics ecosystems provide some first-party IO abstraction they're designed to work well with.
That's not to say we shouldn't provide extension points for people to do different things, but it makes sense, at least to me, that a parquet crate provides a first-party way to read parquet data stored in object storage given how common this pattern is.
Hello community,
I'm interested in integrating Apache OpenDAL support into parquet (and potentially other crates) to enable more users to benefit from this first-class support. This integration allows users of both parquet and opendal to enjoy a better experience. Do you think it would be a valuable addition?
Background
OpenDAL is a unified data access layer designed to simplify interactions with various storage backends, ranging from AWS S3 to Google Drive and more. Its integration into parquet could allow opendal users to use parquet more easily and allow parquet users to access more storage backends.
Although it's possible to use opendal::Reader as AsyncRead + AsyncSeek in the parquet ParquetRecordBatchStream, its performance isn't optimal compared to ParquetObjectReader, which directly utilizes the object_store API.
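For intuition about where that gap comes from, below is a minimal blocking sketch (standard library only; it is not opendal or parquet code) of what any generic Read + Seek or AsyncRead + AsyncSeek adapter has to do to serve the ranged requests a parquet reader issues: one seek plus one read per range, with no opportunity to coalesce or parallelise them, which against object storage means one round trip per column chunk.

```rust
use std::io::{Cursor, Read, Result, Seek, SeekFrom};
use std::ops::Range;

/// Serve ranged requests through a generic seekable source: one seek + read
/// per range. This mirrors, in blocking form, what an AsyncRead + AsyncSeek
/// adapter must do, and is why it underperforms a reader that issues ranged
/// GETs directly (as ParquetObjectReader does via object_store).
fn read_ranges_via_seek<R: Read + Seek>(src: &mut R, ranges: &[Range<u64>]) -> Result<Vec<Vec<u8>>> {
    let mut out = Vec::with_capacity(ranges.len());
    for r in ranges {
        src.seek(SeekFrom::Start(r.start))?; // one round trip per range
        let mut buf = vec![0u8; (r.end - r.start) as usize];
        src.read_exact(&mut buf)?;
        out.push(buf);
    }
    Ok(out)
}

fn main() -> Result<()> {
    // Stand-in for a remote parquet file; three projected column chunks.
    let mut file = Cursor::new(vec![0u8; 1 << 20]);
    let chunks = read_ranges_via_seek(&mut file, &[0..4_096, 4_200..10_000, 900_000..1_000_000])?;
    println!("fetched {} chunks sequentially", chunks.len());
    Ok(())
}
```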
Benefits
OpenDAL, a graduated Apache project with robust community support, boasts 22 committers and 182 contributors. This integration could potentially bring additional committers to the Arrow community.
This integration enhances the user experience for both parquet and opendal. Users of parquet gain easy access to additional storage services, while opendal users can seamlessly integrate with parquet.
Plan
For implementation
Thanks to the well-designed parquet & object_store APIs, we can:
- Add an opendal integration, similar to how we handle object_store.
- Implement a ParquetOpendalReader analogous to ParquetObjectReader (a rough sketch follows below). Other structures could adopt the same design pattern.
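The following is only a rough, self-contained sketch of the shape such a reader could take. The RangedFileReader trait is a simplified synchronous stand-in for parquet's async reader interface, and OpendalSource stands in for an opendal Operator plus object path; names and signatures here are illustrative assumptions, not actual crate APIs.

```rust
use std::ops::Range;

/// Simplified, synchronous stand-in for the range-based reader interface a
/// parquet record-batch stream consumes (the real one is async).
trait RangedFileReader {
    fn get_bytes(&mut self, range: Range<u64>) -> Vec<u8>;
    fn get_byte_ranges(&mut self, ranges: Vec<Range<u64>>) -> Vec<Vec<u8>> {
        ranges.into_iter().map(|r| self.get_bytes(r)).collect()
    }
}

/// Stand-in for "an opendal Operator + object path"; here it just holds the
/// bytes in memory so the example runs on its own.
struct OpendalSource {
    data: Vec<u8>,
}

/// Proposed reader: each requested range would become a single ranged read
/// through opendal, and get_byte_ranges could coalesce nearby ranges and run
/// the reads concurrently, mirroring what ParquetObjectReader does today.
struct ParquetOpendalReader {
    source: OpendalSource,
}

impl RangedFileReader for ParquetOpendalReader {
    fn get_bytes(&mut self, range: Range<u64>) -> Vec<u8> {
        // Real implementation: one ranged read against the backing service.
        self.source.data[range.start as usize..range.end as usize].to_vec()
    }
}

fn main() {
    let mut reader = ParquetOpendalReader {
        source: OpendalSource { data: vec![7u8; 64] },
    };
    // e.g. footer + one column chunk
    let parts = reader.get_byte_ranges(vec![56..64, 0..16]);
    println!("{} ranges fetched", parts.len());
}
```

A real implementation would of course be async and plug into the parquet record batch stream the same way ParquetObjectReader does today.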
For maintenance
Many opendal committers heavily utilize parquet. If this proposal is accepted, the opendal community (like me) will actively maintain the OpenDAL component, including its API, documentation, tests, and CI. Additionally, the opendal community will be responsible for managing bug reports and feature requests related to OpenDAL.
Alternatives
Why not through object_store_opendal?
While it's feasible for opendal to function as an object_store::ObjectStore via object_store_opendal, this approach requires extra effort from the user and limits optimization opportunities for OpenDAL, such as selecting read methods based on whether the service has native seek support.
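To illustrate the kind of backend-aware decision meant here, the sketch below shows a capability-based choice of read strategy that a native integration could make but a generic adapter layer hides behind a uniform interface. The capability flag and strategy names are purely illustrative, not opendal or object_store API.

```rust
/// Illustrative read strategies a backend-aware reader could choose between.
#[derive(Debug)]
enum ReadStrategy {
    /// Backend handles random access well: fetch each byte range lazily.
    RangedReads,
    /// Backend only streams whole objects efficiently: download once,
    /// then slice the buffered bytes.
    BufferWholeObject,
}

/// Illustrative capability info; a native integration can see this,
/// a generic adapter generally cannot.
struct BackendInfo {
    native_random_access: bool,
}

fn pick_strategy(info: &BackendInfo, object_len: u64, small_object_limit: u64) -> ReadStrategy {
    // Small objects are cheaper to fetch whole regardless of capabilities;
    // otherwise prefer ranged reads when the service supports them natively.
    if object_len <= small_object_limit || !info.native_random_access {
        ReadStrategy::BufferWholeObject
    } else {
        ReadStrategy::RangedReads
    }
}

fn main() {
    let s3_like = BackendInfo { native_random_access: true };
    let streaming_only = BackendInfo { native_random_access: false };
    println!("{:?}", pick_strategy(&s3_like, 10 << 20, 1 << 20));
    println!("{:?}", pick_strategy(&streaming_only, 10 << 20, 1 << 20));
}
```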
Why not implement externally?
Although implementing this feature externally is possible, providing native support simplifies discovery and adoption for users.
For example, some users have had to implement their own parquet reading logic:
By contributing ParquetOpendalReader upstream, we can unite community efforts.