apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Discussion: Switch DataFusion to using arrow2? #1532

Closed — alamb closed this issue 1 year ago

alamb commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Datafusion currently relies on the https://github.com/apache/arrow-rs implementation of Apache Arrow. This also means any project that is built on DataFusion is likely to end up using that implementation as well

There has been various talk, discussion, and work on switching to arrow2 (https://github.com/jorgecarleitao/arrow2) from @jorgecarleitao

Describe the solution you'd like A consensus on whether we want to switch datafusion to using arrow2

Additional context

alamb commented 2 years ago

I believe the current proposal is to make an official arrow branch in datafusion: https://github.com/apache/arrow-datafusion/pull/68#issuecomment-1007955176, which is probably a step towards switching to arrow2

thinkharderdev commented 2 years ago

Will arrow-rs eventually support async file IO? Requiring a synchronous ChunkReader is currently a major limitation in supporting alternate ObjectStores
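To make the limitation concrete, here is a rough, std-only illustration of the kind of synchronous reader abstraction being discussed. The trait and types below are hypothetical simplifications, not the actual arrow-rs API: a blocking `get_read` works fine for local files or in-memory data, but forces remote object stores to expose synchronous byte-range reads.

```rust
use std::io::{Cursor, Read};

// Hypothetical simplification of a synchronous chunk-reader abstraction.
// A blocking `get_read` is trivial for in-memory data, but awkward for
// remote object stores, where a range read is naturally async.
trait ChunkReader {
    type Reader: Read;
    fn get_read(&self, start: u64, length: usize) -> std::io::Result<Self::Reader>;
}

// In-memory implementation: trivially synchronous.
struct InMemory(Vec<u8>);

impl ChunkReader for InMemory {
    type Reader = Cursor<Vec<u8>>;
    fn get_read(&self, start: u64, length: usize) -> std::io::Result<Cursor<Vec<u8>>> {
        let start = start as usize;
        Ok(Cursor::new(self.0[start..start + length].to_vec()))
    }
}

fn main() {
    let store = InMemory(b"parquet-footer-bytes".to_vec());
    let mut chunk = store.get_read(8, 6).unwrap();
    let mut buf = String::new();
    chunk.read_to_string(&mut buf).unwrap();
    println!("{buf}"); // prints "footer"
}
```

An S3-backed implementation of this trait would have to block a thread on each range read, which is the pain point raised here.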

hntd187 commented 2 years ago

I guess, what are the reasons switching would be a bad idea? Like what is the delta between what they both currently provide?

houqp commented 2 years ago

Thank you @alamb for bringing this up!

I believe the current proposal is to make an official arrow branch in datafusion: #68 (comment), which is probably a step towards switching to arrow2

Yes, this aligns with what I have in mind. The official arrow2 branch was proposed so we can close that long running PR and have a centralized location for folks to collaborate on the migration until we are happy with the master merge. If the community is happy with merging directly into master and iterate there, that would work as well.

Will arrow-rs eventually support async file IO? Requiring a synchronous ChunkReader is currently a major limitation in supporting alternate ObjectStores

I believe so. However, we could probably save this work and get it for free with the arrow2 switch.

I guess, what are the reasons switching would be a bad idea? Like what is the delta between what they both currently provide?

IMHO, the main downside is the switch cost and downstream impact. But I think it's a one time cost that's worth paying. I think arrow2 at this point should have covered most of all our needs in datafusion as demonstrated in https://github.com/apache/arrow-datafusion/pull/68. All unit and integration tests are passing at the moment.

alamb commented 2 years ago

Will arrow-rs eventually support async file IO? Requiring a synchronous ChunkReader is currently a major limitation in supporting alternate ObjectStores

I think someone just needs to put in the effort to make the current arrow-rs implementation async -- I don't know of any reason it can't be done, nor do I know of anyone who plans to do so yet.

at this point should have covered most of all our needs in datafusion as demonstrated in #68. All unit and integration tests are passing at the moment.

If this is really true (I haven't checked) it certainly sounds compelling

I like the idea of making an official "arrow2" branch in DataFusion, getting some more 👀 on it, and then propose it as a PR to merge to datafusion master.

@houqp can you make a PR? Would you like me to? @yjshen ?

houqp commented 2 years ago

@houqp can you make a PR? Would you like me to? @yjshen ?

For sure, I can help create that PR :+1:

hntd187 commented 2 years ago

I guess, what are the reasons switching would be a bad idea? Like what is the delta between what they both currently provide?

IMHO, the main downside is the switch cost and downstream impact. But I think it's a one time cost that's worth paying. I think arrow2 at this point should have covered most of all our needs in datafusion as demonstrated in #68. All unit and integration tests are passing at the moment.

That's good enough for me

Igosuki commented 2 years ago

Btw the latest arrow2 commit that still has RecordBatch is https://github.com/jorgecarleitao/arrow2/commit/ef7937dfe56033c2cc491482c67587b52cd91554; it would probably be good to use that as a starting point for a temporary arrow2 fork? That would allow me to integrate the necessary patches for some features such as decimal, without having to switch to having RecordBatch in datafusion.

jorgecarleitao commented 2 years ago

Thank you for considering using arrow2, very excited about this!

To provide some selling points, the primary goals of the repo have been:

atm it is the fastest implementation of Apache Parquet IO and Apache Avro IO that I can find (both read and write), both supporting sync and async executions and implemented in safe Rust (all IO in the crate is unsafe-free).

The crate is under active development, both in volume (~800 commits in a year), and also exploring different ideas, such as

(which is a major reason why it is 0.X, to allow space to try things out)

The crate has been adopted by Polars, databend, grafana's SDK for Rust and is interoperable with connectorx.

Releases have been happening about once a month (breaking), and on demand for bug fixes. The next is planned for end of this week.

I hope this offers a general idea of what is the crate and where it is heading.

tustvold commented 2 years ago

Will arrow-rs eventually support async file IO? Requiring a synchronous ChunkReader is currently a major limitation in supporting alternate ObjectStores

FWIW it would be relatively straightforward to support async IO within the context of arrow-rs. You need buffered fetching in order to get reasonable IO performance anyway, and so you just do an async fetch into a buffer and then use the sync decoders to decode it. I believe this is what arrow2 is doing anyway?? I quickly cobbled something together showing how this can be done with parquet here.
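The "async fetch into a buffer, then sync decode" pattern described above can be sketched with std-only stand-ins. The helper names below are hypothetical: `fetch_range` stands in for an async ranged GET against an object store (in real code it would be `.await`ed), and `decode_sync` stands in for the synchronous parquet page decoder, here reduced to plain little-endian integers. The point is that once bytes are buffered, decoding is pure CPU work on a slice and needs no async plumbing.

```rust
// Hypothetical stand-in for an async ranged GET (S3/GCS/...); the real call
// would be awaited, but the result is the same: bytes in memory.
fn fetch_range(store: &[u8], start: usize, len: usize) -> Vec<u8> {
    store[start..start + len].to_vec()
}

// Hypothetical stand-in for a synchronous decoder: once the buffer is in
// memory, decoding is pure CPU work and needs no async at all.
fn decode_sync(buf: &[u8]) -> Vec<u32> {
    buf.chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // Pretend "object store" holding three little-endian encoded values.
    let store: Vec<u8> = [1u32, 2, 3].iter().flat_map(|v| v.to_le_bytes()).collect();
    let buf = fetch_range(&store, 0, store.len());
    let values = decode_sync(&buf);
    assert_eq!(values, vec![1, 2, 3]);
}
```

This separation is why buffered fetching makes a sync decoder usable behind an async IO layer without rewriting the decoder itself.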

FWIW I have some optimisations to the arrow-rs parquet reader in flight that yield some pretty significant speedups https://github.com/apache/arrow-rs/pull/1054, https://github.com/apache/arrow-rs/pull/1082. And I am planning to work on dictionary preservation next which should yield orders of magnitude speedups for string dictionaries.

I would personally prefer an approach that sees the great work on arrow2 cherry-picked into arrow-rs, with arrow2 serving as an incubator for new ideas. I am happy to help out with this if there are things people would particularly like to see ported across? The current ecosystem fragmentation is just unfortunate for both users and contributors imo...

alamb commented 2 years ago

I am happy to help out with this if there are things people would particularly like to see ported across?

I have heard lots of excitement about async IO (for parquet, csv, avro, etc) and the performance of those readers.

alamb commented 2 years ago

So my summary of this ticket so far is that the next step is to get a PR up to datafusion with the most up to date code to get arrow2 working. In parallel, I will plan to start some discussions (hopefully later in the week) on the apache arrow dev list about potential ways to get arrow2 unified with arrow-rs

Looking forward to seeing a PR so we can assess how close/far we are from this goal.

Igosuki commented 2 years ago

My branch got merged into the fork so now we only need to address a few remaining issues that break tests.

houqp commented 2 years ago

I would personally prefer an approach that sees the great work on arrow2 cherry-picked into arrow-rs, with arrow2 serving as an incubator for new ideas. I am happy to help out with this if there are things people would particularly like to see ported across?

For me personally, on top of the highly optimized parquet, avro, and json IO modules, I really like its transmute-free design and the mutable array abstraction. The latter is the main reason why delta-rs is also in the process of migrating to arrow2.

From previous discussions in the arrow dev list, I believe Jorge tried applying his arrow2 learnings back to arrow-rs last year, but decided that it's not worth the effort because it would require basically rewriting the majority of the code base. My main concern with cherry-picking arrow2 designs into arrow-rs is that we are spending all these efforts into making arrow-rs as good as arrow2 while on the other hand we could have spent the same amount of efforts into making arrow2 even better, which will not only benefit datafusion, but a much larger community including other projects that are currently using arrow2.

IMHO, there is value in forking an open-source repo when fundamental design tradeoffs diverge. But from what I have seen so far, both arrow2 and arrow-rs contributors are pretty aligned on the direction of where an ideal arrow rust implementation should go?

The current ecosystem fragmentation is just unfortunate for both users and contributors imo...

I agree 100%. That's why I think it would be good if we can come up with a way to avoid cherry-picking commits from arrow2 into arrow-rs. Perhaps we can have arrow-rs build on top of arrow2 so they still share the majority of the code base? For example, arrow-rs could focus on providing a higher level and stable API for consumers while using arrow2 as the core. That way from contributors' point of view, it will be clear where they should send their patches to depending on which layer they work on.
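The layered architecture proposed above (arrow-rs as a stable facade over arrow2 as the core) can be sketched in miniature. The module and type names below are purely illustrative, not real crate APIs: consumers depend only on the stable `facade` layer, while the fast-moving `core2` layer is free to evolve underneath it.

```rust
mod core2 {
    // Fast-moving core layer (standing in for arrow2): strongly-typed arrays.
    pub struct Int32Array(pub Vec<i32>);

    impl Int32Array {
        pub fn sum(&self) -> i64 {
            self.0.iter().map(|&v| v as i64).sum()
        }
    }
}

mod facade {
    // Stable API layer (standing in for arrow-rs): thin wrappers over the
    // core, so consumer code never touches core2 types directly.
    pub struct ArrayRef(pub(crate) super::core2::Int32Array);

    pub fn from_vec(v: Vec<i32>) -> ArrayRef {
        ArrayRef(super::core2::Int32Array(v))
    }

    pub fn sum(a: &ArrayRef) -> i64 {
        a.0.sum()
    }
}

fn main() {
    // Consumers program against the facade; core internals can change freely.
    let arr = facade::from_vec(vec![1, 2, 3]);
    assert_eq!(facade::sum(&arr), 6);
}
```

Under this split, contributors would send low-level patches to the core and API-stability work to the facade, which is the contributor-routing clarity the comment argues for.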

houqp commented 2 years ago

As for the datafusion arrow2 branch, the PR is now available for review at https://github.com/apache/arrow-datafusion/pull/1556. I encourage everyone to take a look.

tustvold commented 2 years ago

That's why I think it would be good if we can come up with a way to avoid cherry-picking commits from arrow2 into arrow-rs

Sorry, I meant more cherry-picking ideas, not actual implementation. As in you might copy across arrow2's Buffer implementation, add a conversion to arrow-rs's Buffer implementation and then migrate the array implementations across one-by-one. Or do something similar for MutableBuffer. Ultimately the in-memory format is the same arrow spec, just getting wrapped up in different ways - the whole point of arrow is conversion between the two representations should be cheap...

I guess I've just had bad past experiences of simultaneously changing all the things at once :laughing:. Having looked at the arrow2 parquet implementation, as it is the part of the arrow-rs codebase I'm most familiar with, there is a fair amount of non-trivial functionality loss compared to arrow-rs. Some of it is esoteric things like nested structures, but there are also larger omissions like certain page encodings or batch size control [1] (it appears to read entire row groups into a single RecordBatch?).

This is therefore unlikely to be a strictly additive change, and I'm having a very hard time getting my head around all of its implications. That's all I really care about, that we can communicate something more than "everything may or may not be broken" :laughing:

1. FWIW this is the thing that makes reading parquet tricky, as pages don't delimit rows across columns or even semantic records within a column. If you just read row groups, it will be simple and fast, but recommendations are for row groups on the order of 1GB compressed, so the memory footprint of such an approach is unfortunate :sweat_smile:
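The batch-size concern above can be illustrated with a std-only sketch: iterate a (simulated) decoded row group in fixed-size batches instead of materializing it as one giant RecordBatch, so peak per-batch memory is bounded by the batch size rather than the row group size. All names here are hypothetical stand-ins, not the actual reader APIs of either crate.

```rust
// Stand-in for a decoded column chunk of a parquet row group.
struct RowGroup {
    values: Vec<i64>,
}

// Yields the row group's values in fixed-size batches, so a caller never
// holds more than `batch_size` rows of this column at once.
struct BatchIter<'a> {
    data: &'a [i64],
    batch_size: usize,
}

impl<'a> Iterator for BatchIter<'a> {
    type Item = &'a [i64];

    fn next(&mut self) -> Option<&'a [i64]> {
        if self.data.is_empty() {
            return None;
        }
        let n = self.batch_size.min(self.data.len());
        let (head, tail) = self.data.split_at(n);
        self.data = tail;
        Some(head)
    }
}

fn main() {
    let rg = RowGroup { values: (0..10).collect() };
    let batches: Vec<usize> = BatchIter { data: &rg.values, batch_size: 4 }
        .map(|b| b.len())
        .collect();
    // 10 rows with batch_size 4 → batches of 4, 4, 2 rows.
    assert_eq!(batches, vec![4, 4, 2]);
}
```

The tricky part the footnote points at is upstream of this: because parquet pages don't align rows across columns, producing aligned fixed-size batches requires tracking per-column decode positions, not just slicing a fully decoded chunk as done here.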

jorgecarleitao commented 2 years ago

I would like to thank all of you who have been working on the PR, and also all of those who have already given it a spin. Incredibly humbled and thankful for it.

@tustvold, thanks a lot for raising these concerns, much appreciated and valid.

I agree with you that batch control is useful to avoid a very large memory footprint. I have added it as an issue on arrow2. Let me know if it captures your main point.

wrt the encoders, I have been challenged to find parquet writers that can write such encodings, so that we have something to integration-test against when implementing them. I would be happy to add them to our CI and prove correctness and completeness of the implementation (and fix what is missing) - the process in arrow2 wrt formats has been that we need at least one official implementation or two non-official implementations to confirm the correctness of arrow2's implementation.

Since a comparison was made, I think that we could be fair and enumerate the disadvantages and advantages of each. For example, arrow-rs/parquet currently supports deeply nested parquet types, while arrow2 does not. Datafusion does not use them much, but there are important use-cases where they appear. Arrow-rs has a larger user-base by crate downloads and it is an official implementation. Arrow-rs also has pyarrow support out of the box (for those using pyo3), while arrow2 does not.

OTOH, arrow2 implements the complete arrow specification (under the process mentioned above), has async support to all its IO except arrow stream read, all its MIRI tests pass, its IO reading except parquet is panic free, actively implements the suggestions from Rust security WG and portable simd WG, its IO is forbid(unsafe_code), it has faster compute kernels, and its internals are simpler to understand and use, leveraging strong typed data structures.

Now, under the argument that it is the same in-memory format after all and what matters is feature completeness, I could argue that we should then create a thin FFI wrapper for the C++ implementation in Rust, abandon the Rust implementations altogether, and all contribute to the official C++.

Which brings me to my main point: imo this is not about which implementation has the fastest parquet reader or writer, it is about which code base has the most solid foundations for all of us to develop the next generation of columnar-based query engines, web applications leveraging arrow's forte, cool web-based streaming apps leveraging Arrow and Tokio, distributed query engines on AWS lambdas, etc., on a programming paradigm centred around correctness, easiness of use, and performance.

The fact that datafusion never passed MIRI checks, and that it has been like this since its inception, shows that these are not simple issues to fix, nor that arrow's internals are sufficiently appealing for the community to fix them (at least to the unpaid ones like myself).

With that said, is there a conclusion that the root cause for the high memory usage results from not batching parquet column chunks into smaller arrays, or is it a hypothesis that we need to test? Is that the primary concern here, and is it sufficiently important to block adoption? If yes, I would gladly work towards addressing it upstream.

tustvold commented 2 years ago

I believe Andrew intends to start a separate discussion about how to unify development effort around arrow2 and arrow-rs and this particular discussion is probably better had there. Apologies for derailing this thread, I appreciate that not everyone may share the perspective that they are the same issue.

alamb commented 2 years ago

I have filed https://github.com/apache/arrow-rs/issues/1176 for a discussion on what should we do with arrow / arrow2 if datafusion switched to using arrow2.

FWIW I think the decision to switch datafusion or not should be made independently (based on whatever is best for DataFusion) but the switch I think would have major implications for arrow / arrow2

andygrove commented 2 years ago

FWIW I think the decision to switch datafusion or not should be made independently (based on whatever is best for DataFusion) but the switch I think would have major implications for arrow / arrow2

:100:

Which brings me to my main point: imo this is not about which implementation has the fastest parquet reader or writer, it is about which code base has the most solid foundations for all of us to develop the next generation of columnar-based query engines

:100:

I am not currently actively involved in development with DataFusion, but if I were, I would be offering to help with the transition to arrow2.

Now that a PR is up to move to arrow2 I will at least try and help with some testing and benchmarking. I am really excited to see this happening. :heart_eyes:

emkornfield commented 2 years ago

Cross-posting related mailing list discussion: https://lists.apache.org/thread/dsyks2ylbonhs8ngnx6529dzfyfdjjzo

matthewmturner commented 2 years ago

Given how arrow2 has fine-grained control over IO features, I'm wondering if it would make sense to pass that through to datafusion, so you only have to enable the IO features that are needed.

I'm thinking of this in the use case of using datafusion in ETL jobs where each task has its own container and may only need 1 or 2 file types. This could be a way to help limit container size / speed up installation.
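A hypothetical sketch of what such feature pass-through could look like from a consumer's Cargo.toml; the feature names below are illustrative only, not actual datafusion features:

```toml
[dependencies]
# Hypothetical: enable only the IO formats this ETL task needs,
# instead of compiling readers for every supported file type.
datafusion = { version = "*", default-features = false, features = [
    "io_parquet",  # illustrative feature name
    "io_csv",      # illustrative feature name
] }
```

Cargo features are additive and unified across the dependency graph, so such flags would only shrink builds for consumers that actually disable the defaults.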

Igosuki commented 2 years ago

That requires gating the file formats behind feature flags. Edit: definitely doable with the current master.

Alnaimi- commented 2 years ago

Is this still planned? Seems like there has been little movement since July (besides #1039 in September)

tustvold commented 2 years ago

I believe @v0y4g3r is working on getting the arrow2 branch updated in https://github.com/apache/arrow-datafusion/pull/2855 as part of https://github.com/apache/arrow-datafusion/issues/2709

I'm not sure what the long-term plans for this effort are, especially as the implementations continue to diverge in functionality, in both directions. I don't believe a wholesale switch is likely in the foreseeable future, it certainly isn't planned, but there have been some discussions about allowing users to mix and match arrow implementations, including arrow-gpu. Anything is possible so long as people are motivated to achieve it :smile:

alamb commented 2 years ago

Yeah -- short answer is that no one has gathered sufficient effort to get the code unified.

alamb commented 2 years ago

I think it is also important to note that many of the ideas from @jorgecarleitao in arrow2 have now been incorporated into arrow-rs

Alnaimi- commented 2 years ago

Thanks both @tustvold @alamb.

I think arrow-rs may be the safe bet for now, especially since @jorgecarleitao has been too busy recently to spearhead things. Pity, I really liked the direction of the arrow2 APIs.

tustvold commented 1 year ago

An "update" on this is I intend to use https://github.com/pola-rs/polars/issues/6735 as an opportunity to explore the possibilities for improved interoperability between arrow and arrow2. I'm fairly optimistic we can make use of the FFI APIs to provide inexpensive, zero-copy conversion between the two libraries, allowing people to mix and match as desired.
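The zero-copy idea behind such FFI-based interoperability can be sketched std-only. This is not the actual Arrow C data interface (which exchanges `ArrowArray`/`ArrowSchema` C structs with a release callback); it is a hypothetical miniature showing only the core mechanism: one library exports a pointer to its buffer, and the other wraps the same memory without copying.

```rust
// "Library A" (standing in for arrow2): owns a typed buffer.
struct LibAArray {
    data: Vec<i32>,
}

// Exported view: raw parts, as an FFI export would hand over.
// (The real C data interface also carries schema and a release callback.)
struct Exported {
    ptr: *const i32,
    len: usize,
}

impl LibAArray {
    fn export(&self) -> Exported {
        Exported { ptr: self.data.as_ptr(), len: self.data.len() }
    }
}

// "Library B" (standing in for arrow-rs): a view over the same memory.
struct LibBArray<'a> {
    values: &'a [i32],
}

fn import(e: &Exported) -> LibBArray<'_> {
    // Safety: the exporter guarantees the buffer outlives this import;
    // the real interface encodes that contract via a release callback.
    LibBArray { values: unsafe { std::slice::from_raw_parts(e.ptr, e.len) } }
}

fn main() {
    let a = LibAArray { data: vec![10, 20, 30] };
    let exported = a.export();
    let b = import(&exported);
    // Same memory on both sides: no bytes were copied.
    assert_eq!(b.values.as_ptr(), a.data.as_ptr());
    assert_eq!(b.values, &[10, 20, 30]);
}
```

Because both libraries implement the same Arrow memory layout, the conversion cost is bounded by pointer and metadata shuffling, which is what makes mix-and-match plausible.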

alamb commented 1 year ago

Update is I believe rather than switching DataFusion to use arrow2, we are likely going to combine arrow-rs and arrow2 -- see discussions on https://github.com/apache/arrow-rs/issues/1176#issuecomment-1432616438

Let's move the discussion there

Alnaimi- commented 1 year ago

Cool. Thanks for update!