alamb commented 2 years ago

TLDR: please comment on this ticket if you have opinions about if and/or how the community should unite its efforts on a single Rust implementation of Apache Arrow.

Related mailing list thread: https://lists.apache.org/thread/dsyks2ylbonhs8ngnx6529dzfyfdjjzo

There is active discussion and a PR https://github.com/apache/arrow-datafusion/pull/1556 about switching the DataFusion project to use the arrow2 Rust implementation of Arrow from @jorgecarleitao. While this DataFusion PR is not yet ready to merge, if DataFusion were to switch to arrow2, that leaves a question of what will happen with this (arrow-rs) code.

Since many of the PRs, contributors and maintainers of this (arrow-rs) crate are part of the DataFusion community, I believe if DataFusion switches to arrow2, much of the maintenance and extension efforts would follow arrow2

arrow2is largely developed by @jorgecarleitao, who is an Apache Arrow PMC member and committer, but the project itself has not been under the Apache Software Foundation’s governance. Additional background can be found on the mailing list archives and past mailing list threads such as this and this

It is my opinion that the Rust / Arrow / DataFusion community has general consensus on:

Having one implementation of Arrow in Rust where we can focus would be better than 2 which split attention and resources
The technical underpinnings of arrow2 are more ergonomic

It is not clear to me if there is a consensus on:

How important the Apache Governance model is (please lend your opinions here!)
How important the stability of APIs / the specific versioning scheme (0.x vs 1.x or later)

Possible ideas for a way forward:

Switch datafusion to arrow2, making no changes to arrow-rs. It could be maintained by anyone who wished to contribute,
Bring arrow2 code into the arrow-rs repo, with appropriate IP clearance and adopt that as the officially maintained arrow implementation (*)
Start more actively porting the more ergonomic parts of arrow2 into arrow-rs to reduce the feature gap as suggested in https://github.com/apache/arrow-datafusion/issues/1532#issuecomment-1012985001 by @tustvold
Others?

Option 2 leaves open the question of “how does arrow2 development move forward” – where would patches be sent, for example? I would hope we can find a way that is compatible with Apache governance, but I don't think we have a specific proposal yet, and it also depends in large part on what @jorgecarleitao is comfortable with

So, for any users of this crate not also in the DataFusion community, what are your hopes / needs / plans from this crate? How important is the apache governance to you? Please tell us your thoughts!

andygrove commented 2 years ago

My personal hope is that arrow2 can be donated to the ASF as a 0.x project. Given how widely used arrow2 is, I believe that it is becoming the de-facto implementation anyway and IMHO there would be benefits to having Apache governance.

For many of us who work for large corporations, we have to seek permission to contribute to open source projects. For example, I have permission to contribute to Apache Arrow and its subprojects, but I cannot contribute to arrow2 while it is an independent project (not that I would necessarily be contributing anyway, but others may be in a similar position).

I see no reason why the stable arrow crate cannot continue to exist and evolve independently of arrow2, assuming that there are contributors motivated to do so.

alamb commented 2 years ago

For the record, I would be willing to help maintain arrow2 if it were donated into the ASF doing things like coordinating release process / voting, bug fixing, answering questions, and the like

houqp commented 2 years ago

Thank you @alamb for starting and driving this discussion. Great summary on the current community consensus.

It is not clear to me if there is a consensus on: How important the Apache Governance model is (please lend your opinions here!)

Personally, I think the Apache model works better for relatively slow moving monolith projects, while arrow2/parquet2 are fast evolving projects with a vision to be broken up into even smaller modular crates. @alamb has done an exceptional work on driving the arrow-rs releases. But seeing how much effort and time it takes, I would consider it a unnecessary overhead for arrow2 at its current stage. @jorgecarleitao was able to react to user feedbacks fast and release 3-4 new versions in a week for arrow2, this is simply not possible with the Apache Governance model. That said, I think the Apache voting process is very useful when you need high confidence on the quality of every single release and has a large diverse set of PMC members who can participate in the voting in a timely manner. But arrow2 seems still pretty far away from this.

@andygrove brought up a good point that it might become an issue for large corporations with restrictive open source contribution guide lines. This is the first time I am aware of this issue, previously I was under the impression that software license is all what matters. On the other hand, I am guessing ASF is not the only governance that's allowed? Perhaps we could help @jorgecarleitao come up with a different compatible governance model for arrow2 until it's ready for the ASF contribution? If Andy wants to contribute to arrow2 now but is blocked by lack of governance, then I would consider this a serious issue that we should address. Otherwise I would optimize for iteration velocity over governance until it becomes a real problem.

In short, from what I have seen so far, the upside from adopting the Apache governance model is to unblock potential contributions from big corporations. The downside is it will slow down our iteration process and potentially even disincentivize @jorgecarleitao from actively working on the project. Reading from his past emails, I get the feeling that he did try very hard to pass the IP clearance and donate arrow2 to ASF last year, but got frustrated by the bureaucracy. I am personally much more concerned about latter than the former.

How important the stability of APIs / the specific versioning scheme (0.x vs 1.x or later)

IMHO, this is not important as long as it is well communicated to the users. i.e. be explicit that we are special and please treat our 8.x as 0.x until we say otherwise. But Jorge has a strong opinion on this and want to strictly follow what the rest of the Rust ecosystem does. I also understand where he is coming from and respect his stance on this.

Switch datafusion to arrow2, making no changes to arrow-rs. It could be maintained by anyone who wished to contribute,

I agree with @andygrove on this. As long as there is community interests in this, we should probably still open arrow-rs up for contributions. This is not the result I want to see, but I have a feeling that this is likely what is going to happen :(

Start more actively porting the more ergonomic parts of arrow2 into arrow-rs

I think this is certainly doable, but then I stand by my previous comment that it won't be a good use of our time unless there is fundamental design tradeoffs in arrow-rs that are not compatible with arrow2's design. Simply replicating the design another project has is not a good reason to start a fork IMHO. I know @tustvold has a fairly strong opinion on this option and is more familiar with the parquet code base than I do, so perhaps he could help shed some light on this.

Option 2 leaves open the question of “how does arrow2 development move forward” – where would patches be sent, for example?

Just throwing out random idea here, one potential variant of option 2 is we use arrow-rs as the place to maintain stable arrow2 branches and let arrow2 iterate as fast as it could without the fear of introducing breaking changes. While the stable branch will cherry-pick compatible commits for a specific 0.x release that we want to maintain for X months. This way, we can still direct all contributions back to arrow2. The downside is I don't know how much interests the community has for a stable API considering we just decided to stop maintaining stable releases for arrow-rs.

tustvold commented 2 years ago

I know @tustvold has a fairly strong opinion on this option and is more familiar with the parquet code base than I do, so perhaps he could help shed some light on this.

I would just like to get away from this situation where we have two concurrent projects. It is just demoralizing, draining, and to be completely honest it just seems a tad unnecessary. Whilst I do not like the idea of porting stuff across, and yes it would be an annoying use of time, I am willing to contribute to such an effort if it sees an end to this situation. It overcomes what is otherwise a potentially indefinite political and bureaucratic discussion with pure technical brute force. It is very similar in my mind to option 2, simply changing the merge direction.

Ultimately I'm going to end up porting code regardless, I would prefer an outcome that allows me to save others the same effort :smile:

houqp commented 2 years ago

I totally agree with you @tustvold . We are all engineers looking to solve interesting technical problems together, not playing political games after all.

sunchao commented 2 years ago

For many of us who work for large corporations, we have to seek permission to contribute to open source projects. For example, I have permission to contribute to Apache Arrow and its subprojects, but I cannot contribute to arrow2 while it is an independent project (not that I would necessarily be contributing anyway, but others may be in a similar position).

I'd like to echo @andygrove 's point here. The only reason we're able to contribute to Arrow/Parquet rust implementation is because it's under the governance of Apache. Otherwise, it'd be very hard for us.

jorgecarleitao commented 1 year ago

I would just like to get away from this situation where we have two concurrent projects. [...]

I agree. I agree that the situation is not productive. I am sorry that I caused frustration to people here.

Whilst I do not like the idea of porting stuff across, and yes it would be an annoying use of time, I am willing to contribute to such an effort if it sees an end to this situation.

I am also willing to contribute to such an effort.

What do you think about something to the effect of:

Arrow2 is donated to Apache Arrow and its development ceases in jorgecarleitao/arrow2
The core of arrow2 (array/, bitmap/, offsets.rs, types/) are lifted to a crate living in this repo (e.g. arrow-core or something).
the core receives relevant methods from arrow-rs; add methods existing in arrow-rs with "deprecated" to give time for arrow-rs users to use them.
arrow-rs' FFI of arrow2 is moved to a separate crate and replaces arrow-rs' one
Arrow-rs' compute is migrated to use arrow-core
Arrow-rs' IO except IPC is migrated to use arrow-core
Arrow-rs' IPC IO is replaced by arrow2's implementation with necessary adjustments
arrow-core will add RunEndArray (this is missing there atm)
RecordBatch (arrow-rs) and Chunk (arrow2) co-exist to give room for both communities to use (in core or something else)
Development and governance follows Apache and this repo of community-driven development.

This could result in the following changes to arrow-rs:

ArrayData is removed
It becomes interoperable with Vec but no longer aligned with cache lines.
IPC support is improved with e.g. unsafe free, big endian support, mmap of IPC files
There is code churn related to from vs from_slice (we can switch to arrow-rs names)
slicing of arrays become easier (handling of offsets)

It would also end the arrow-arrow2 split e.g. removing the un-productive discussions around "which is better", and combine development efforts.

Some challenges:

Arrow2 uses Box<dyn Array> as children to allow easy mutation; arrow-rs uses Arc<dyn Array>
Arrow2 does not have TimestampArray nor DecimalArray, and instead sticks to the physical types only

tustvold commented 1 year ago

I am also willing to contribute to such an effort.

That would be fantastic, finding a way to unify our efforts would be amazing

What do you think about something to the effect of

Let me take some time to think about the implications of this and how we might do it in an incremental fashion. Whilst I have no particular affection for ArrayData, it is a fairly useful escape hatch and is important some more advanced use-cases (Vec's allocator support is still unstable IIRC).

ritchie46 commented 1 year ago

I am also willing to contribute to such an effort.

I am also willing to help on this. And if this would go forward consider me as invested as long term maintainer of the project as well.

alamb commented 1 year ago

Thank you @jorgecarleitao @ritchie46 and @tustvold

Arrow2 is donated to Apache Arrow and its development ceases in jorgecarleitao/arrow2

I am willing to run the "Ip clearance" process to get the arrow2 codebase into the arrow-rs repository. I have done this before for object_store and while it takes some time I think it is worth it in this case.

As for the technical plans, the only thing I feel (very) strongly is that there is a migration path that doesn't involve a "all downstream crates need to rewrite the world" -- given the above discussions, it sounds like there are already some good thoughts on this matter so I trust it will be in good hands.

I would also be interested in what some of the other recently active contributors / maintainers of arrow-rs such @viirya @askoa and @iajoiner think of these ideas

FYI @liukun4515

andygrove commented 1 year ago

This is fantastic news! Thank you @jorgecarleitao and @ritchie46!

jorgecarleitao commented 1 year ago

That would be fantastic, finding a way to unify our efforts would be amazing

Agree. :D

Let me take some time to think about the implications of this and how we might do it in an incremental fashion.

incremental fashion 👍

Do let me know if we need a small demo of the current arrow2 design to motivate it or something else.

One idea is to wrap arrow2 arrays in arrow-rs newtypes struct PrimitiveArray<T>(PrimitiveArrayInternal<T>);, so that arrow-rs users continue to see (almost) the same API; internally we could glue them together via From<> implementations - just an idea.

I am willing to run the "Ip clearance" process to get the arrow2 codebase into the arrow-rs repository. I have done this before for object_store and while it takes some time I think it is worth it in this case.

Thank you 🙇 - I will certainly help in gathering evidence regarding contributors.

As for the technical plans, the only thing I feel (very) strongly is that there is a migration path that doesn't involve a "all downstream crates need to rewrite the world"

@alamb - I agree - I believe ultimately some rewriting will have to happen as we have arrow2 and arrow-rs users with two APIs sometimes with the same name and different signatures. I do think that the way arrow-rs has been re-writing its APIs to be very conductive and productive (i.e. gradually, deprecation warnings, etc.) 🎩 tip to the team here.

askoa commented 1 year ago

@alamb Thanks for the tag. Though I can't gauge the benefits of arrow2 v arrow-rs, I agree to the notion that the change should happen incrementally to reduce the impact for downstream of both projects.

Removing ArrayData is a huge change for arrow-rs. Also, my impression is that ArrayData allowed to slice arrays to different logical views on same underlying physical data. So it is possible to have many logical arrays built on top of one physical array. By removing ArrayData we might increase memory footprint if users want different logical view of same physical data.

I, as usual, continue to pick up issues based on my availability.

ritchie46 commented 1 year ago

it is possible to have many logical arrays built on top of one physical array. By removing ArrayData we might increase memory footprint if users want different logical view of same physical data.

This is perfectly possible in arrow2 as well. The logical types are stored by the DataType of the physical array. Memory footprint is only dictated by physical arrays.

viirya commented 1 year ago

Thanks @alamb for pinging me. I've read @jorgecarleitao's comments (and previous comments) when it was posted yesterday. I think my feelings come from two aspects.

As a downstream crate user of arrow-rs, ideally we hope that the unification of arrow-rs and arrow2 wouldn't bring too much API changes, although sometimes it might be not avoidable. Based on the proposed change list and our code dependency, our impact might be slight because there are middle layers like DataFusion so some changes will be ingested there, but I'm also wondering how much it could be on projects which are fully coupled with. At first glance at the list of change in @jorgecarleitao's comment, seems at least API stuffs could be existing for a while.

As a contributor / maintainer of arrow-rs or an open source contributor of Rust arrow implementation, this is a fantastic news definitely. We can contribute to and benefit from others works in one single place. I'd be very pleasant to see this happens, and definitely very willing to help on it. 💪

tustvold commented 1 year ago

What do people think about doing something like https://github.com/apache/arrow-rs/issues/1799 and adding a layout enumeration inside of ArrayData consisting of the arrow2 array abstractions, or some variation thereupon? I think this should be possible without major public API changes, avoiding churning any downstreams, whilst still gaining us many of the things @jorgecarleitao mentioned above? We could then at our leisure update the arrow-rs arrays / builders to use them #1811

I'm mainly trying to avoid making changes to Array and its implementations, as any change there will be painful.

jorgecarleitao commented 1 year ago

This comment is not about how we merge arrow-rs and arrow2, just trying to outline the overall issues I observed in arrow-rs that resulted in arrow2. It may serve as an inspiration to find a solution that combines both.

1. `Buffer` is untyped

ptr::reading a buffer requires knowing the layout it was created from. This means that the struct Buffer cannot be semantically used. In other words, the information it carries is insufficient to use it - something else, usually its position in the ArrayData and the DataType, is required to use it in any way.

2. `Buffer` logical memory region is unknown

we need to know the offset, length and physical type to get its logical memory region. This is stored in ArrayData. Whether we need to offset the buffer or not with ArrayData is dependent on the physical type. How we should apply offset is also dependent on the physical type (offsets of bitmaps are measured in bits).

Any user of ArrayData will need to correctly apply these rules when reading any of its Buffers.

3. `ArrayData` is untyped

We need to use DataType (a logical type) to discern, at runtime, how to interpret its buffers and children. I.e. any physical typing is erased from ArrayData when it is built.

https://github.com/apache/arrow-rs/issues/1799 is an idea to address this.

4. Efficient use of `ArrayData` requires self-referencing

This is done performance reasons: since accessing an element requires runtime indirection to the buffers of ArrayData, which is expensive, we need to self-reference them to get a good performance. Self-referencing is considered an anti-pattern in Rust.

https://github.com/apache/arrow-rs/issues/1811 is one way to address this.

5. `Buffer` and `MutableBuffer` layout is inconsistent with `Vec`

Most Rust uses Vec. However, we allocate according to cache lines irrespectively of the type, resulting in a layout incompatible with Vec. This results in our inability to easily data from Buffer / MutableBuffer back to a Vec. The other way is possible due to foreign allocated capabilities of Buffer.

How arrow2 solves these

Problem 1. and 2.

Buffer is typed
Buffer contains an offset and length

#[derive(Clone)]
pub struct Buffer<T> {
    /// the internal byte buffer.
    data: Arc<Bytes<T>>,

    /// The offset into the buffer.
    offset: usize,

    // the length of the buffer. Given a region `data` of N elements, [offset..offset+length] is visible
    // to this buffer.
    length: usize,
}

Bitmap is a first-class citizen

#[derive(Clone)]
pub struct Bitmap {
    bytes: Arc<Bytes<u8>>,
    // both are measured in bits. They are used to bound the bitmap to a region of Bytes.
    offset: usize,
    length: usize,
    // this is a cache: it is computed on initialization
    unset_bits: usize,
}

#[derive(Clone)]
pub struct MutableBitmap {
    buffer: Vec<u8>,
    // invariant: length.saturating_add(7) / 8 == buffer.len();
    length: usize,
}

These together are necessary and sufficient to represent all physical regions of an Arrow Array.

Problem 3. and 4.

An array is composed by the individual parts. Examples:

#[derive(Clone)]
pub struct PrimitiveArray<T: NativeType> {
    data_type: DataType,
    values: Buffer<T>,
    validity: Option<Bitmap>,
}

#[derive(Clone)]
pub struct BooleanArray {
    data_type: DataType,
    values: Bitmap,
    validity: Option<Bitmap>,
}

#[derive(Clone)]
pub struct BinaryArray<O: Offset> {
    data_type: DataType,
    offsets: OffsetsBuffer<O>,  // a newtype buffer where elements are guaranteed to be in increasing order
    values: Buffer<u8>,
    validity: Option<Bitmap>,
}

Because Buffer implements Deref to Target [T], we can access individual values without self-referencing.

Problem 5.

Arrow2 uses Vec as its "native" container. In particular, it does not use MutableBuffer and instead uses std::vec::Vec. There is no "freeze"; instead, there is Arc<Bytes<T>> (a small variation of Arc<Vec<T>> to allow for foreign allocated regions).

Wrap up

What I am trying to say with "remove ArrayData" is that the moment we expose such API to users, we expose a significant amount of footguns, almost all of them unsafe. In my opinion, we should strive to remove such footguns. In my opinion, this could be done by exposing all the necessary functionality around individual arrays on the arrays themselves (or their Mutable counterparts), so that, as a user, you can only use those APIs.

For example,

impl<T: NativeType> PrimitiveArray<T> {
  pub fn try_new(
          data_type: DataType,
          values: Buffer<T>,
          validity: Option<Bitmap>,
      ) -> Result<Self, Error>;
}

Provides a complete construction of a PrimitiveArray in Arrow2 - ArrayData::null_count is the unset bits validity, ArrayData::offset is values.offset().

A simpler construct (both are O(1)):

impl<T: NativeType> PrimitiveArray<T> {
    pub fn from_vec(values: Vec<T>) -> Self {
        Self::try_new(T::PRIMITIVE.into(), values.into(), None).unwrap()
    }

With

impl<T: NativeType> PrimitiveArray<T> {
    #[must_use]
    pub fn into_inner(self) -> (DataType, Buffer<T>, Option<Bitmap>) {
        let Self {
            data_type,
            values,
            validity,
        } = self;
        (data_type, values, validity)
    }
}

a user reverts to its individual parts. Because both values and validity contain the individual offsets and physical information (via generic), Buffer behave like bytes::Bytes and Bitmap like a normal container (e.g. iter() skips by offset and runs up to length).

I am not saying we need to do this in arrow-rs, just pointing out that this is the appeal of arrow2 - imo it provides a very easy mental model of the data and corresponding invariants because every struct encapsulates the necessary information (via types or values) to use it regardless of its origin (e.g. using Buffer requires information from its parent ArrayData).

tustvold commented 1 year ago

What did you think of putting the arrow2 arrays inside of ArrayData, this would achieve all of the above without changing the public array APIs, at least initially?

Edit: I've had a poke around in DataFusion and confirmed it makes fairly limited of ArrayData directly, mostly just using it to access the null buffer, length or downcast out of an Arc. This is some empirical evidence to my hunch that we can make quite aggressive changes at this level without it being overly painful for downstreams, certainly significantly less painful than altering the array abstractions.

jorgecarleitao commented 1 year ago

@tustvold do you mean in line with #1799 ? I think that that would work. :)

I do wonder: to preserve arrow-rs public API, wouldn't making arrow-rs arrays newtypes of arrow2 arrays be enough?

tustvold commented 1 year ago

wouldn't making arrow-rs arrays newtypes of arrow2 arrays be enough?

Something like that is definitely the intended end state, but we need some way to get there incrementally 😅

I also have a vague hope that the new ArrayData enumeration will allow moving away from trait object downcasting, which is confusing and ergonomically unfortunate

jorgecarleitao commented 1 year ago

Something like that is definitely the intended end state, but we need some way to get there incrementally 😅

Got it. 👍

I also have a vague hope that the new ArrayData enumeration will allow moving away from trait object downcasting, which is confusing and ergonomically unfortunate

Interesting! That is not something I had thought about and makes a lot of sense. 👍

From my end, your proposal: :shipit:

tustvold commented 1 year ago

Wonderful, I'll make a start getting the basic abstractions in place next week, and will then create some tickets for the various migration work that will fall out of this so that we can divide and conquer on this. Hopefully by the time the IP clearance completes we will have gotten arrow-rs into a state where the arrow2 arrays can just be dropped in.

ritchie46 commented 1 year ago

Can these new-types be added to a separate crate and thereby leaving arrow2 as an arrow-core crate? This seems like a good separation of concerns to me and will also ensure parallel compilation and probably less complexity.

tustvold commented 1 year ago

Can these new-types be added to a separate crate and thereby leaving arrow2 as an arrow-core crate

I would expect the arrow2 arrays to largely replace what is currently in arrow-data, effectively arrow-data would become the arrow2 array abstractions. Any other ported functionality, e.g. IPC, would then go into the corresponding crate, e.g. arrow-ipc. I think that is what you are asking for?

ritchie46 commented 1 year ago

Any other ported functionality, e.g. IPC, would then go into the corresponding crate, e.g. arrow-ipc. I think that is what you are asking for?

I think so. I think it is a combination of arrow-data and arrow-buffer. Maybe I am missing another one. The IO and compute can definitely be done in subcrates.

What I also meant with the newtypes are the ArrayData abstractions. Users of current arrow2 don't use/need that abstraction, so I think it is worth adding that in a separate crate that is opt-in.

tustvold commented 1 year ago

The IO and compute can definitely be done in subcrates

Are these part of the donation? I had interpreted it as just the array abstractions, FFI and IPC?

Users of current arrow2 don't use/need

Related to the above, I'm not sure how this would work, the arrow-rs kernels, IO, etc... will likely still be in terms of arrow-rs at least for a while?

alamb commented 1 year ago

I am not very familiar with the arrow2 community, but I wonder if any of the recent committers / contributors (such as @b41sh @sundy-li @hzou, @Arty-Maly @kylebarron) have any thoughts on the discussion / proposal on this ticket

kylebarron commented 1 year ago

I don't know arrow-rs internals as well to know the right abstraction to merge the two projects. In general though, very excited at the prospect of focusing development efforts and stable/reliable nested parquet support.

Arty-Maly commented 1 year ago

I would be willing to help in this transition as well. I am extensively using arrow2 so am committed to hastening this transition.

I have not used arrow-rs much but with arrow2 my main pain points are some api inconsistencies among the different arrays, for example MutableUtfArray::set_validity was not present but MutablePrimitiveArray::set_validity existed. And the methods that are usually missed are not in the traits but are defined in the structs resulting in duplication and slight drift in logic and meaning.

I am calling it out as a way to maybe guard this functionality during the transition since some of the common methods that all arrays should do are defined on the structs themselves.

In terms of the plan Im aligned with @jorgecarleitao outlined proposal

tustvold commented 1 year ago

In the interests of saving people from having to parse the entire backscroll, and to verify everyone is on the same page, here is what I have understood the consensus to be:

arrow-rs will be migrated to use strongly-typed buffer abstractions under ArrayData, i.e. ArrayData becomes an enumeration similar to the proposal in https://github.com/apache/arrow-rs/issues/1799
The weakly typed ArrayData APIs will be gradually deprecated and removed, in favour of new strongly-typed APIs
arrow2 will be donated to Apache Arrow
The arrow2 array abstractions will be combined with the buffer abstractions under ArrayData
The arrow-rs kernels and IO will be updated to use ArrayData instead of arrow-rs arrays, this will be done preserving backwards compatibility (e.g. impl AsRef`)
The arrow2 IPC and FFI will be merged with the arrow-rs implementation

This should mean that users of arrow-rs by will not need to change anything immediately, some functionality will gradually be deprecated and removed, but there will be no rewrite-the-world style releases and we will endeavor to minimise churn.

Once the arrow2 array abstractions are incorporated (4.) users of arrow2 will be able to gradually migrate over to the arrow-rs kernels and IO, with this getting easier once the kernels are updated to use ArrayData instead of the arrow-rs arrays.

It isn't clear to me yet what the maintenance story for the arrow2 kernels, IO, etc... is during the transition period and afterward. I suspect they will be gradually deprecated, but this has only been implied and I'm wary of jumping to conclusions here :sweat_smile:

Edit: I intend to work on 1. this week, having some issues getting LLVM to optimise the new constructs correctly, its vectorisation logic is extraordinarily fussy, hopefully I can pacify it without too much unsafe 😅

jorgecarleitao commented 1 year ago

arrow2 will maintain other IO / kernels for a deprecation time period, pointing to produce arrow-rs's arrow2 arrays. I can do it either under Apache Arrow or outside (no preference). With time, we can move relevant bits from arrow2 into arrow-rs.

One thing I am still not clear on is regarding the types - arrow2 has a typing system based on physical types (int64), while arrow-rs is based on logical types (e.g. timestamp). I suspect that we may need to keep them both for the time being.

Arrow2 also has a different notion of a Field - in arrow2 Field does not have IPC-specific attributes, which makes it more idiomatic, but arrow-rs has them. We may address that when arrow2's IPC and FFI are brought into.

EDIT: a third aspect is that if ArrayData depends on arrow-rs's DataType, we have a major regression since arrow2 supports DataType::Extension (essentially closing https://github.com/apache/arrow-rs/issues/2444) - this is widely used by arrow2 users, so keeping that support would be important

ritchie46 commented 1 year ago

The arrow2 array abstractions will be combined with the buffer abstractions under ArrayData

Regarding this. The core array abstraction remain available as is right? For instance polars can keep using the arrow2 core arrays? The whole crate is designed around that + ArrayRef. IMO the ArrayData abstraction should be an opt-in API choice.

It isn't clear to me yet what the maintenance story for the arrow2 kernels, IO, etc... is during the transition period and afterward. I suspect they will be gradually deprecated

They are used by polars a lot. It all depends on priorities, but I will be likely supporting them for a while. I don't think this should matter as it doesn't split the community. It only means we have multiple IO engines/kernels to pick from.

tustvold commented 1 year ago

The core array abstraction remain available as is right

That is my expectation, we might need to rename a few things, e.g. arrow2::Array to PhysicalArray or something, and add conversion to/from ArrayData for use in kernels/IO, but otherwise things should remain similar.

b41sh commented 1 year ago

@alamb Thanks for pinging me. It's great to see the two projects merged so that we can all work together in one project and everyone will benefit from it. Considering that the merge is likely to be a long process with a lot of work to be done, I'm more than happy to contribute to the process.

tustvold commented 1 year ago

For those following along

https://github.com/apache/arrow-rs/pull/3743 migrates the arrow-rs arrays to use typed buffer abstractions as a precursor to migrating them to use strongly typed ArrayData
https://github.com/apache/arrow-rs/pull/3749 starts adding strongly typed ArrayData abstractions
https://github.com/apache/arrow-rs/pull/3756 adds zero-copy conversion between Vec and Buffer

Together these work towards allowing us to slowly deprecate and remove the untyped MutableBuffer and Buffer, in favor of arrow2-style Vec and ScalarBuffer respectively. If people have any spare cycles any feedback on these PRs and/or the general approach would be much appreciated.

sundy-li commented 1 year ago

@alamb Glad to see arrow-rs and arrow2 could be merged together, I would like to contribute to help these processes.

tustvold commented 1 year ago

What do people think of migrating the arrow2 arrays to be newtypes around the statically typed ArrayData once ready, i.e. much like we are doing for the arrow-rs arrays? This would have a few advantages:

Work can start on merging the codebases without having to wait for IP clearance, I hope to have a first cut of the ArrayData ready for the release in just under two weeks
Allows for incremental changes to arrow2, in much the same vein as we are doing for arrow-rs
We can find potential integration issues sooner
We can immediately divide and conquer
Avoids issues with overloaded method and type names, etc... each codebase can keep its current array abstractions as is, gradually migrating as/when desired

I don't really see a way to avoid doing something similar to this, without a break-the-world arrow2 release, and I think bringing this forward has some compelling advantages?

ritchie46 commented 1 year ago

Can we take PrimitiveArray as an example?

This currently is a small wrapper around a typed Buffer<T> and an optional bitmap Option<Bitmap>.

pub struct PrimitiveArray<T: NativeType> {
    data_type: DataType,
    values: Buffer<T>,
    validity: Option<Bitmap>,
}

As I understand you want to convert it to

struct PrimitiveArray<T: NativeType> {
  array_data: ArrayData
  phantom_data: Phantom<T>
}

enum ArrrayData {
   Primitive{..},
   Utf8{..},
   ...
}

This would cost an extra check upon random access of the array as the variant of the ArrayData needs to be unpacked, or there needs to be used extra unsafe and hope the compiler is able to compile away those branches that will not be hit. Besides that to me it feels the wrong way around to go from known type with an untyped buffer inside. The other way around makes much more sense to me.

tustvold commented 1 year ago

Besides that to me it feels the wrong way around to go from known type with an untyped buffer inside

It won't be untyped it would contain PrimitiveArrayData ?

ritchie46 commented 1 year ago

It won't be untyped it would contain PrimitiveArrayData ?

Ah, I see. In that case it makes sense :+1: . By the look of it, it seems exactly the same as an arrow2 PrimitiveArray? Couldn't we do with one type in that case?

tustvold commented 1 year ago

Couldn't we do with one type in that case

We theoretically could, I was proposing not to for the reasons outlined in https://github.com/apache/arrow-rs/issues/1176#issuecomment-1447931537

ritchie46 commented 1 year ago

Couldn't we do with one type in that case

We theoretically could, I was proposing not to for the reasons outlined in #1176 (comment)

I understand. For PrimitiveArray it seems fine as the newtype seems to be exactly the same for other type I can imagine it getting a bit more complicated, so no comment yet.

tustvold commented 1 year ago

for other type I can imagine it getting a bit more complicated

Perhaps you might like to check out https://github.com/apache/arrow-rs/pull/3769 which contains the remaining abstractions, and let me know what you think there?

alamb commented 1 year ago

As an update here, I am working on creating a summary of what I think the proposal is in this ticket that won't require reading all the context, and then I will create an "epic" style ticket that breaks the work down into more manageable pieces. I expect to be ready in the next few days

alamb commented 1 year ago

Here is a summary of what I think the plan is. I used Google Slides to make it easier to comment on diagrams: https://docs.google.com/presentation/d/1cqQEpC-kJES2Mng152r_qZyaOqHjtb5YFuseSTWyulU/edit#slide=id.p

Here is a copy for anyone who prefers PDF: arrow-rs + arrow2.pdf

Very much looking forward to feedback. Thank you @tustvold and @ritchie46 who helped create this summary

I plan to make the tracking ticket next week

alamb commented 1 year ago

I have filed https://github.com/jorgecarleitao/arrow2/issues/1429 with the proposal of how this could work and some alternatives. Please provide your feedback there.

Unless I hear otherwise I plan to close this particular issue in a few days and we can continue the discussion on https://github.com/jorgecarleitao/arrow2/issues/1429

apache / arrow-rs

Discussion: relationship / unification of arrow-rs and arrow2 going forward #1176

1. `Buffer` is untyped

2. `Buffer` logical memory region is unknown

3. `ArrayData` is untyped

4. Efficient use of `ArrayData` requires self-referencing

5. `Buffer` and `MutableBuffer` layout is inconsistent with `Vec`

How arrow2 solves these

Problem 1. and 2.

Problem 3. and 4.

Problem 5.

Wrap up

apache / arrow-rs

Discussion: relationship / unification of arrow-rs and arrow2 going forward #1176

1. Buffer is untyped

2. Buffer logical memory region is unknown

3. ArrayData is untyped

4. Efficient use of ArrayData requires self-referencing

5. Buffer and MutableBuffer layout is inconsistent with Vec

How arrow2 solves these

Problem 1. and 2.

Problem 3. and 4.

Problem 5.

Wrap up

1. `Buffer` is untyped

2. `Buffer` logical memory region is unknown

3. `ArrayData` is untyped

4. Efficient use of `ArrayData` requires self-referencing

5. `Buffer` and `MutableBuffer` layout is inconsistent with `Vec`