Closed alamb closed 1 year ago
My personal hope is that arrow2 can be donated to the ASF as a 0.x
project. Given how widely used arrow2 is, I believe that it is becoming the de-facto implementation anyway and IMHO there would be benefits to having Apache governance.
For many of us who work for large corporations, we have to seek permission to contribute to open source projects. For example, I have permission to contribute to Apache Arrow and its subprojects, but I cannot contribute to arrow2 while it is an independent project (not that I would necessarily be contributing anyway, but others may be in a similar position).
I see no reason why the stable arrow crate cannot continue to exist and evolve independently of arrow2, assuming that there are contributors motivated to do so.
For the record, I would be willing to help maintain arrow2
if it were donated into the ASF doing things like coordinating release process / voting, bug fixing, answering questions, and the like
Thank you @alamb for starting and driving this discussion. Great summary on the current community consensus.
It is not clear to me if there is a consensus on: How important the Apache Governance model is (please lend your opinions here!)
Personally, I think the Apache model works better for relatively slow moving monolith projects, while arrow2/parquet2 are fast evolving projects with a vision to be broken up into even smaller modular crates. @alamb has done an exceptional work on driving the arrow-rs releases. But seeing how much effort and time it takes, I would consider it a unnecessary overhead for arrow2 at its current stage. @jorgecarleitao was able to react to user feedbacks fast and release 3-4 new versions in a week for arrow2, this is simply not possible with the Apache Governance model. That said, I think the Apache voting process is very useful when you need high confidence on the quality of every single release and has a large diverse set of PMC members who can participate in the voting in a timely manner. But arrow2 seems still pretty far away from this.
@andygrove brought up a good point that it might become an issue for large corporations with restrictive open source contribution guide lines. This is the first time I am aware of this issue, previously I was under the impression that software license is all what matters. On the other hand, I am guessing ASF is not the only governance that's allowed? Perhaps we could help @jorgecarleitao come up with a different compatible governance model for arrow2 until it's ready for the ASF contribution? If Andy wants to contribute to arrow2 now but is blocked by lack of governance, then I would consider this a serious issue that we should address. Otherwise I would optimize for iteration velocity over governance until it becomes a real problem.
In short, from what I have seen so far, the upside from adopting the Apache governance model is to unblock potential contributions from big corporations. The downside is it will slow down our iteration process and potentially even disincentivize @jorgecarleitao from actively working on the project. Reading from his past emails, I get the feeling that he did try very hard to pass the IP clearance and donate arrow2 to ASF last year, but got frustrated by the bureaucracy. I am personally much more concerned about latter than the former.
How important the stability of APIs / the specific versioning scheme (0.x vs 1.x or later)
IMHO, this is not important as long as it is well communicated to the users. i.e. be explicit that we are special and please treat our 8.x as 0.x until we say otherwise. But Jorge has a strong opinion on this and want to strictly follow what the rest of the Rust ecosystem does. I also understand where he is coming from and respect his stance on this.
Switch datafusion to arrow2, making no changes to arrow-rs. It could be maintained by anyone who wished to contribute,
I agree with @andygrove on this. As long as there is community interests in this, we should probably still open arrow-rs up for contributions. This is not the result I want to see, but I have a feeling that this is likely what is going to happen :(
Start more actively porting the more ergonomic parts of arrow2 into arrow-rs
I think this is certainly doable, but then I stand by my previous comment that it won't be a good use of our time unless there is fundamental design tradeoffs in arrow-rs that are not compatible with arrow2's design. Simply replicating the design another project has is not a good reason to start a fork IMHO. I know @tustvold has a fairly strong opinion on this option and is more familiar with the parquet code base than I do, so perhaps he could help shed some light on this.
Option 2 leaves open the question of “how does arrow2 development move forward” – where would patches be sent, for example?
Just throwing out random idea here, one potential variant of option 2 is we use arrow-rs as the place to maintain stable arrow2 branches and let arrow2 iterate as fast as it could without the fear of introducing breaking changes. While the stable branch will cherry-pick compatible commits for a specific 0.x release that we want to maintain for X months. This way, we can still direct all contributions back to arrow2. The downside is I don't know how much interests the community has for a stable API considering we just decided to stop maintaining stable releases for arrow-rs.
I know @tustvold has a fairly strong opinion on this option and is more familiar with the parquet code base than I do, so perhaps he could help shed some light on this.
I would just like to get away from this situation where we have two concurrent projects. It is just demoralizing, draining, and to be completely honest it just seems a tad unnecessary. Whilst I do not like the idea of porting stuff across, and yes it would be an annoying use of time, I am willing to contribute to such an effort if it sees an end to this situation. It overcomes what is otherwise a potentially indefinite political and bureaucratic discussion with pure technical brute force. It is very similar in my mind to option 2, simply changing the merge direction.
Ultimately I'm going to end up porting code regardless, I would prefer an outcome that allows me to save others the same effort :smile:
I totally agree with you @tustvold . We are all engineers looking to solve interesting technical problems together, not playing political games after all.
For many of us who work for large corporations, we have to seek permission to contribute to open source projects. For example, I have permission to contribute to Apache Arrow and its subprojects, but I cannot contribute to arrow2 while it is an independent project (not that I would necessarily be contributing anyway, but others may be in a similar position).
I'd like to echo @andygrove 's point here. The only reason we're able to contribute to Arrow/Parquet rust implementation is because it's under the governance of Apache. Otherwise, it'd be very hard for us.
I would just like to get away from this situation where we have two concurrent projects. [...]
I agree. I agree that the situation is not productive. I am sorry that I caused frustration to people here.
Whilst I do not like the idea of porting stuff across, and yes it would be an annoying use of time, I am willing to contribute to such an effort if it sees an end to this situation.
I am also willing to contribute to such an effort.
What do you think about something to the effect of:
array/
, bitmap/
, offsets.rs
, types/
) are lifted to a crate living in this repo (e.g. arrow-core
or something).arrow-core
arrow-core
RunEndArray
(this is missing there atm)RecordBatch
(arrow-rs) and Chunk
(arrow2) co-exist to give room for both communities to use (in core or something else)This could result in the following changes to arrow-rs:
Vec
but no longer aligned with cache lines.from
vs from_slice
(we can switch to arrow-rs names)It would also end the arrow-arrow2 split e.g. removing the un-productive discussions around "which is better", and combine development efforts.
Some challenges:
Box<dyn Array>
as children to allow easy mutation; arrow-rs uses Arc<dyn Array>
TimestampArray
nor DecimalArray
, and instead sticks to the physical types onlyI am also willing to contribute to such an effort.
That would be fantastic, finding a way to unify our efforts would be amazing
What do you think about something to the effect of
Let me take some time to think about the implications of this and how we might do it in an incremental fashion. Whilst I have no particular affection for ArrayData
, it is a fairly useful escape hatch and is important some more advanced use-cases (Vec's allocator support is still unstable IIRC).
I am also willing to contribute to such an effort.
I am also willing to help on this. And if this would go forward consider me as invested as long term maintainer of the project as well.
Thank you @jorgecarleitao @ritchie46 and @tustvold
Arrow2 is donated to Apache Arrow and its development ceases in jorgecarleitao/arrow2
I am willing to run the "Ip clearance" process to get the arrow2 codebase into the arrow-rs repository. I have done this before for object_store and while it takes some time I think it is worth it in this case.
As for the technical plans, the only thing I feel (very) strongly is that there is a migration path that doesn't involve a "all downstream crates need to rewrite the world" -- given the above discussions, it sounds like there are already some good thoughts on this matter so I trust it will be in good hands.
I would also be interested in what some of the other recently active contributors / maintainers of arrow-rs such @viirya @askoa and @iajoiner think of these ideas
FYI @liukun4515
This is fantastic news! Thank you @jorgecarleitao and @ritchie46!
That would be fantastic, finding a way to unify our efforts would be amazing
Agree. :D
Let me take some time to think about the implications of this and how we might do it in an incremental fashion.
incremental fashion 👍
Do let me know if we need a small demo of the current arrow2 design to motivate it or something else.
One idea is to wrap arrow2 arrays in arrow-rs newtypes struct PrimitiveArray<T>(PrimitiveArrayInternal<T>);
, so that arrow-rs users continue to see (almost) the same API; internally we could glue them together via From<>
implementations - just an idea.
I am willing to run the "Ip clearance" process to get the arrow2 codebase into the arrow-rs repository. I have done this before for object_store and while it takes some time I think it is worth it in this case.
Thank you 🙇 - I will certainly help in gathering evidence regarding contributors.
As for the technical plans, the only thing I feel (very) strongly is that there is a migration path that doesn't involve a "all downstream crates need to rewrite the world"
@alamb - I agree - I believe ultimately some rewriting will have to happen as we have arrow2 and arrow-rs users with two APIs sometimes with the same name and different signatures. I do think that the way arrow-rs has been re-writing its APIs to be very conductive and productive (i.e. gradually, deprecation warnings, etc.) 🎩 tip to the team here.
@alamb Thanks for the tag. Though I can't gauge the benefits of arrow2 v arrow-rs, I agree to the notion that the change should happen incrementally to reduce the impact for downstream of both projects.
Removing ArrayData
is a huge change for arrow-rs
. Also, my impression is that ArrayData
allowed to slice arrays to different logical views on same underlying physical data. So it is possible to have many logical arrays built on top of one physical array. By removing ArrayData
we might increase memory footprint if users want different logical view of same physical data.
I, as usual, continue to pick up issues based on my availability.
it is possible to have many logical arrays built on top of one physical array. By removing ArrayData we might increase memory footprint if users want different logical view of same physical data.
This is perfectly possible in arrow2 as well. The logical types are stored by the DataType
of the physical array. Memory footprint is only dictated by physical arrays.
Thanks @alamb for pinging me. I've read @jorgecarleitao's comments (and previous comments) when it was posted yesterday. I think my feelings come from two aspects.
As a downstream crate user of arrow-rs
, ideally we hope that the unification of arrow-rs
and arrow2
wouldn't bring too much API changes, although sometimes it might be not avoidable. Based on the proposed change list and our code dependency, our impact might be slight because there are middle layers like DataFusion
so some changes will be ingested there, but I'm also wondering how much it could be on projects which are fully coupled with. At first glance at the list of change in @jorgecarleitao's comment, seems at least API stuffs could be existing for a while.
As a contributor / maintainer of arrow-rs
or an open source contributor of Rust arrow implementation, this is a fantastic news definitely. We can contribute to and benefit from others works in one single place. I'd be very pleasant to see this happens, and definitely very willing to help on it. 💪
What do people think about doing something like https://github.com/apache/arrow-rs/issues/1799 and adding a layout enumeration inside of ArrayData
consisting of the arrow2
array abstractions, or some variation thereupon? I think this should be possible without major public API changes, avoiding churning any downstreams, whilst still gaining us many of the things @jorgecarleitao mentioned above? We could then at our leisure update the arrow-rs arrays / builders to use them #1811
I'm mainly trying to avoid making changes to Array
and its implementations, as any change there will be painful.
This comment is not about how we merge arrow-rs and arrow2, just trying to outline the overall issues I observed in arrow-rs that resulted in arrow2. It may serve as an inspiration to find a solution that combines both.
Buffer
is untypedptr::read
ing a buffer requires knowing the layout it was created from. This means that the struct Buffer
cannot be semantically used. In other words, the information it carries is insufficient to use it - something else, usually its position in the ArrayData
and the DataType
, is required to use it in any way.
Buffer
logical memory region is unknownwe need to know the offset
, length
and physical type to get its logical memory region. This is stored in ArrayData. Whether we need to offset the buffer or not with ArrayData
is dependent on the physical type. How we should apply offset
is also dependent on the physical type (offsets of bitmaps are measured in bits).
Any user of ArrayData
will need to correctly apply these rules when reading any of its Buffer
s.
ArrayData
is untypedWe need to use DataType
(a logical type) to discern, at runtime, how to interpret its buffers and children. I.e. any physical typing is erased from ArrayData
when it is built.
https://github.com/apache/arrow-rs/issues/1799 is an idea to address this.
ArrayData
requires self-referencingThis is done performance reasons: since accessing an element requires runtime indirection to the buffers of ArrayData
, which is expensive, we need to self-reference them to get a good performance. Self-referencing is considered an anti-pattern in Rust.
https://github.com/apache/arrow-rs/issues/1811 is one way to address this.
Buffer
and MutableBuffer
layout is inconsistent with Vec
Most Rust uses Vec
. However, we allocate according to cache lines irrespectively of the type, resulting in a layout incompatible with Vec
. This results in our inability to easily data from Buffer
/ MutableBuffer
back to a Vec
. The other way is possible due to foreign allocated capabilities of Buffer
.
Buffer
is typedBuffer
contains an offset
and length
#[derive(Clone)]
pub struct Buffer<T> {
/// the internal byte buffer.
data: Arc<Bytes<T>>,
/// The offset into the buffer.
offset: usize,
// the length of the buffer. Given a region `data` of N elements, [offset..offset+length] is visible
// to this buffer.
length: usize,
}
Bitmap
is a first-class citizen#[derive(Clone)]
pub struct Bitmap {
bytes: Arc<Bytes<u8>>,
// both are measured in bits. They are used to bound the bitmap to a region of Bytes.
offset: usize,
length: usize,
// this is a cache: it is computed on initialization
unset_bits: usize,
}
#[derive(Clone)]
pub struct MutableBitmap {
buffer: Vec<u8>,
// invariant: length.saturating_add(7) / 8 == buffer.len();
length: usize,
}
These together are necessary and sufficient to represent all physical regions of an Arrow Array.
An array is composed by the individual parts. Examples:
#[derive(Clone)]
pub struct PrimitiveArray<T: NativeType> {
data_type: DataType,
values: Buffer<T>,
validity: Option<Bitmap>,
}
#[derive(Clone)]
pub struct BooleanArray {
data_type: DataType,
values: Bitmap,
validity: Option<Bitmap>,
}
#[derive(Clone)]
pub struct BinaryArray<O: Offset> {
data_type: DataType,
offsets: OffsetsBuffer<O>, // a newtype buffer where elements are guaranteed to be in increasing order
values: Buffer<u8>,
validity: Option<Bitmap>,
}
Because Buffer
implements Deref to Target [T]
, we can access individual values without self-referencing.
Arrow2 uses Vec
as its "native" container. In particular, it does not use MutableBuffer
and instead uses std::vec::Vec
. There is no "freeze"; instead, there is Arc<Bytes<T>>
(a small variation of Arc<Vec<T>>
to allow for foreign allocated regions).
What I am trying to say with "remove ArrayData
" is that the moment we expose such API to users, we expose a significant amount of footguns, almost all of them unsafe
. In my opinion, we should strive to remove such footguns. In my opinion, this could be done by exposing all the necessary functionality around individual arrays on the arrays themselves (or their Mutable counterparts), so that, as a user, you can only use those APIs.
For example,
impl<T: NativeType> PrimitiveArray<T> {
pub fn try_new(
data_type: DataType,
values: Buffer<T>,
validity: Option<Bitmap>,
) -> Result<Self, Error>;
}
Provides a complete construction of a PrimitiveArray
in Arrow2 - ArrayData::null_count
is the unset bits validity
, ArrayData::offset is values.offset()
.
A simpler construct (both are O(1)
):
impl<T: NativeType> PrimitiveArray<T> {
pub fn from_vec(values: Vec<T>) -> Self {
Self::try_new(T::PRIMITIVE.into(), values.into(), None).unwrap()
}
With
impl<T: NativeType> PrimitiveArray<T> {
#[must_use]
pub fn into_inner(self) -> (DataType, Buffer<T>, Option<Bitmap>) {
let Self {
data_type,
values,
validity,
} = self;
(data_type, values, validity)
}
}
a user reverts to its individual parts. Because both values and validity contain the individual offsets and physical information (via generic), Buffer
behave like bytes::Bytes
and Bitmap
like a normal container (e.g. iter()
skips by offset
and runs up to length
).
I am not saying we need to do this in arrow-rs, just pointing out that this is the appeal of arrow2 - imo it provides a very easy mental model of the data and corresponding invariants because every struct encapsulates the necessary information (via types or values) to use it regardless of its origin (e.g. using Buffer
requires information from its parent ArrayData
).
What did you think of putting the arrow2 arrays inside of ArrayData, this would achieve all of the above without changing the public array APIs, at least initially?
Edit: I've had a poke around in DataFusion and confirmed it makes fairly limited of ArrayData
directly, mostly just using it to access the null buffer, length or downcast out of an Arc. This is some empirical evidence to my hunch that we can make quite aggressive changes at this level without it being overly painful for downstreams, certainly significantly less painful than altering the array abstractions.
@tustvold do you mean in line with #1799 ? I think that that would work. :)
I do wonder: to preserve arrow-rs public API, wouldn't making arrow-rs arrays newtypes of arrow2 arrays be enough?
wouldn't making arrow-rs arrays newtypes of arrow2 arrays be enough?
Something like that is definitely the intended end state, but we need some way to get there incrementally 😅
I also have a vague hope that the new ArrayData enumeration will allow moving away from trait object downcasting, which is confusing and ergonomically unfortunate
Something like that is definitely the intended end state, but we need some way to get there incrementally 😅
Got it. 👍
I also have a vague hope that the new ArrayData enumeration will allow moving away from trait object downcasting, which is confusing and ergonomically unfortunate
Interesting! That is not something I had thought about and makes a lot of sense. 👍
From my end, your proposal: :shipit:
Wonderful, I'll make a start getting the basic abstractions in place next week, and will then create some tickets for the various migration work that will fall out of this so that we can divide and conquer on this. Hopefully by the time the IP clearance completes we will have gotten arrow-rs into a state where the arrow2 arrays can just be dropped in.
Can these new-types be added to a separate crate and thereby leaving arrow2 as an arrow-core crate? This seems like a good separation of concerns to me and will also ensure parallel compilation and probably less complexity.
Can these new-types be added to a separate crate and thereby leaving arrow2 as an arrow-core crate
I would expect the arrow2 arrays to largely replace what is currently in arrow-data, effectively arrow-data would become the arrow2 array abstractions. Any other ported functionality, e.g. IPC, would then go into the corresponding crate, e.g. arrow-ipc. I think that is what you are asking for?
Any other ported functionality, e.g. IPC, would then go into the corresponding crate, e.g. arrow-ipc. I think that is what you are asking for?
I think so. I think it is a combination of arrow-data and arrow-buffer. Maybe I am missing another one. The IO and compute can definitely be done in subcrates.
What I also meant with the newtypes are the ArrayData
abstractions. Users of current arrow2 don't use/need that abstraction, so I think it is worth adding that in a separate crate that is opt-in
.
The IO and compute can definitely be done in subcrates
Are these part of the donation? I had interpreted it as just the array abstractions, FFI and IPC?
Users of current arrow2 don't use/need
Related to the above, I'm not sure how this would work, the arrow-rs kernels, IO, etc... will likely still be in terms of arrow-rs at least for a while?
I am not very familiar with the arrow2 community, but I wonder if any of the recent committers / contributors (such as @b41sh @sundy-li @hzou, @Arty-Maly @kylebarron) have any thoughts on the discussion / proposal on this ticket
I don't know arrow-rs internals as well to know the right abstraction to merge the two projects. In general though, very excited at the prospect of focusing development efforts and stable/reliable nested parquet support.
I would be willing to help in this transition as well. I am extensively using arrow2 so am committed to hastening this transition.
I have not used arrow-rs much but with arrow2 my main pain points are some api inconsistencies among the different arrays, for example MutableUtfArray::set_validity
was not present but MutablePrimitiveArray::set_validity
existed. And the methods that are usually missed are not in the traits but are defined in the structs resulting in duplication and slight drift in logic and meaning.
I am calling it out as a way to maybe guard this functionality during the transition since some of the common methods that all arrays should do are defined on the structs themselves.
In terms of the plan Im aligned with @jorgecarleitao outlined proposal
In the interests of saving people from having to parse the entire backscroll, and to verify everyone is on the same page, here is what I have understood the consensus to be:
ArrayData
becomes an enumeration similar to the proposal in https://github.com/apache/arrow-rs/issues/1799This should mean that users of arrow-rs by will not need to change anything immediately, some functionality will gradually be deprecated and removed, but there will be no rewrite-the-world style releases and we will endeavor to minimise churn.
Once the arrow2 array abstractions are incorporated (4.) users of arrow2 will be able to gradually migrate over to the arrow-rs kernels and IO, with this getting easier once the kernels are updated to use ArrayData
instead of the arrow-rs arrays.
It isn't clear to me yet what the maintenance story for the arrow2 kernels, IO, etc... is during the transition period and afterward. I suspect they will be gradually deprecated, but this has only been implied and I'm wary of jumping to conclusions here :sweat_smile:
Edit: I intend to work on 1. this week, having some issues getting LLVM to optimise the new constructs correctly, its vectorisation logic is extraordinarily fussy, hopefully I can pacify it without too much unsafe 😅
arrow2 will maintain other IO / kernels for a deprecation time period, pointing to produce arrow-rs's arrow2 arrays. I can do it either under Apache Arrow or outside (no preference). With time, we can move relevant bits from arrow2 into arrow-rs.
One thing I am still not clear on is regarding the types
- arrow2 has a typing system based on physical types (int64), while arrow-rs is based on logical types (e.g. timestamp). I suspect that we may need to keep them both for the time being.
Arrow2 also has a different notion of a Field
- in arrow2 Field
does not have IPC-specific attributes, which makes it more idiomatic, but arrow-rs has them. We may address that when arrow2's IPC and FFI are brought into.
EDIT: a third aspect is that if ArrayData
depends on arrow-rs's DataType
, we have a major regression since arrow2
supports DataType::Extension
(essentially closing https://github.com/apache/arrow-rs/issues/2444) - this is widely used by arrow2 users, so keeping that support would be important
The arrow2 array abstractions will be combined with the buffer abstractions under ArrayData
Regarding this. The core array abstraction remain available as is right? For instance polars can keep using the arrow2 core arrays? The whole crate is designed around that + ArrayRef
. IMO the ArrayData
abstraction should be an opt-in API choice.
It isn't clear to me yet what the maintenance story for the arrow2 kernels, IO, etc... is during the transition period and afterward. I suspect they will be gradually deprecated
They are used by polars a lot. It all depends on priorities, but I will be likely supporting them for a while. I don't think this should matter as it doesn't split the community. It only means we have multiple IO engines/kernels to pick from.
The core array abstraction remain available as is right
That is my expectation, we might need to rename a few things, e.g. arrow2::Array
to PhysicalArray
or something, and add conversion to/from ArrayData for use in kernels/IO, but otherwise things should remain similar.
@alamb Thanks for pinging me. It's great to see the two projects merged so that we can all work together in one project and everyone will benefit from it. Considering that the merge is likely to be a long process with a lot of work to be done, I'm more than happy to contribute to the process.
For those following along
Together these work towards allowing us to slowly deprecate and remove the untyped MutableBuffer
and Buffer
, in favor of arrow2-style Vec
and ScalarBuffer
respectively. If people have any spare cycles any feedback on these PRs and/or the general approach would be much appreciated.
@alamb Glad to see arrow-rs and arrow2 could be merged together, I would like to contribute to help these processes.
What do people think of migrating the arrow2 arrays to be newtypes around the statically typed ArrayData once ready, i.e. much like we are doing for the arrow-rs arrays? This would have a few advantages:
I don't really see a way to avoid doing something similar to this, without a break-the-world arrow2 release, and I think bringing this forward has some compelling advantages?
Can we take PrimitiveArray
as an example?
This currently is a small wrapper around a typed Buffer<T>
and an optional bitmap Option<Bitmap>
.
pub struct PrimitiveArray<T: NativeType> {
data_type: DataType,
values: Buffer<T>,
validity: Option<Bitmap>,
}
As I understand you want to convert it to
struct PrimitiveArray<T: NativeType> {
array_data: ArrayData
phantom_data: Phantom<T>
}
enum ArrrayData {
Primitive{..},
Utf8{..},
...
}
This would cost an extra check upon random access of the array as the variant of the ArrayData
needs to be unpacked, or there needs to be used extra unsafe and hope the compiler is able to compile away those branches that will not be hit. Besides that to me it feels the wrong way around to go from known type with an untyped buffer inside. The other way around makes much more sense to me.
Besides that to me it feels the wrong way around to go from known type with an untyped buffer inside
It won't be untyped it would contain PrimitiveArrayData
?
It won't be untyped it would contain PrimitiveArrayData ?
Ah, I see. In that case it makes sense :+1: . By the look of it, it seems exactly the same as an arrow2 PrimitiveArray
? Couldn't we do with one type in that case?
Couldn't we do with one type in that case
We theoretically could, I was proposing not to for the reasons outlined in https://github.com/apache/arrow-rs/issues/1176#issuecomment-1447931537
Couldn't we do with one type in that case
We theoretically could, I was proposing not to for the reasons outlined in #1176 (comment)
I understand. For PrimitiveArray
it seems fine as the newtype seems to be exactly the same for other type I can imagine it getting a bit more complicated, so no comment yet.
for other type I can imagine it getting a bit more complicated
Perhaps you might like to check out https://github.com/apache/arrow-rs/pull/3769 which contains the remaining abstractions, and let me know what you think there?
As an update here, I am working on creating a summary of what I think the proposal is in this ticket that won't require reading all the context, and then I will create an "epic" style ticket that breaks the work down into more manageable pieces. I expect to be ready in the next few days
Here is a summary of what I think the plan is. I used Google Slides to make it easier to comment on diagrams: https://docs.google.com/presentation/d/1cqQEpC-kJES2Mng152r_qZyaOqHjtb5YFuseSTWyulU/edit#slide=id.p
Here is a copy for anyone who prefers PDF: arrow-rs + arrow2.pdf
Very much looking forward to feedback. Thank you @tustvold and @ritchie46 who helped create this summary
I plan to make the tracking ticket next week
I have filed https://github.com/jorgecarleitao/arrow2/issues/1429 with the proposal of how this could work and some alternatives. Please provide your feedback there.
Unless I hear otherwise I plan to close this particular issue in a few days and we can continue the discussion on https://github.com/jorgecarleitao/arrow2/issues/1429
TLDR: please comment on this ticket if you have opinions about if and/or how the community should unite its efforts on a single Rust implementation of Apache Arrow.
Related mailing list thread: https://lists.apache.org/thread/dsyks2ylbonhs8ngnx6529dzfyfdjjzo
There is active discussion and a PR https://github.com/apache/arrow-datafusion/pull/1556 about switching the DataFusion project to use the arrow2 Rust implementation of Arrow from @jorgecarleitao. While this DataFusion PR is not yet ready to merge, if DataFusion were to switch to
arrow2
, that leaves a question of what will happen with this (arrow-rs
) code.Since many of the PRs, contributors and maintainers of this (arrow-rs) crate are part of the DataFusion community, I believe if DataFusion switches to
arrow2
, much of the maintenance and extension efforts would followarrow2
arrow2
is largely developed by @jorgecarleitao, who is an Apache Arrow PMC member and committer, but the project itself has not been under the Apache Software Foundation’s governance. Additional background can be found on the mailing list archives and past mailing list threads such as this and thisIt is my opinion that the Rust / Arrow / DataFusion community has general consensus on:
arrow2
are more ergonomicIt is not clear to me if there is a consensus on:
0.x
vs1.x
or later)Possible ideas for a way forward:
arrow2
, making no changes toarrow-rs
. It could be maintained by anyone who wished to contribute,arrow2
code into the arrow-rs repo, with appropriate IP clearance and adopt that as the officially maintained arrow implementation (*)Option 2 leaves open the question of “how does arrow2 development move forward” – where would patches be sent, for example? I would hope we can find a way that is compatible with Apache governance, but I don't think we have a specific proposal yet, and it also depends in large part on what @jorgecarleitao is comfortable with
So, for any users of this crate not also in the DataFusion community, what are your hopes / needs / plans from this crate? How important is the apache governance to you? Please tell us your thoughts!