apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.5k stars 745 forks source link

Add `into_builder` methods for Arrays #6430

Open alamb opened 1 week ago

alamb commented 1 week ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. While working with various arrays and builders in DataFusion one thing that we often want to do is convert between an array and some modifiable version of it (for example, in StringView trim it would be nice to just modify existing views rather than mutate them)

I think in general it would be good to use the various Builder APIs for this

Describe the solution you'd like

I would like all the Arrays to have equivalent methods to PrimitiveArray::into_builder().

Like I want to be able to do

let array: StringViewArray = ...;
let builder = array.into_builder();
// mutate the builder (more APIs TBD)
// recreate the array
let array = builder.build()

Similarly it would be nice to do something similar with BooleanBuiler, possibly other arrays

Describe alternatives you've considered

Additional context Example DataFusion PRs: https://github.com/apache/datafusion/pull/12395

Rachelint commented 1 week ago

Yes, I think into_builder is something necessary to make in-place modification for performance. I think maybe another promising improvement in datafusion https://github.com/apache/datafusion/pull/12526 requires this ability to run even better, too.

Maybe I can help this. Or @ShiKaiWi may also be insterested about this?

ShiKaiWi commented 6 days ago

I'm glad to work on this ticket. Thanks @Rachelint.

ShiKaiWi commented 6 days ago

take.

alamb commented 6 days ago

@ShiKaiWi I would recommend:

  1. Start with a single array type (perhaps BooleanArray)
  2. Add an example of using the into_builder as a doc comment

Perhaps initially start with something like this

let arr: BooleanArray = vec![true, true, false].into();
let mut builder = arr.into_builder()
// Append 2 new rows
builder.append_value(false);
builder.append_null();
let arr = builder.finish();
assert_eq!(....)
ShiKaiWi commented 6 days ago

When trying to start from the BooleanArray, I find that the current BooleanBufferBuilder can't reuse the buffer from a sliced BooleanArray if the slicing is not aligned.

Here is the BooleanBufferBuilder's definition:

pub struct BooleanBufferBuilder {
    buffer: MutableBuffer,
    len: usize,
}

And it assumes that the boolean values must start from the first bit, but a sliced BooleanArray may skip some bits of the first byte. So I guess I have to enhance the BooleanBufferBuilder by adding an offset field to support reuse of the buffer of a sliced BooleanArray first.

@alamb @Rachelint Please let me know your opinions.

Rachelint commented 5 days ago

When trying to start from the BooleanArray, I find that the current BooleanBufferBuilder can't reuse the buffer from a sliced BooleanArray if the slicing is not aligned.

Here is the BooleanBufferBuilder's definition:

pub struct BooleanBufferBuilder {
    buffer: MutableBuffer,
    len: usize,
}

And it assumes that the boolean values must start from the first bit, but a sliced BooleanArray may skip some bits of the first byte. So I guess I have to enhance the BooleanBufferBuilder by adding an offset field to support reuse of the buffer of a sliced BooleanArray first.

@alamb @Rachelint Please let me know your opinions.

Yes, it is actually a problem for converting from BooleanArray to BooleanBuilder. And for in-place modification, I found another challenge that we can't modify bit through index using BooleanBuilder?

In my opinion, can we define something like BooleanArrayMutator? I am still reading codes and thinking.

alamb commented 5 days ago

In my opinion, can we define something like BooleanArrayMutator? I am still reading codes and thinking.

I think we can use the builders for this rather than adding a new API. I think we can use the existing structures for low level manipulation if we made it a bit easier to convert between them.

I think we should make conversion from array --> builder such that they don't copy if possible, but if necessary copy the underlying buffers. A sliced array is likely to be share so copying the bitmap to modify it will be needed anyways

And for in-place modification, I found another challenge that we can't modify bit through index using BooleanBuilder?

The lower level BooleanBufferBuilder contains the relevant APIs, like this: https://docs.rs/arrow/latest/arrow/array/builder/struct.BooleanBufferBuilder.html#method.set_bit

Rather than changing BooleanBuilder to allow direct manipulation of the underling boolean buffer, maybe we could make it easier to further destructure it (see source)

For example

// convert boolean array --> BooleanBuilder
let builder = boolean_array.into_builder();

// Deconstruct the BooleanBuilder into `BooleanBufferBuilder` and `NullBufferBuilder
let (bool_builder, null_builder) = builder.into_builders();

.. modify the boolean/null buffers via low level builders

// put it back together
let builder = BooleanBuilder::new_from_builders(bool_builder, null_builder);
let array = builder.finish()

new_from_builders is inspired by https://docs.rs/arrow/latest/arrow/array/builder/struct.PrimitiveBuilder.html#method.new_from_buffer

Rachelint commented 5 days ago

In my opinion, can we define something like BooleanArrayMutator? I am still reading codes and thinking.

I think we can use the builders for this rather than adding a new API. I think we can use the existing structures for low level manipulation if we made it a bit easier to convert between them.

I think we should make conversion from array --> builder such that they don't copy if possible, but if necessary copy the underlying buffers. A sliced array is likely to be share so copying the bitmap to modify it will be needed anyways

And for in-place modification, I found another challenge that we can't modify bit through index using BooleanBuilder?

The lower level BooleanBufferBuilder contains the relevant APIs, like this: https://docs.rs/arrow/latest/arrow/array/builder/struct.BooleanBufferBuilder.html#method.set_bit

Rather than changing BooleanBuilder to allow direct manipulation of the underling boolean buffer, maybe we could make it easier to further destructure it (see source)

For example

// convert boolean array --> BooleanBuilder
let builder = boolean_array.into_builder();

// Deconstruct the BooleanBuilder into `BooleanBufferBuilder` and `NullBufferBuilder
let (bool_builder, null_builder) = builder.into_builders();

.. modify the boolean/null buffers via low level builders

// put it back together
let builder = BooleanBuilder::new_from_builders(bool_builder, null_builder);
let array = builder.finish()

new_from_builders is inspired by https://docs.rs/arrow/latest/arrow/array/builder/struct.PrimitiveBuilder.html#method.new_from_buffer

Got it! The alternative may can be something like slices_mut in PrimitiveBuilder?

impl<T: ArrowPrimitiveType> PrimitiveBuilder<T> {
       /// Returns the current values buffer and null buffer as a slice
      pub fn slices_mut(&mut self) -> (&mut [T::Native], Option<&mut [u8]>) {
          (
              self.values_builder.as_slice_mut(),
              self.null_buffer_builder.as_slice_mut(),
          )
      }
}

We offer the similar api like:

impl BooleanBuilder {
    pub fn builders_mut(&mut self) -> (&mut BooleanBufferBuilder, &mut NullBufferBuilder) {
      ...
    }
}
Rachelint commented 5 days ago

@ShiKaiWi I read codes today, and I found maybe we can use BooleanArray::sliced to solve the offset problem. It may be not so efficient if found buffer not aligned(clone will be triggered), but can work currently.

ShiKaiWi commented 5 days ago

Thanks a lot for your responses and suggestions. @alamb @Rachelint

In conclusion, the two problems and the proposed solutions to them are as follows:

1. Whether to clone the underlying buffer for the unaligned-sliced boolean array

I think we should make conversion from array --> builder such that they don't copy if possible, but if necessary copy the underlying buffers. A sliced array is likely to be share so copying the bitmap to modify it will be needed anyways

I read codes today, and I found maybe we can use BooleanArray::slicedBooleanBuffer::sliced to solve the offset problem. It may be not so efficient if found buffer not aligned(clone will be triggered), but can work currently.

Basically, I agree with using BooleanBuffer::sliced to get the underlying buffer, which may introduce clone.

However, it makes me worried that in the implementation of PrimitiveArray::into_builder the error will be thrown if the underlying buffer is found shared, and here is the comment of it:

    /// Returns a `PrimitiveBuilder` for this array, suitable for mutating values
    /// in place.
    ///
    /// # Buffer Reuse
    ///
    /// If the underlying data buffer has no other outstanding references, the
    /// buffer is used without copying.
    ///
    /// If the underlying data buffer does have outstanding references, returns
    /// `Err(self)`
    pub fn into_builder(self) -> Result<PrimitiveBuilder<T>, Self> {
    ...
    }

And if we allow the clone for BooleanArray::into_builder the consistency of into_builder between PrimitiveArray::into_builder and BooleanArray::into_builder will be broken, and will it be confusing?

Actually, I'm in favor of totally avoiding copying for a not-shared sliced boolean array for performance and api consistency, but some breaking changes in the public api of BooleanBufferBuilder will be introduced...

So considering the api consistency, shall we insist on allowing clone for sliced BooleanArray?

2. Allow direct manipulation of the underling boolean buffer

The proposed solutions are similar -- exposing the underlying two builders. However, the provided two builders are required to be manipulated correctly, which seems not user-friendly, if some bit needs to be set/unset. And I guess it would be better if we add a new method to the BooleanBuilder:

    pub fn set_value(&mut self, index: usize, v: Option<bool>) {
    ... manipulating the `BooleanBufferBuilder` and `NullBufferBuilder`.
    }

So how about this proposal?

alamb commented 3 days ago

So considering the api consistency, shall we insist on allowing clone for sliced BooleanArray?

No, I think we should follow the model of PrimitiveArray rather than cloning implicitly

The proposed solutions are similar -- exposing the underlying two builders. However, the provided two builders are required to be manipulated correctly, which seems not user-friendly, if some bit needs to be set/unset. And I guess it would be better if we add a new method to the BooleanBuilder:

I think we should still permit converting BooleanBuilder into its two internal builders as well as creating a BooleanBuilder from the two builders - I agree it is not as user friendly but it offers the most control

Adding some additional easier to use apis like BooleanBuilder::set_value also seems reasonable to me, as we have usercases for them

So perhaps we can first focus on more easily converting back / forth between builders, and then we can figure out what additional APIs we could add to the builders to make it easier to use?