Open alamb opened 1 week ago
Yes, I think into_builder
is something necessary to make in-place modification for performance.
I think maybe another promising improvement in datafusion https://github.com/apache/datafusion/pull/12526 requires this ability to run even better, too.
Maybe I can help this. Or @ShiKaiWi may also be insterested about this?
I'm glad to work on this ticket. Thanks @Rachelint.
take.
@ShiKaiWi I would recommend:
into_builder
as a doc commentPerhaps initially start with something like this
let arr: BooleanArray = vec![true, true, false].into();
let mut builder = arr.into_builder()
// Append 2 new rows
builder.append_value(false);
builder.append_null();
let arr = builder.finish();
assert_eq!(....)
When trying to start from the BooleanArray
, I find that the current BooleanBufferBuilder
can't reuse the buffer from a sliced BooleanArray
if the slicing is not aligned.
Here is the BooleanBufferBuilder
's definition:
pub struct BooleanBufferBuilder {
buffer: MutableBuffer,
len: usize,
}
And it assumes that the boolean values must start from the first bit, but a sliced BooleanArray
may skip some bits of the first byte. So I guess I have to enhance the BooleanBufferBuilder
by adding an offset
field to support reuse of the buffer of a sliced BooleanArray
first.
@alamb @Rachelint Please let me know your opinions.
When trying to start from the
BooleanArray
, I find that the currentBooleanBufferBuilder
can't reuse the buffer from a slicedBooleanArray
if the slicing is not aligned.Here is the
BooleanBufferBuilder
's definition:pub struct BooleanBufferBuilder { buffer: MutableBuffer, len: usize, }
And it assumes that the boolean values must start from the first bit, but a sliced
BooleanArray
may skip some bits of the first byte. So I guess I have to enhance theBooleanBufferBuilder
by adding anoffset
field to support reuse of the buffer of a slicedBooleanArray
first.@alamb @Rachelint Please let me know your opinions.
Yes, it is actually a problem for converting from BooleanArray
to BooleanBuilder
.
And for in-place modification, I found another challenge that we can't modify bit through index using BooleanBuilder
?
In my opinion, can we define something like BooleanArrayMutator
? I am still reading codes and thinking.
In my opinion, can we define something like BooleanArrayMutator? I am still reading codes and thinking.
I think we can use the builders for this rather than adding a new API. I think we can use the existing structures for low level manipulation if we made it a bit easier to convert between them.
I think we should make conversion from array --> builder such that they don't copy if possible, but if necessary copy the underlying buffers. A sliced array is likely to be share so copying the bitmap to modify it will be needed anyways
And for in-place modification, I found another challenge that we can't modify bit through index using BooleanBuilder?
The lower level BooleanBufferBuilder contains the relevant APIs, like this: https://docs.rs/arrow/latest/arrow/array/builder/struct.BooleanBufferBuilder.html#method.set_bit
Rather than changing BooleanBuilder to allow direct manipulation of the underling boolean buffer, maybe we could make it easier to further destructure it (see source)
For example
// convert boolean array --> BooleanBuilder
let builder = boolean_array.into_builder();
// Deconstruct the BooleanBuilder into `BooleanBufferBuilder` and `NullBufferBuilder
let (bool_builder, null_builder) = builder.into_builders();
.. modify the boolean/null buffers via low level builders
// put it back together
let builder = BooleanBuilder::new_from_builders(bool_builder, null_builder);
let array = builder.finish()
new_from_builders is inspired by https://docs.rs/arrow/latest/arrow/array/builder/struct.PrimitiveBuilder.html#method.new_from_buffer
In my opinion, can we define something like BooleanArrayMutator? I am still reading codes and thinking.
I think we can use the builders for this rather than adding a new API. I think we can use the existing structures for low level manipulation if we made it a bit easier to convert between them.
I think we should make conversion from array --> builder such that they don't copy if possible, but if necessary copy the underlying buffers. A sliced array is likely to be share so copying the bitmap to modify it will be needed anyways
And for in-place modification, I found another challenge that we can't modify bit through index using BooleanBuilder?
The lower level BooleanBufferBuilder contains the relevant APIs, like this: https://docs.rs/arrow/latest/arrow/array/builder/struct.BooleanBufferBuilder.html#method.set_bit
Rather than changing BooleanBuilder to allow direct manipulation of the underling boolean buffer, maybe we could make it easier to further destructure it (see source)
For example
// convert boolean array --> BooleanBuilder let builder = boolean_array.into_builder(); // Deconstruct the BooleanBuilder into `BooleanBufferBuilder` and `NullBufferBuilder let (bool_builder, null_builder) = builder.into_builders(); .. modify the boolean/null buffers via low level builders // put it back together let builder = BooleanBuilder::new_from_builders(bool_builder, null_builder); let array = builder.finish()
new_from_builders is inspired by https://docs.rs/arrow/latest/arrow/array/builder/struct.PrimitiveBuilder.html#method.new_from_buffer
Got it!
The alternative may can be something like slices_mut
in PrimitiveBuilder
?
impl<T: ArrowPrimitiveType> PrimitiveBuilder<T> {
/// Returns the current values buffer and null buffer as a slice
pub fn slices_mut(&mut self) -> (&mut [T::Native], Option<&mut [u8]>) {
(
self.values_builder.as_slice_mut(),
self.null_buffer_builder.as_slice_mut(),
)
}
}
We offer the similar api like:
impl BooleanBuilder {
pub fn builders_mut(&mut self) -> (&mut BooleanBufferBuilder, &mut NullBufferBuilder) {
...
}
}
@ShiKaiWi I read codes today, and I found maybe we can use BooleanArray::sliced
to solve the offset
problem.
It may be not so efficient if found buffer not aligned(clone will be triggered), but can work currently.
Thanks a lot for your responses and suggestions. @alamb @Rachelint
In conclusion, the two problems and the proposed solutions to them are as follows:
1. Whether to clone the underlying buffer for the unaligned-sliced boolean array
I think we should make conversion from array --> builder such that they don't copy if possible, but if necessary copy the underlying buffers. A sliced array is likely to be share so copying the bitmap to modify it will be needed anyways
I read codes today, and I found maybe we can use
BooleanArray::sliced
BooleanBuffer::sliced
to solve the offset problem. It may be not so efficient if found buffer not aligned(clone will be triggered), but can work currently.
Basically, I agree with using BooleanBuffer::sliced
to get the underlying buffer, which may introduce clone.
However, it makes me worried that in the implementation of PrimitiveArray::into_builder
the error will be thrown if the underlying buffer is found shared, and here is the comment of it:
/// Returns a `PrimitiveBuilder` for this array, suitable for mutating values
/// in place.
///
/// # Buffer Reuse
///
/// If the underlying data buffer has no other outstanding references, the
/// buffer is used without copying.
///
/// If the underlying data buffer does have outstanding references, returns
/// `Err(self)`
pub fn into_builder(self) -> Result<PrimitiveBuilder<T>, Self> {
...
}
And if we allow the clone for BooleanArray::into_builder
the consistency of into_builder
between PrimitiveArray::into_builder
and BooleanArray::into_builder
will be broken, and will it be confusing?
Actually, I'm in favor of totally avoiding copying for a not-shared sliced boolean array for performance and api consistency, but some breaking changes in the public api of BooleanBufferBuilder
will be introduced...
So considering the api consistency, shall we insist on allowing clone for sliced BooleanArray
?
2. Allow direct manipulation of the underling boolean buffer
The proposed solutions are similar -- exposing the underlying two builders. However, the provided two builders are required to be manipulated correctly, which seems not user-friendly, if some bit needs to be set/unset. And I guess it would be better if we add a new method to the BooleanBuilder
:
pub fn set_value(&mut self, index: usize, v: Option<bool>) {
... manipulating the `BooleanBufferBuilder` and `NullBufferBuilder`.
}
So how about this proposal?
So considering the api consistency, shall we insist on allowing clone for sliced BooleanArray?
No, I think we should follow the model of PrimitiveArray rather than cloning implicitly
The proposed solutions are similar -- exposing the underlying two builders. However, the provided two builders are required to be manipulated correctly, which seems not user-friendly, if some bit needs to be set/unset. And I guess it would be better if we add a new method to the BooleanBuilder:
I think we should still permit converting BooleanBuilder
into its two internal builders as well as creating a BooleanBuilder from the two builders - I agree it is not as user friendly but it offers the most control
Adding some additional easier to use apis like BooleanBuilder::set_value
also seems reasonable to me, as we have usercases for them
So perhaps we can first focus on more easily converting back / forth between builders, and then we can figure out what additional APIs we could add to the builders to make it easier to use?
Is your feature request related to a problem or challenge? Please describe what you are trying to do. While working with various arrays and builders in DataFusion one thing that we often want to do is convert between an array and some modifiable version of it (for example, in StringView trim it would be nice to just modify existing views rather than mutate them)
I think in general it would be good to use the various Builder APIs for this
Describe the solution you'd like
I would like all the Arrays to have equivalent methods to
PrimitiveArray::into_builder()
.Like I want to be able to do
Similarly it would be nice to do something similar with
BooleanBuiler
, possibly other arraysDescribe alternatives you've considered
Additional context Example DataFusion PRs: https://github.com/apache/datafusion/pull/12395