apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.63k stars 803 forks source link

feat: add write_bytes for GenericBinaryBuilder #6652

Closed tisonkun closed 3 weeks ago

tisonkun commented 3 weeks ago

Similar to impl Write for GenericStringBuilder.

We found this helps in reducing one allocation when append bytes in many calls (compared with gather those bytes into a Vec first).

Not sure if there is some trait suitable for this method but just add it first.

cc @alamb @waynexia

The doc tests cover the basic usage.

tisonkun commented 3 weeks ago

Try impl io::Write interface. No sure if a good fit.

tisonkun commented 3 weeks ago

@andylokandy Thanks for your review! Addressed.

waynexia commented 3 weeks ago

From the API level: this looks kind of misleading to me. The proposed Write::write or write_bytes makes BinaryBuilder closer to an opaque byte container, shadowing the concept of item and offset. I.e., array built from this API would contain only one element. Plus we don't have an API to "cut" the bytes buffer, but need to call builder.append_value(""); which looks a bit hack.

How about considering something like Vec::from_raw_parts, which accepts different buffers obviously and hence avoids the potential misunderstand?

andylokandy commented 3 weeks ago

I'm the member of databend, which heavily adopts arrow-rs and I'm sure it is a very common pattern to construct an item in a StringArrayBuilder/BinaryArrayBuilder by several parts of &str or &[u8].

For example, once this PR is accepted, to add a space padding before each item of BinaryArray and generate a new BinaryArray, we can do the following:

for x in array.iter() {
    write!(&mut builder, b" ").unwrap();
    write!(&mut builder, x).unwrap();
    builder.append_value("");
}

Moreover, StringBuilder has already supported this pattern. We use it in databend smoothly and we hope BinaryBuilder can have the same ability.

waynexia commented 3 weeks ago

I'm the member of databend, which heavily adopts arrow-rs and I'm sure it is a very common pattern to construct an item in a StringArrayBuilder/BinaryArrayBuilder by several parts of &str or &[u8].

For example, to add a space padding before each item of BinaryArray and generate a new BinaryArray, we can do the following:

for x in array.iter() {
    write!(&mut builder, b" ").unwrap();
    write!(&mut builder, x).unwrap();
    builder.append_value("");
}

Moreover, StringBuilder has already supported this pattern. We use it in databend smoothly and we hope BinaryBuilder can have the same ability.

Thanks for your insight, this case makes sense to me 🙌 BTW, do you think it's necessary to encapsulate this pattern builder.append_value(""); into a dedicated method?

andylokandy commented 3 weeks ago

I'll suggest builder.commit_row();. What do you think?

alamb commented 3 weeks ago

Thank you @tisonkun and @tustvold