apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.54k stars 763 forks source link

Expose an `advance` method on builders #1352

Open TimDiekmann opened 2 years ago

TimDiekmann commented 2 years ago

Given the scenario from the Fixed size list layout in the Arrow Columnar Format, I have a list of IP addresses:

For an array of length 4 with respective values:

[[192, 168, 0, 12], null, [192, 168, 0, 25], [192, 168, 0, 1]]

which will have the following representation:

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  |--------------------------|-----------------------|
  | 00001101                 | 0 (padding)           |

* Values array (byte array):
  * Length: 16,  Null count: 0
  * validity bitmap buffer: Not required

    | Bytes 0-3       | Bytes 4-7   | Bytes 8-15                      |
    |-----------------|-------------|---------------------------------|
    | 192, 168, 0, 12 | unspecified | 192, 168, 0, 25, 192, 168, 0, 1 |

When using the builder-API I currently have this code:

let values: [Option<[u8; 4]>; 4] = [
    Some([192, 168, 0, 12]),
    None,
    Some([192, 168, 0, 25]),
    Some([192, 168, 0, 1]),
];

// Create the fixed-size list builder with the primitive builder inside
let mut list_builder =
    FixedSizeListBuilder::new(PrimitiveBuilder::<UInt8Type>::new(4 * 4), 4);

for value in values {
    let value_builder = list_builder.values();
    if let Some(value) = &value {
        // All fine, use the API as expected
        value_builder.append_slice(value)?;
        list_builder.append(true)?;
    } else {
        // How to advance by for without setting nulls?
        value_builder.append_nulls(4)?;
        list_builder.append(false)?;
    }
}

let array = list_builder.finish();

// Check representation as noted in the specs
assert_eq!(array.length(), 4);
assert_eq!(array.null_count(), 1);

// Check value array as noted in the specs
assert_eq!(array.values().len(), 16);
assert_eq!(array.values().null_count(), 0);

As I wrote value_builder.append_nulls(4) the value array has the nulls inside. As in the specs, the values are unspecified and the null count of the value array must be 0. In order to achieve this, I could replace it with value_builder.append_values(&[0, 0, 0, 0], &[false, false, false, false]) or do a loop, but a) this introduces unnecessary overhead because the values are not needed, b) is pretty boilerplate code, and c) isn't very easy if the type isn't known beforehand in more complex examples (like deeply nested types, etc.).

Proposed solution

If builders would have an fn advance method, then one could simply call array.advance(4), which won't set the values to null at the internal bitmap.

The BufferBuilder already has an advance method, this could probably be forwarded to the user.

alamb commented 2 years ago

I wonder if we could just add advance() to the ListBuilder interface.

I am a little worried about adding a pub advance() function to all the builders that simply leaves the values undefined as it would be fairly easy to misuse.

Maybe we ListBuilder::advance could be pub and the advance() functions on the value builders could be pub(crate) 🤔