datafusion-contrib / datafusion-orc

Implementation of Apache ORC file format using Apache Arrow in-memory format
Apache License 2.0

Bug: Fail to read column of type `array<float>` #111

Closed youngsofun closed 3 months ago

youngsofun commented 3 months ago

What's wrong?

Reading a column that contains float values nested under an array or map structure panics, for example:

array less than expected length
thread 'basic_test_nested_array_float' panicked at src/array_decoder/mod.rs:71:30:
array less than expected length

How to reproduce?

Add the following code to tests/basic/data/write.py:

nested_array_float = {
    "value": [
        [1.0, 2.0],
        [None, 2.0],
    ],
}

_write("struct<value:array<float>>", nested_array_float, "nested_array_float.orc")

Then run a test against the generated file, the same way the existing test runs against nested_array.orc.

Reason

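FloatIter is bounded by the number of rows instead of the number of leaf values. For a nested column these two counts differ, so the iterator stops before every value has been read.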

Refer to the code here: array_decoder/mod.rs
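
To make the mismatch concrete, here is a small standalone sketch (not code from the crate; `count_levels` and `sample` are hypothetical names) that counts rows, list slots, and non-null leaf values for the data in the reproduction above. It assumes, as the ORC format describes, that nulls are tracked in a separate present stream, so the float data stream holds only the non-null values.

```rust
/// Returns (rows, list slots, non-null leaf values) for an `array<float>` column.
/// Hypothetical helper for illustration only.
fn count_levels(rows: &[Vec<Option<f32>>]) -> (usize, usize, usize) {
    // One entry per top-level row.
    let row_count = rows.len();
    // Every list element, null or not.
    let slots: usize = rows.iter().map(|r| r.len()).sum();
    // Only non-null elements would be stored as floats in the data stream.
    let non_null = rows.iter().flatten().filter(|v| v.is_some()).count();
    (row_count, slots, non_null)
}

/// The same data as `nested_array_float` in the reproduction.
fn sample() -> Vec<Vec<Option<f32>>> {
    vec![vec![Some(1.0), Some(2.0)], vec![None, Some(2.0)]]
}
```

For this data `count_levels(&sample())` gives `(2, 4, 3)`: two rows, but three floats to decode. An iterator capped at two values would presumably come up one short, which matches the "array less than expected length" panic.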

Quick Fix

FloatIter does not need to guard against the number of values. It can simply read until the end of the bytes in memory, similar to how integers are handled.

pr: https://github.com/datafusion-contrib/datafusion-orc/pull/112
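
As a rough illustration of the read-until-end idea (this is a sketch, not the code in the PR; `FloatsUntilEof` and `decode_all` are hypothetical names, and it assumes 4-byte little-endian IEEE 754 floats), an iterator can stop on a clean end of input and report an error only if the input ends partway through a value:

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a float iterator that has no value-count guard:
/// it yields `f32` values until the underlying reader is exhausted.
struct FloatsUntilEof<R: Read> {
    reader: R,
}

impl<R: Read> Iterator for FloatsUntilEof<R> {
    type Item = io::Result<f32>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = [0u8; 4];
        let mut filled = 0;
        while filled < buf.len() {
            match self.reader.read(&mut buf[filled..]) {
                // No bytes at a value boundary: clean end of stream.
                Ok(0) if filled == 0 => return None,
                // Input ended in the middle of a value: report truncation.
                Ok(0) => return Some(Err(io::ErrorKind::UnexpectedEof.into())),
                Ok(n) => filled += n,
                // Retry reads that were merely interrupted.
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Some(Err(e)),
            }
        }
        Some(Ok(f32::from_le_bytes(buf)))
    }
}

/// Decodes every float in `bytes`, failing if the input is truncated.
fn decode_all(bytes: &[u8]) -> io::Result<Vec<f32>> {
    FloatsUntilEof { reader: bytes }.collect()
}
```

With this shape the caller decides how many values it needs, and the iterator itself never has to know whether it sits under a list, a map, or a plain column.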

Some questions about Column::number_of_rows

Is Column::number_of_rows intended to represent the number of values? If so, this code seems incorrect:

pub fn children(&self) -> Vec<Column> {
    .... 
    DataType::List { child, .. } => {
        vec![Column {
            number_of_rows: self.number_of_rows,
            footer: self.footer.clone(),
            name: "item

As I understand it, there is no way to derive the number of values from the parent column. However, it can be retrieved from the stripe metadata, though doing so requires a refactor.

Here’s an ugly but workable fix: Commit