Closed youngsofun closed 3 months ago
encounter panic when reading a column containing float values under array or map structures, such as:
array<float>
array<tuple<float, int>>
array less than expected length thread 'basic_test_nested_array_float' panicked at src/array_decoder/mod.rs:71:30: array less than expected length
Add the following code to tests/basic/data/write.py:
tests/basic/data/write.py
nested_array_float = { "value": [ [1.0, 2.0], [None, 2.0], ], } _write("struct<value:array<float>>", nested_array_float, "nested_array_float.orc")
Run the test against it like nested_array.orc
FloatIter uses number of rows instead of number of leaf values.
Refer to the code here: array_decoder/mod.rs
FloatIter do not have to guard against number of values. It can simply read until the end of bytes in memory, similar to how integers are handled.
pr: https://github.com/datafusion-contrib/datafusion-orc/pull/112
Is the Column::number_of_rows intended to represent number_of_values? If so, this code seems incorrect:
pub fn children(&self) -> Vec<Column> { .... DataType::List { child, .. } => { vec![Column { number_of_rows: self.number_of_rows, footer: self.footer.clone(), name: "item
In my understanding, we have no way to derive number of values from the parent column. however we can retrieve number of values from the stripe metadata, this requires a refactor.
Here’s an ugly but workable fix: Commit
What's wrong?
encounter panic when reading a column containing float values under array or map structures, such as:
array<float>
array<tuple<float, int>>
How to reproduce?
Add the following code to
tests/basic/data/write.py
:Run the test against it like nested_array.orc
Reason
FloatIter uses number of rows instead of number of leaf values.
Refer to the code here: array_decoder/mod.rs
Quick Fix
FloatIter do not have to guard against number of values. It can simply read until the end of bytes in memory, similar to how integers are handled.
pr: https://github.com/datafusion-contrib/datafusion-orc/pull/112
Some questions about Column::number_of_rows
Is the Column::number_of_rows intended to represent number_of_values? If so, this code seems incorrect:
In my understanding, we have no way to derive number of values from the parent column. however we can retrieve number of values from the stripe metadata, this requires a refactor.
Here’s an ugly but workable fix: Commit