facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0
3.2k stars 1.06k forks

exportFlattenedVector does not support nested encoding and non-scalar types #9821

Open rui-mo opened 2 weeks ago

rui-mo commented 2 weeks ago

Description

In https://github.com/facebookincubator/velox/commit/98308143adc593fabbbf23f1b9a12c02d462fed6, flattenDictionary and flattenConstant were set to true for Parquet writes, which rely on Bridge to convert Velox vectors to Arrow arrays. When VectorFuzzer generates nested dictionary-encoded vectors or vectors of non-scalar types, exporting to Arrow fails at the checks below.

https://github.com/facebookincubator/velox/blob/dc561a358016692d32e34b29fcf3aa38c8fe643f/velox/vector/arrow/Bridge.cpp#L884-L889

mbasmanova commented 2 weeks ago

CC: @Yuhta

@rui-mo Does this imply that ParquetWriter cannot create files for tables with columns of type array/map/struct?

rui-mo commented 2 weeks ago

@mbasmanova I think we cannot create Parquet files for tables with complex types only when the vector is dictionary-encoded. Otherwise, complex types are supported in Bridge, as shown below. https://github.com/facebookincubator/velox/blob/e2c0014b219f30cc007a665b5722ae8d218e391a/velox/vector/arrow/Bridge.cpp#L985-L1001

mbasmanova commented 2 weeks ago

@rui-mo If the Parquet writer can handle all types, but only flat encodings, then we can simply flatten the data in the Fuzzer before writing to Parquet.
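
To illustrate what "flatten the data" means for a nested dictionary-encoded vector, here is a minimal toy sketch. It does not use Velox's API: `ToyVector`, `makeFlat`, `wrapInDictionary`, `valueAt`, and `flatten` are hypothetical names invented for this example, and the model only covers integer values without nulls. It shows the idea only: resolve each row through every dictionary layer and copy the result into a flat buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy model only: these types are hypothetical and are not Velox classes.
struct ToyVector;
using ToyVectorPtr = std::shared_ptr<ToyVector>;

struct ToyVector {
  enum class Encoding { kFlat, kDictionary };
  Encoding encoding;
  std::vector<int> values;           // Used when encoding == kFlat.
  std::vector<std::size_t> indices;  // Used when encoding == kDictionary.
  ToyVectorPtr base;                 // Wrapped vector; may itself be a dictionary.
};

// Builds a flat vector that owns its values directly.
ToyVectorPtr makeFlat(std::vector<int> values) {
  auto v = std::make_shared<ToyVector>();
  v->encoding = ToyVector::Encoding::kFlat;
  v->values = std::move(values);
  return v;
}

// Wraps 'base' in one dictionary layer: row i refers to base row indices[i].
ToyVectorPtr wrapInDictionary(
    std::vector<std::size_t> indices,
    ToyVectorPtr base) {
  auto v = std::make_shared<ToyVector>();
  v->encoding = ToyVector::Encoding::kDictionary;
  v->indices = std::move(indices);
  v->base = std::move(base);
  return v;
}

// Number of rows visible at the top level of 'v'.
std::size_t numRows(const ToyVector& v) {
  return v.encoding == ToyVector::Encoding::kFlat ? v.values.size()
                                                  : v.indices.size();
}

// Resolves a row through all dictionary layers down to the flat values.
int valueAt(const ToyVector& v, std::size_t row) {
  const ToyVector* current = &v;
  while (current->encoding == ToyVector::Encoding::kDictionary) {
    assert(row < current->indices.size());
    row = current->indices[row];
    current = current->base.get();
  }
  assert(row < current->values.size());
  return current->values[row];
}

// Produces a new flat vector with the same logical contents as 'v'.
ToyVectorPtr flatten(const ToyVector& v) {
  std::vector<int> out;
  out.reserve(numRows(v));
  for (std::size_t row = 0; row < numRows(v); ++row) {
    out.push_back(valueAt(v, row));
  }
  return makeFlat(std::move(out));
}
```

For example, wrapping `makeFlat({10, 20, 30})` in a dictionary with indices `{2, 0, 1}`, and wrapping that again with indices `{1, 1, 0}`, gives a two-level dictionary. Calling `flatten` on it yields a flat vector holding `10, 10, 30`, with no dictionary layers left for an exporter to reject.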

rui-mo commented 2 weeks ago

@mbasmanova Got it. I will try as you suggested. Thanks.