facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

Parquet reader result differs from parquet-tools output when reading struct of arrays #7779

Open qqibrow opened 11 months ago

qqibrow commented 11 months ago

Bug description

The Velox Parquet reader returns different values than parquet-tools when reading a struct that contains two arrays. parquet-tools output:

lniu@devrestricted-lniu:~/velox_parquet_test_triage/fail_parquet_files/testStructOfTwoArrays$ parquet-tools show native_parquet_reader_test1632552490045062061parquet
+---------------------------------------------------------------------------------------------------------------+
| test                                                                                                          |
|---------------------------------------------------------------------------------------------------------------|
| {'stringarrayfield': array([], dtype=object), 'intarrayfield': array([], dtype=int32)}                        |
| {'stringarrayfield': array(['0', '1'], dtype=object), 'intarrayfield': array([1], dtype=int32)}               |
| {'stringarrayfield': None, 'intarrayfield': None}                                                             |
| {'stringarrayfield': array(['2'], dtype=object), 'intarrayfield': array([], dtype=int32)}                     |
| {'stringarrayfield': array(['3', '4'], dtype=object), 'intarrayfield': array([ 3,  5,  7, 11], dtype=int32)}  |
| {'stringarrayfield': None, 'intarrayfield': None}                                                             |
| {'stringarrayfield': array(['5'], dtype=object), 'intarrayfield': array([13, 17], dtype=int32)}               |
| {'stringarrayfield': array(['6'], dtype=object), 'intarrayfield': array([], dtype=int32)}                     |
| {'stringarrayfield': None, 'intarrayfield': None}                                                             |
| {'stringarrayfield': array(['7', '8'], dtype=object), 'intarrayfield': array([1, 3, 5, 7], dtype=int32)}      |
| {'stringarrayfield': array(['9'], dtype=object), 'intarrayfield': array([11, 13, 17], dtype=int32)}           |
| {'stringarrayfield': None, 'intarrayfield': None}                                                             |
| {'stringarrayfield': array(['10'], dtype=object), 'intarrayfield': array([1, 3], dtype=int32)}                |
| {'stringarrayfield': array(['11', '12'], dtype=object), 'intarrayfield': array([5], dtype=int32)}             |
| {'stringarrayfield': None, 'intarrayfield': None}                                                             |
| {'stringarrayfield': array(['13'], dtype=object), 'intarrayfield': array([7], dtype=int32)}                   |
| {'stringarrayfield': array(['14', '15'], dtype=object), 'intarrayfield': array([11, 13], dtype=int32)}        |
| {'stringarrayfield': None, 'intarrayfield': None}                                                             |
| {'stringarrayfield': array(['16', '17'], dtype=object), 'intarrayfield': array([17], dtype=int32)}            |
| {'stringarrayfield': array(['18', '19'], dtype=object), 'intarrayfield': array([], dtype=int32)}              |
| {'stringarrayfield': None, 'intarrayfield': None}                                                             |
| {'stringarrayfield': array(['20'], dtype=object), 'intarrayfield': array([1], dtype=int32)}                   |
| {'stringarrayfield': array(['21', '22', '23', '24'], dtype=object), 'intarrayfield': array([3], dtype=int32)} |
| {'stringarrayfield': None, 'intarrayfield': None}                                                             |
| {'stringarrayfield': array(['25', '26'], dtype=object), 'intarrayfield': array([5], dtype=int32)}             |
+---------------------------------------------------------------------------------------------------------------+

Velox Parquet reader output:

lniu@devrestricted-lniu:~/velox_parquet_test_triage/fail_parquet_files/testStructOfTwoArrays$ ~/velox_scan_parquet native_parquet_reader_test1632552490045062061parquet
number of rows: 25
velox type: ROW<test:ROW<stringarrayfield:ARRAY<VARCHAR>,intarrayfield:ARRAY<INTEGER>>>
{{<empty>, <empty>}}
{{2 elements starting at 0 {null, 0}, 1 elements starting at 0 {null}}}
{{null, null}}
{{1 elements starting at 2 {1}, <empty>}}
{{2 elements starting at 3 {2, 3}, 4 elements starting at 1 {1, null, 3, 5}}}
{{null, null}}
{{1 elements starting at 5 {4}, 2 elements starting at 5 {7, 11}}}
{{1 elements starting at 6 {5}, <empty>}}
{{null, null}}
{{2 elements starting at 7 {6, 7}, 4 elements starting at 7 {13, 17, null, 1}}}
{{1 elements starting at 9 {8}, 3 elements starting at 11 {3, 5, 7}}}
{{null, null}}
{{1 elements starting at 10 {9}, 2 elements starting at 14 {11, 13}}}
{{2 elements starting at 11 {10, 11}, 1 elements starting at 16 {17}}}
{{null, null}}
{{1 elements starting at 13 {12}, 1 elements starting at 17 {1}}}
{{2 elements starting at 14 {13, 14}, 2 elements starting at 18 {3, 5}}}
{{null, null}}
{{2 elements starting at 16 {15, 16}, 1 elements starting at 20 {7}}}
{{2 elements starting at 18 {17, 18}, <empty>}}
{{null, null}}
{{1 elements starting at 20 {19}, 1 elements starting at 21 {11}}}
{{4 elements starting at 21 {20, 21, 22, 23}, 1 elements starting at 22 {13}}}
{{null, null}}
{{2 elements starting at 25 {24, 25}, 1 elements starting at 23 {17}}}
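
A small stdlib-only sketch for narrowing down the mismatch. It uses the first seven rows of intarrayfield, transcribed by hand from the two dumps above. The helper names (`lengths`, `flatten`, `mismatched_rows`) are made up for illustration and are not part of Velox or parquet-tools.

```python
# intarrayfield for the first seven rows, copied from the two dumps above.
# None stands for a null array (at row level) or a null element (inside an array).
expected = [[], [1], None, [], [3, 5, 7, 11], None, [13, 17]]      # parquet-tools
velox = [[], [None], None, [], [1, None, 3, 5], None, [7, 11]]     # Velox reader


def lengths(rows):
    """Per-row array length, or None where the array itself is null."""
    return [None if row is None else len(row) for row in rows]


def flatten(rows):
    """Concatenate the elements of all non-null arrays in row order."""
    return [value for row in rows if row is not None for value in row]


def mismatched_rows(a, b):
    """Indices (0-based) of rows whose contents differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    print("lengths agree:", lengths(expected) == lengths(velox))
    print("rows that differ:", mismatched_rows(expected, velox))
    print("expected elements:", flatten(expected))
    print("velox elements:   ", flatten(velox))
```

On these rows the per-row array lengths agree and only the element values differ: the non-null values Velox returns follow the expected sequence, but extra nulls push them into later positions. That may point at how leaf values and nulls are decoded rather than at the array offsets; this is a guess from the dumps, not a confirmed diagnosis.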

System information

Velox System Info v0.0.2
Commit: 1e186e548833750cdee4b95d829711ddad78aba1
CMake Version: 3.16.3
System: Linux-5.4.0-1063-aws
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

Here is the file to reproduce the issue:
https://www.dropbox.com/scl/fi/tok5hjvrzx544170guxlr/native_parquet_reader_test1632552490045062061parquet?rlkey=8yw12kzsv85ht7f9dqqqlq8bi&dl=0

yingsu00 commented 4 months ago

@qqibrow Is this still an issue?