Open philrz opened 1 month ago
After reviewing these symptoms in a team meeting, @nwt said it's likely this would be a bug in the Arrow Parquet writer that Zed's Parquet writer depends on.
This does look like an Arrow Parquet bug. Note the bogus min value in the page statistics here. That's the same value DuckDB complains about.
$ echo '{device_floor: 3 (uint8)}' | super -f parquet -o repro-uint8.parquet -
$ parquet pages repro-uint8.parquet
Column: device_floor
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict _ _ 1 4.00 B 4 B
0-1 data _ R 1 9.00 B 9 B 0 "4294967295" / "3"
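To spell out why that min looks bogus, here is a small sanity check in Python. It is only a sketch: the two numbers are copied from the `parquet pages` output above, and `stats_problems` is a hypothetical helper written for this comment, not part of Zed, Arrow, or DuckDB. It also prints the same 32 bits read as a signed integer. That is just an observation about the bit pattern, not a confirmed diagnosis of what the writer is doing.

```python
import struct

# Values copied from the `parquet pages` output above.
reported_min = 4294967295
reported_max = 3

UINT8_MAX = 2**8 - 1  # largest value a uint8 column can hold


def stats_problems(min_value, max_value, type_max=UINT8_MAX):
    """Return the reasons a (min, max) statistics pair is implausible.

    Hypothetical helper for illustration only.
    """
    problems = []
    if min_value > max_value:
        problems.append("min is greater than max")
    for name, value in (("min", min_value), ("max", max_value)):
        if not 0 <= value <= type_max:
            problems.append(f"{name} {value} is outside 0..{type_max}")
    return problems


# Reinterpret the reported min's 32 bits as a signed integer.
as_int32 = struct.unpack("<i", struct.pack("<I", reported_min))[0]

print(stats_problems(reported_min, reported_max))
print(hex(reported_min), as_int32)
```

A reader that validates statistics against the column type has good reason to reject the page, while a reader that ignores statistics would load the file without noticing.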
tl;dr
The file generated by the following can be read back in with Zed tools but not some other tools.
Details
Repro is with Zed commit b05e70b.
The Parquet file generated is readable by zq itself. However, if I try to read the file back with DuckDB, it fails.
Likewise with Tad, which doesn't show any kind of error, just a blank screen.
https://github.com/user-attachments/assets/8d2e32b4-85cc-4484-bab2-ab4f04f3da60
As implied by the DuckDB error message, it seems the uint8 value is the root cause, since if I stick to the default integer, the problem doesn't appear.
https://github.com/user-attachments/assets/9a8bf6a1-2c8e-44c4-963a-3f415de8ba0d
Of course, even though I can show multiple tools choking on it, the reader at https://parquetreader.com loads the file with the uint8 without complaint, so it's hard for me to guess if this is a bug or if some tools just have spotty type coverage. That said, DuckDB has no problem with its own Parquet file that contains a uint8. Tad loads that one ok too.