apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Error Instead of Panic On Attempting to Write More Than 32769 Row Groups #6591

Open pacman82 opened 4 days ago

pacman82 commented 4 days ago

Describe the bug

The i16 that counts row groups overflows and becomes negative, causing a panic.

To Reproduce

Writing 32769 row groups with the file writer

Expected behavior

Maybe an error indicating that too many batches have been written would be preferable. Alternatively it would be nice if this just worked, yet I could also get behind the thinking that this may be too many row groups for a single file anyway.

Additional context

Occurred in the context of a user running odbc2parquet. His row groups were very small (15 rows) due to an issue with his row sizes, causing him to write lots of row groups into a single file. See: https://github.com/pacman82/odbc2parquet/issues/652

tustvold commented 4 days ago

The i16 is actually a limit enforced by the Parquet format itself - https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L940

Row groups of this size are such a bad idea that the format actively prevents them 😅

That being said, we could make this an error rather than a panic.
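
A minimal sketch of what "error, not panic" could look like, assuming the writer tracks a zero-based row group index as a `usize` and has to narrow it to the format's i16 ordinal. The names `TooManyRowGroups` and `row_group_ordinal` are hypothetical and are not part of the parquet crate's API; the real fix may well look different.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Hypothetical error type for this sketch; not the parquet crate's own error type.
#[derive(Debug, PartialEq)]
pub struct TooManyRowGroups {
    /// Zero-based index of the row group that could not be represented.
    pub index: usize,
}

impl fmt::Display for TooManyRowGroups {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "row group index {} does not fit in an i16 ordinal (max {})",
            self.index,
            i16::MAX
        )
    }
}

impl std::error::Error for TooManyRowGroups {}

/// Narrow a zero-based row group index to an i16 ordinal.
/// A checked conversion returns an error for out-of-range values
/// instead of wrapping to a negative number or panicking.
pub fn row_group_ordinal(index: usize) -> Result<i16, TooManyRowGroups> {
    i16::try_from(index).map_err(|_| TooManyRowGroups { index })
}
```

With this shape, indices 0 through 32767 convert cleanly, and index 32768 (the 32769th row group) yields an `Err` that the caller can surface to the user.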

pacman82 commented 3 days ago

No disagreement here. I am exploring opportunities to change the UX of odbc2parquet in a way to avoid this scenario entirely, but still felt that the panic should be an error.