apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Parquet reader can generate incorrect validity buffer information for nested structures #6510

Open bkirwi opened 2 weeks ago

bkirwi commented 2 weeks ago

Describe the bug

Parquet decoding can produce an arrow array that will fail validation. In particular:

This can go wrong when struct arrays are nested. Suppose we have a nested schema like {outer: { inner: { primitive: utf8 } } }, where the primitive array is not nullable. The primitive array reader will generate a validity buffer in any case, but whether the outer structs do depends on whether they have themselves been declared nullable... so it's quite easy to end up with a struct that has no validity buffer but whose non-nullable field is an array that does carry nullability info, which causes trouble.

To Reproduce

I've added a failing assert on a branch: https://github.com/apache/arrow-rs/compare/master...bkirwi:arrow-rs:parquet-bug

(Hopefully it's fairly uncontroversial that that assert should pass? If this example isn't compelling, I've also seen this fail in other places; e.g. take on nested struct arrays will run validity checks and may error out.)

Expected behavior

I believe that the result of decoding should be a valid array according to Arrow's internal checks.

Additional context

While the failing test is a map_array test, I think this is a more general issue - maps just happen to have the mix of deep structure and mixed nullability that triggers it.

I am not 100% confident of my diagnosis yet, though I think the general shape is correct; I plan to follow up next week with more information or a fix.

tustvold commented 2 weeks ago

This sounds a lot like an issue I ran into on https://github.com/apache/arrow-rs/pull/4261.

My recollection is hazy, but I seem to remember this might be a limitation of the way masked nulls are computed during validation, in particular that it isn't recursive and only considers one level up.

Generating the intermediate null masks, whilst technically unnecessary, might be a way to get around this.

bkirwi commented 2 weeks ago

Ah, yeah, that does look similar.

I'd been thinking that we'd need to generate the intermediate null masks, but https://github.com/apache/arrow-rs/issues/4252 reminds me that it would be equally valid (and rather more efficient) to instead skip the null mask for the primitive array. I've pushed a small commit to the branch I linked above that does the naive thing: dropping the generated null mask instead of attaching it to the generated array data. And it does pass tests!

bkirwi commented 1 week ago

On reflection, skipping the null buffer for non-nullable fields feels best to me... both generating the intermediate null masks and dropping unnecessary null masks would fix the issue, so we might as well bias toward the option that involves less memory and compute.

I went ahead and opened the linked branch as a PR, though I'm happy to update it based on any other feedback on the issue here!