Kimahriman opened this issue 3 years ago
The only difference I can see in the codegen between the parquet and delta reads is that the parquet read adds a null check on `nested`, despite the read schema being non-nullable. I guess that could explain why delta has the issue: if `nested` is non-nullable and the null check is skipped when processing it, but it's the only field you selected (and therefore the schema was pruned down to it), the parquet reader might just give you a null struct instead of a struct with a null value?
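The generated code for the two read paths can be dumped and diffed with something like the following (the path is a placeholder for wherever the table lives, and `nested.b` stands in for the evolved nested field):

```scala
import org.apache.spark.sql.execution.debug._

// Dump whole-stage codegen for the plain parquet read vs. the delta read.
spark.read.parquet("/path/to/table").select("nested.b").debugCodegen()
spark.read.format("delta").load("/path/to/table").select("nested.b").debugCodegen()
```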
I have hit this issue twice now in production and finally figured out a reproduction. Basically, if you schema-evolve a non-nullable struct field to add a new nested field, you get an NPE when trying to read that new field:
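As a minimal sketch of that kind of reproduction (not the exact code from my production case): the table path and the field names `a`/`b` inside `nested` are placeholders, and it assumes a spark-shell session with the delta-core package and the usual delta SQL extension/catalog configs already set up.

```scala
import org.apache.spark.sql.functions.struct
import spark.implicits._

// Write a table whose top-level column is a struct with a single field.
// struct(...) yields a non-nullable struct column, which is the case that
// triggers the problem. Path and field names are placeholders.
Seq(1, 2, 3).toDF("a")
  .select(struct($"a").as("nested"))
  .write.format("delta").save("/tmp/repro")

// Schema-evolve the struct by appending data with an extra nested field,
// relying on delta's mergeSchema support for adding nested columns.
Seq((4, 40), (5, 50)).toDF("a", "b")
  .select(struct($"a", $"b").as("nested"))
  .write.format("delta").mode("append")
  .option("mergeSchema", "true")
  .save("/tmp/repro")

// Selecting only the newly added nested field is what throws the NPE.
spark.read.format("delta").load("/tmp/repro").select("nested.b").show()
```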
Error:
Interestingly, reading the parquet files directly works fine, as does reading as delta and selecting multiple fields, but reading as delta and selecting only that column throws the NPE. I haven't dug too much into why yet, but this would suggest it's a delta issue rather than a spark issue, even though the stack trace has nothing delta-related in it? It also only happens when the struct is non-nullable.
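Concretely, with the placeholder table from the sketch above, the three cases look roughly like this:

```scala
// Works: reading the underlying parquet data files directly.
spark.read.parquet("/tmp/repro").select("nested.b").show()

// Works: reading as delta but selecting more than just the evolved field.
spark.read.format("delta").load("/tmp/repro").select("nested.a", "nested.b").show()

// NPE: reading as delta and selecting only the evolved field.
spark.read.format("delta").load("/tmp/repro").select("nested.b").show()
```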