amazon-ion / ion-hive-serde

A Apache Hive SerDe (short for serializer/deserializer) for the Ion file format.
Apache License 2.0
28 stars 12 forks source link

Fixes an IonStructToMap object inspector NPE #91

Closed cheqianh closed 2 years ago

cheqianh commented 2 years ago

Description:

This PR adds a null check for IonStructToMap object inspector to avoid NPE.

This PR is consisted of two commits, the first one (91736de) contains unit tests and the second one (a226202) includes the fix.

Issue:

Ion-hive-serde Struct to Map conversion throws NPE for null value.

More specifically, ion-path-extractor only matches the top-level container instead of Ion values within the nested container with default configuration. So the null values within the nested container are not filtered and are still treated as Ion null other than primitive null as we expect here (This is kind of like the case insensitive issue we saw earlier) so that Ion case insensitive decorator will later convert it back to primitive null here when iterate the struct. Therefor in the for loop, null.getFieldName() throws NPE.

Solution in this PR

Since Hive doesn't know how to parse Ion null so we need to add primitive null to the struct. Currently, we filter Ion null in deserializer before asking object inspector to do any conversion. But object inspector should be able to detect Ion null as it passed the final Map to upper layer (E.g. presto/hive). So In this PR, I added a Null check in object inspector to make sure all values are parsed correctly.

The solution follows three points below.

  1. Don't change ion-path-extractor's current behavior. We may change ion-path-extractor to filter null values within a nested struct but in this PR we leave path-extractor as is.
  2. To avoid any confusion, Ion container case insensitive decorator should behavior the same as regular ion container without any null check logic, specifically for iterator. When decorator's iterator encounter an Ion null, it should returns Ion null rather than null.
  3. Ion object inspector should be able to check null type independently. It's ok to have other methods to filter null but object inspectors should make sure all values are validated before passing them to hive/presto.

I added null check for Ion Map to Struct, once it's addressed. I'll add null check for other struct object inspectors.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cheqianh commented 2 years ago

Description:

We are currently removing null value from the top-level container here since Hive can't handle IonNull. However this code chunk doesn't work as expected - we need to keep IonNull values as primitive null instead of removing them from the container.

This is a good opportunity to remove unnecessary null check logic and rely on object inspectors to filter Ion null. (See Solution in the PR bullet point 3 above. We should rely on object inspector to centrally filter null values to avoid potential human error. We should not rely on developers to handle all null types in different methods.)

Changes since the last review are:

An overview of changes including all below 5 commits is here.

1845a62 improved unit tests style for easier debugging. d2bc9dc, 1fa0a61 and 9836da9 added null type check for Ion container object inspectors and their unit tests.
80f0f25 makes object inspector's get element methods also return primitive null for IonNull. I don't know where these methods will be used, but I kept them behavior the same as before to avoid potential issue.

cheqianh commented 2 years ago

The unit test for the initial NPE is here

BEFORE

Two cases,

  1. Regular Ion struct. It fails as it returns IonNull.

    Screen Shot 2022-09-07 at 5 58 11 PM
  2. CaseInsensitiveDecorator wrapped Ion struct. It throws NPE.

    Screen Shot 2022-09-07 at 5 59 06 PM

AFTER

GHA passed.

cheqianh commented 2 years ago

Overview for changes since last review is available here