amazon-ion / ion-hive-serde

An Apache Hive SerDe (short for serializer/deserializer) for the Ion file format.
Apache License 2.0

Documents with anonymous top level array are not properly decoded #111

Open mikereinhold opened 1 month ago

mikereinhold commented 1 month ago

According to JSON standards (RFC 4627, ECMA-404, and RFC 8259), an array is a legal top-level JSON text.

According to the Amazon Ion Hive SerDe documentation:

Because Amazon Ion is a superset of JSON, you can use the Amazon Ion Hive SerDe to query non-Amazon Ion JSON datasets.

Based on this, JSON files with top-level (anonymous) arrays should be properly understood and decoded by the Amazon Ion Hive SerDe.

For example: [{"a": "b", "b": 123, "c": true}, {"a": "z", "b": 456, "c": false}]

However, the Ion Hive SerDe does not properly interpret these files:

Table definition:

CREATE EXTERNAL TABLE `top_level_array_test`(
  `array` array<struct<a:string,b:int,c:boolean>>
)
ROW FORMAT SERDE 
  'com.amazon.ionhiveserde.IonHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'ion.encoding'='TEXT', 
  'ion.fail_on_overflow'='false',
  'ion.ignore_malformed'='false'
) 
STORED AS INPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonInputFormat' 
OUTPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonOutputFormat'
LOCATION
  '...'

However, queries against this table return no results, and the SerDe passes no input bytes to the execution engine (screenshots omitted).

In my testing, the OpenX JSON SerDe correctly handles similar data files.

rmarrowstone commented 1 month ago

Hi! It is true that Ion is a superset of JSON, but it doesn't follow that JSON Arrays should necessarily be treated as Rows/Structs by the Ion SerDe. I understand why it seems implied, but it's not a given.

We don't have any plans for active development on the Hive SerDe but other ecosystem integrations (namely Trino) are in-flight. In what engine/deployment are you using the Hive SerDe? Trino? AWS Athena? Spark? Something else?
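Until top-level arrays are handled, one possible workaround is to preprocess such files so that each array element becomes its own top-level JSON object, one per line. The sketch below is only an illustration, not part of the SerDe: the function name `explode_top_level_array` is hypothetical, it uses only the Python standard library, and it assumes the SerDe maps each top-level struct to one row. The table would then presumably need to declare columns `a`, `b`, and `c` directly rather than a single `array` column; this has not been verified against the Ion Hive SerDe.

```python
import json


def explode_top_level_array(text: str) -> str:
    """Rewrite a JSON document whose top-level value is an array
    as one JSON value per line (hypothetical helper, not a SerDe API)."""
    doc = json.loads(text)
    if not isinstance(doc, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep each element on a single line.
    return "\n".join(json.dumps(item, separators=(",", ":")) for item in doc)


# Sample document taken from the report above.
sample = '[{"a": "b", "b": 123, "c": true}, {"a": "z", "b": 456, "c": false}]'

if __name__ == "__main__":
    print(explode_top_level_array(sample))
```

Each output line is then a standalone JSON object, which is also valid Ion text.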