ZJONSSON / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Athena MAP support needs MAP_KEY_VALUE type for inner group #50

Open SeanLMcCullough opened 3 years ago

SeanLMcCullough commented 3 years ago

After experimenting with the MAP type for Athena, it appears that the structure this library writes is not quite right.

Here is the schema output from parquet-tools for the MAP data generated by Kinesis Firehose:

  optional group my_data (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }

Note the MAP_KEY_VALUE annotation on repeated group map.

However, when I generate the MAP data type with this schema definition:

...
"my_data": {
  "type": "MAP",
  "fields": {
    "map": {
      "repeated": true,
      "fields": {
        "key": {
          "type": keyType,
          "optional": true
        },
        "value": {
          "type": valueType,
          "optional": true
        }
      }
    }
  }
}
...
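
For reuse across several MAP columns, the definition above can be wrapped in a small helper. This is only a sketch: mapField is a hypothetical name and not part of parquetjs, and the 'UTF8' key and value types are an assumption standing in for keyType and valueType.

```javascript
// Hypothetical helper (not part of parquetjs): returns the same MAP field
// definition shown in this issue for a given key type and value type.
function mapField(keyType, valueType) {
  return {
    type: 'MAP',
    fields: {
      map: {
        repeated: true,
        fields: {
          // Both fields are optional here, exactly as in the definition above.
          key: { type: keyType, optional: true },
          value: { type: valueType, optional: true }
        }
      }
    }
  };
}

// Assumption: string keys and values, written with the 'UTF8' type name.
const schemaDefinition = { my_data: mapField('UTF8', 'UTF8') };
console.log(JSON.stringify(schemaDefinition, null, 2));
```

The helper only reproduces the structure I used. It does not add MAP_KEY_VALUE to the inner group.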

The file written by the library has the following schema, as reported by parquet-tools:

  optional group my_data (MAP) {
    repeated group map {
      optional binary key (STRING);
      optional binary value (STRING);
    }
  }

Note that repeated group map omits the MAP_KEY_VALUE in the schema.
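
To make the comparison mechanical, the two parquet-tools dumps above can be diffed line by line. This is a minimal sketch in plain JavaScript; the function names schemaLines and diffSchemas are hypothetical, and the two strings are copied from the dumps in this issue.

```javascript
// Schema reported by parquet-tools for the Kinesis Firehose file.
const firehoseSchema = `
optional group my_data (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (STRING);
    optional binary value (STRING);
  }
}`;

// Schema reported by parquet-tools for the file written by this library.
const librarySchema = `
optional group my_data (MAP) {
  repeated group map {
    optional binary key (STRING);
    optional binary value (STRING);
  }
}`;

// Split a dump into trimmed, non-empty lines so indentation is ignored.
function schemaLines(dump) {
  return dump.split('\n').map((line) => line.trim()).filter((line) => line.length > 0);
}

// Compare two dumps position by position and collect the lines that differ.
function diffSchemas(expected, actual) {
  const a = schemaLines(expected);
  const b = schemaLines(actual);
  const diffs = [];
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    if (a[i] !== b[i]) {
      diffs.push({ line: i + 1, expected: a[i], actual: b[i] });
    }
  }
  return diffs;
}

const differences = diffSchemas(firehoseSchema, librarySchema);
for (const d of differences) {
  console.log(`line ${d.line}: expected "${d.expected}", got "${d.actual}"`);
}
```

Besides the missing MAP_KEY_VALUE on the inner group, this also flags the key field, which is required in the Firehose schema but optional in the schema written by the library.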

As a result, the AWS Glue crawler interprets the two schemas differently. For the data generated by Kinesis Firehose, Glue parses the schema as follows: [screenshot: Screen Shot 2020-09-14 at 3 01 10 pm]

However, for the data generated by this library, Glue parses the schema as follows: [screenshot: Screen Shot 2020-09-14 at 3 01 01 pm]

I am unsure whether I am using the MAP support in this library incorrectly, however, as it is an undocumented feature. The structure of this schema is based on Parquet files generated by a Kinesis Firehose pipeline.