ZJONSSON / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License
34 stars 61 forks source link

Top level array of strings field not working on Athena #75

Open billjohnston opened 3 years ago

billjohnston commented 3 years ago

Having trouble getting a top level array field to work with Athena

Athena create table query:

CREATE EXTERNAL TABLE IF NOT EXISTS db.table (
    id STRING, 
    ...etc
    tags ARRAY<STRING>
)
ROW FORMAT SERDE 
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket'

parquetjs schema:

new ParquetSchema({
    id: { type: 'UTF8' },
    ...etc
    tags: { type: 'UTF8', repeated: true },
})

I'm able to upload and create the parquet file with this, and queries that don't include the tags field are working fine:

SELECT id from db.table 

But if I run any query that includes the tags field I get this error:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<path_to_the_parquet_file> (offset=0, length=8612):
org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO

I noticed there is a LIST field type that is supposed to work with Athena, but I'm not sure how I'd specify a top level list of strings

lwillmeth commented 2 years ago

Did you ever figure this out? I spent a couple hours today and finally came up with this schema:

new parquetJs.ParquetSchema({
    names: {
        type: 'LIST',
        fields: {
            list: {
                repeated: true,
                fields: {
                    element: {
                        type: 'UTF8'
                    }
                }
            }
        }
    }
})

I also had to prepare the data going into the names field by wrapping values with list and element, like so:

{
  names: {
    list: [
      { element: 'cathy' },
      { element: 'Zooplar the Magnificent' },
      { element: 'tim' }
    ]
}

Using this pattern, AWS Glue sees the names field as array<string>, and they show up in Athena queries.

It feels like this was a LOT harder than it should be, so if anyone knows of an easier way, please do let me know. 😅