I'm using tap-google-analytics which returns record keys using camelCase. During table creation the fields are automatically converted to lowercase by Athena so then when I run a query its looking for keys using the lower case variant and ultimately returns null for the entire column.
I was able to get it to work by rerunning the create external table statement with case.insensitive=true in the DDL I pulled from the logs which included:
WITH SERDEPROPERTIES (
'ignore.malformed.json'='true',
'case.insensitive'='false'
)
We set case.insensitive=falsein the target when using jsonl records.
Proposed Solution:
set case.insensitive=true by default for jsonl. I don't see any harm in doing this but there might be an edge case I'm missing. If there are records with the same name that are distinct by casing theyll get clobbered either during table creation or on read but either way there will be a problem.
As part of writing to jsonl files we could lowercase all field names in the record and potentially do any additional transformations needed for Athena to accept it. Other transformations could include replacing special characters or aliasing reserved column names.
I think option 1 is best to solve the short term problem and although I'm sensitive to transforming records in the target I think option 2 might be the long term solve in addition to capture other edge cases or at least send them through a filter and send a warning log or something.
I'm using tap-google-analytics which returns record keys using camelCase. During table creation the fields are automatically converted to lowercase by Athena so then when I run a query its looking for keys using the lower case variant and ultimately returns null for the entire column.
I was able to get it to work by rerunning the create external table statement with
case.insensitive=true
in the DDL I pulled from the logs which included:We set
case.insensitive=false
in the target when using jsonl records.Proposed Solution:
I think option 1 is best to solve the short term problem and although I'm sensitive to transforming records in the target I think option 2 might be the long term solve in addition to capture other edge cases or at least send them through a filter and send a warning log or something.