We use Kafka Connect to dump topics to AWS S3. Analyzing the data is straightforward with Athena + AWS Glue (Crawlers) + AWS S3, and this seems to be a common setup among AWS users.
Problem
The problem appears when we partition by fields taken from the Kafka message. Athena cannot create the table: the segments of the S3 subpath become partition columns, while every JSON key in the message becomes a regular column, and two columns with the same name are not allowed.
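For illustration (bucket, topic, and field names here are hypothetical): with the stock field-based partitioner on `user_id`, the S3 object key repeats a key that is still present inside the JSON payload, so a Glue crawler derives two columns named `user_id` — one from the path, one from the data:

```
# S3 object key produced by field-based partitioning
s3://my-bucket/topics/events/user_id=42/events+0+0000000000.json

# JSON record stored inside that object — the same key appears again
{"user_id": 42, "action": "click", "ts": 1690000000}
```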
Solution
A good solution is to add a custom Partitioner based on a header field and time: the partition value is taken from the record header rather than the message body, so it never collides with a JSON key.
Extra
There is a good custom Partitioner, FieldAndTimeBasedPartitioner, in this repo; it can also be used as a default.
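As a sketch, an S3 sink connector config using such a custom partitioner might look like the fragment below. The `partitioner.class` package, bucket, topic, and field names are assumptions for illustration (check the repo's README for the actual class path and supported properties); the remaining keys are standard Confluent S3 sink settings.

```json
{
  "name": "s3-sink-events",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-bucket",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "partitioner.class": "com.example.FieldAndTimeBasedPartitioner",
    "partition.field.name": "tenant",
    "partition.duration.ms": "3600000",
    "path.format": "'date'=YYYY-MM-dd",
    "locale": "en-US",
    "timezone": "UTC"
  }
}
```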