confluentinc / kafka-connect-storage-common

Shared software among connectors that target distributed filesystems and cloud storage.

Extend list of basic partitioner: FieldAndTimeBasedPartitioner.java & HeaderAndTimeBasedPartitioner.java #290

Open ostetsenko opened 1 year ago

ostetsenko commented 1 year ago

We use Kafka Connect to dump topics to AWS S3. Analyzing the data is then straightforward with Athena + AWS Glue (Crawlers) + AWS S3. This appears to be a common setup for AWS users.

Problem: The core problem arises when we partition by fields taken from the Kafka message. Athena cannot create a table, because each partition component of the S3 subpath becomes a column and every JSON key in the message becomes a column too. A field used for partitioning therefore appears twice, and two columns with the same name are not allowed.

Solution: It would be a good idea to add a partitioner based on a header field and time.
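To illustrate the idea, here is a minimal, dependency-free sketch of how such a partitioner might build its path. It is not code from this repo: the class name `HeaderAndTimePathSketch`, the method `encodePartition`, the `name=value` layout, and the `year=/month=/day=/hour=` time format are all assumptions made for illustration. A real implementation would presumably implement the repo's `Partitioner` interface and read the values from the record's headers rather than from a plain `Map`.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: builds an S3-style partition path from record headers plus a timestamp.
final class HeaderAndTimePathSketch {

    // Assumed time layout (UTC, hourly buckets); a real partitioner would make this configurable.
    private static final DateTimeFormatter TIME_FORMAT =
        DateTimeFormatter.ofPattern("'year='yyyy/'month='MM/'day='dd/'hour='HH")
            .withZone(ZoneOffset.UTC);

    // headers:     header name -> value, standing in for the Kafka record headers
    // headerNames: the headers to partition by, in path order
    // timestampMs: record timestamp in epoch milliseconds
    static String encodePartition(Map<String, String> headers,
                                  List<String> headerNames,
                                  long timestampMs) {
        StringBuilder path = new StringBuilder();
        for (String name : headerNames) {
            String value = headers.get(name);
            if (value == null) {
                // Fail loudly instead of writing to a path with a missing component.
                throw new IllegalArgumentException("Missing header: " + name);
            }
            path.append(name).append('=').append(value).append('/');
        }
        // Append the time-based part after the header-based part.
        path.append(TIME_FORMAT.format(Instant.ofEpochMilli(timestampMs)));
        return path.toString();
    }
}
```

Because the partition columns come from header names rather than from keys in the JSON body, they do not have to repeat any column derived from the message itself, provided the header names are chosen so that they differ from the body's keys.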

Extra: There is also a good custom partitioner, FieldAndTimeBasedPartitioner, which could be included in this repo as one of the default partitioners.