Closed · rjdp closed this issue 1 year ago
This connector consumes Debezium events, which carry a transaction timestamp that the connector uses as the partitioning key. With the current implementation there is no possibility to configure which field is used for partitioning, but you could write a Kafka Connect transformation to put the value of your field into the `ts` field.
https://github.com/getindata/kafka-connect-iceberg-sink#iceberg-partitioning-support
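One possible workaround (a sketch, not tested against this sink): instead of writing a custom transformation, Kafka Connect's built-in `ReplaceField` SMT can rename an existing record field into the flattened Debezium timestamp name that the sink reads (`__source_ts_ms`). Here `event_ts` is a hypothetical field assumed to exist in your records:

```properties
# Hypothetical sketch: rename the record field "event_ts" (an assumed
# field name in your events) to the flattened Debezium timestamp field.
transforms: rename
transforms.rename.type: org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.rename.renames: event_ts:__source_ts_ms
```

Note that `ReplaceField` only renames top-level fields; it cannot move a value into a nested `source` struct.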
I'm also looking to use this connector without Debezium, and having a configuration option for setting the partitioning field would be great.
@johanhenriksson I don't currently have time to do the implementation here, but I would be happy to review it.
Cool, I might give it a shot. My java is super rusty though. Just wanted to let you know that the feature would be appreciated :)
Trying to work around it for now, but it doesn't seem to work. From the readme, it seems events should have a timestamp in this format:
{
  "sourceOffset": {
    "ts_ms": 123
  },
  // other fields...
}
But looking at the source, it looks like it should be:
{
"__source_ts_ms": 123,
// other fields...
}
or
{
"__source_ts": 123,
// other fields...
}
Could you clarify where to put the timestamp? :)
@johanhenriksson It should be in the first format you presented. The reason it looks different in the source is that an unwrap transformation runs before the sink:
transforms: unwrap
transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.add.fields: op,table,source.ts_ms,db
To be more specific, it should be:
{
"source" : {
"ts_ms": 123
}
}
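For illustration, a record in that nested format would be flattened by the `ExtractNewRecordState` transform above into something like the following (a sketch; the `__` prefix is Debezium's default for fields added via `add.fields`, and the `"orders"`/`"inventory"` values are placeholders):

```json
{
  "__op": "c",
  "__table": "orders",
  "__source_ts_ms": 123,
  "__db": "inventory"
}
```

This is why the sink's source code reads `__source_ts_ms` even though the event you produce carries a nested `source.ts_ms`.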
Thanks for the quick reply! It still doesn't seem to work, however. I'm not sure if it could be because I'm changing an already existing table?
The transform is defined on the Debezium source, and I'm not using that. Is it still the correct format? I'm trying to write events directly to a topic and have this connector write them to Iceberg.
@johanhenriksson Do you create a full Debezium event, with schema and so on? Please take a look at this blog post: https://getindata.com/blog/real-time-ingestion-iceberg-kafka-connect-apache-iceberg-sink/ — in the second part there is an example of creating Debezium events with Python code.
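As a starting point, here is a minimal sketch of building such an envelope in Python (the exact schema the sink requires should be checked against the blog post above; the field layout here just follows the general Debezium JSON-converter convention of `schema` + `payload`):

```python
import json
import time

def make_debezium_event(after_row: dict) -> str:
    """Build a minimal Debezium-style change event envelope (a sketch).

    The nested source.ts_ms is the timestamp the sink partitions on.
    """
    ts_ms = int(time.time() * 1000)
    payload = {
        "op": "c",                   # "c" = create/insert
        "ts_ms": ts_ms,
        "source": {"ts_ms": ts_ms},  # timestamp used for partitioning
        "after": after_row,          # the new row state
    }
    envelope = {
        "schema": {
            "type": "struct",
            "fields": [
                {"field": "op", "type": "string", "optional": False},
                {"field": "ts_ms", "type": "int64", "optional": True},
                # in a real event, "source" and "after" get their own
                # struct schemas here
            ],
        },
        "payload": payload,
    }
    return json.dumps(envelope)
```

The resulting string would then be produced directly to the topic the sink connector consumes.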
How can we use a partition timestamp field other than the message ingest time? I want to use the Mongo ObjectId, which I am able to convert using a custom UDF and create a stream on.
Suppose this is my stream, backed by topic "k1". I would like to use the `ts` field for partitioning.
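For reference, the first four bytes of a Mongo ObjectId encode the document's creation time in whole seconds since the Unix epoch, so a UDF like the one mentioned above can derive a millisecond timestamp from the hex string. A minimal sketch in Python:

```python
def objectid_to_ts_ms(oid_hex: str) -> int:
    """Extract the creation timestamp (milliseconds) from a Mongo ObjectId.

    The first 4 bytes (8 hex characters) of an ObjectId are the creation
    time in seconds since the Unix epoch; multiply by 1000 to get the
    millisecond precision that ts_ms-style fields expect.
    """
    return int(oid_hex[:8], 16) * 1000

# Example with a made-up ObjectId created around 2020-10-21 (UTC):
print(objectid_to_ts_ms("5f8f8b8b2ab79c6b52000000"))  # 1603242891000
```

The resulting value could then be written into the event's timestamp field before it reaches the sink.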