databricks / iceberg-kafka-connect


Question: io.tabular.iceberg.connect.transforms.DmsTransform and iceberg.tables.default-partition-by #248

Open rwilliams-r7 opened 6 months ago

rwilliams-r7 commented 6 months ago

I have a question: I am trying to use io.tabular.iceberg.connect.transforms.DmsTransform and iceberg.tables.default-partition-by together.

Based on the documented format, I tried iceberg.tables.default-partition-by=hour(_cdc.ts), but this does not seem to work. Looking over the code, the connector does not appear to be able to dig into the _cdc struct in this case.
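For reference, the relevant part of the sink config looks roughly like this (other required connector properties omitted):

```
transforms=dms
transforms.dms.type=io.tabular.iceberg.connect.transforms.DmsTransform
iceberg.tables.default-partition-by=hour(_cdc.ts)
```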

Does the ts field need to be top level?

If so, when using io.tabular.iceberg.connect.transforms.DmsTransform, how have you seen these used together?

Just to add: is this the same for iceberg.tables.default-id-columns?

Is the fix that all of these fields need to be top level? If so, it seems we would move the identifiers to the top level in the DmsTransform output, something like { ts, id, data { }, metadata { } }, possibly using the CopyValue transform.
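Something along these lines is what I have in mind (illustrative only; I am assuming CopyValue takes source.field and target.field, and I am not sure it can read a nested source field):

```
transforms=dms,copyTs
transforms.dms.type=io.tabular.iceberg.connect.transforms.DmsTransform
transforms.copyTs.type=io.tabular.iceberg.connect.transforms.CopyValue
transforms.copyTs.source.field=_cdc.ts
transforms.copyTs.target.field=ts
iceberg.tables.default-partition-by=hour(ts)
```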

gaydba commented 6 months ago

It seems that Iceberg currently doesn't support partitioning on nested fields; there is a feature request for that: https://github.com/apache/iceberg/issues/8175
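To make that concrete, a minimal sketch with the Iceberg Java API, using hypothetical field names; the nested case is the one the linked issue asks for:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class NestedPartitionSketch {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.required(1, "ts", Types.TimestampType.withZone()),
        Types.NestedField.required(2, "_cdc", Types.StructType.of(
            Types.NestedField.required(3, "ts", Types.TimestampType.withZone()))));

    // Partitioning by a top-level timestamp works today.
    PartitionSpec spec = PartitionSpec.builderFor(schema).hour("ts").build();
    System.out.println(spec);

    // Per the linked feature request, the equivalent on the nested field,
    // PartitionSpec.builderFor(schema).hour("_cdc.ts"), is what is reported
    // as unsupported.
  }
}
```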

Also, the CopyValue transform doesn't support nested fields, but that could be fixed in this project. It should use something like the extractFromRecordValue helper in Utilities.java (https://github.com/tabular-io/iceberg-kafka-connect/blob/690e62e0c40480856df4b9ba1250eecb81851c18/kafka-connect/src/main/java/io/tabular/iceberg/connect/data/Utilities.java#L123C24-L123C46) instead of a raw get on the struct. A sketch of that idea is below.
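A minimal sketch of the suggested approach, assuming Kafka Connect's Struct API; the class and method names here are hypothetical, not the connector's actual code:

```java
import org.apache.kafka.connect.data.Struct;

public class NestedFieldLookup {

  // Resolves a dotted field name like "_cdc.ts" by walking nested Structs,
  // rather than a single raw struct.get(fieldName) call that only sees
  // top-level fields. Returns null if any level is missing.
  public static Object extract(Object value, String fieldName) {
    Object current = value;
    for (String part : fieldName.split("\\.")) {
      if (!(current instanceof Struct)) {
        return null;
      }
      Struct struct = (Struct) current;
      if (struct.schema().field(part) == null) {
        return null;
      }
      current = struct.get(part);
    }
    return current;
  }
}
```

With nested lookup in place, CopyValue could accept a source field like _cdc.ts and copy the timestamp to a top-level field that the partition spec can reference.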