okayhooni closed this issue 1 year ago
I tested the producer.override.max.request.size option on this Iceberg sink connector, but it is NOT working..!
producer.override.max.request.size: 2097152 # default: 1048576
@bryanck
Try iceberg.kafka.max.request.size. The size of the message is somewhat concerning; are you creating many files per table per commit?
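As a quick sketch, that line in the connector config would look like the following, reusing the value from the producer.override attempt above (the value itself is just an example):
iceberg.kafka.max.request.size: 2097152 # default max.request.size: 1048576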
Thanks, I will try that option!
The connector was stopped for nearly 5 hours due to a type mismatch issue. I renamed the auto-created field and recreated a new field with the proper type, and then it started consuming again from the record offset of five hours earlier.
The target table is an hourly partitioned table, and each hourly partition has about 30 sub-partitions, so it might try to ingest almost 150x files (= 5 hours * 30 sub-partitions * x) per table per commit, where x is the number of files uploaded to S3 per sub-partition path within the commit interval (= 5 minutes).
How many columns does your table have, including those nested in structs? You may want to limit the columns storing stats in the metadata to decrease the size, if you have a lot of columns.
Sure... That table currently has 544 flattened columns (and it will grow by 3~5 new columns per day..), and I set 'write.metadata.metrics.max-inferred-column-defaults' = 700.
If I decrease 'write.metadata.metrics.max-inferred-column-defaults' to a small value like 100, can this issue be avoided?
You could decrease write.metadata.metrics.max-inferred-column-defaults from the default of 100. Another option is to set write.metadata.metrics.default to none and enable metrics only on the columns you will be filtering on, with write.metadata.metrics.column.*.
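For example, a minimal sketch of these table properties, in the same quoting style used above (event_ts is a hypothetical column you actually filter on):
'write.metadata.metrics.default' = 'none' # no column stats captured by default
'write.metadata.metrics.column.event_ts' = 'full' # hypothetical filter column: keep full lower/upper bounds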
Thanks! I will do that.
But I don't understand why this big metadata size is related to the size of the records produced to the control topic by this sink connector..
Does the control-topic record also have fields related to this Iceberg column metadata?
Yes, the control message includes the file path as well as metadata related to the file, such as column stats. These stats are captured during the write process.
Aha.. Could you explain why that column metadata also has to be ingested into the control topic..?
Is it necessary to ensure the exactly-once semantics provided by the control topic?
The column stats are written to manifest files as part of the Iceberg commit. Iceberg uses these stats to prune the file scan list during query planning. Because the worker captures the stats during the write process, these need to be passed to the coordinator in the control message.
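To illustrate, here is a simplified, hypothetical sketch of the per-file metadata a control message carries; the field names follow Iceberg's DataFile/manifest schema, and this is not the connector's exact wire format:
data_file:
  file_path: s3://bucket/warehouse/db/table/data/00000-0-....parquet # hypothetical path
  record_count: 12345
  null_value_counts: {1: 0, 2: 42} # keyed by column field id
  lower_bounds: {1: ..., 2: ...} # present only for columns with truncate/full metrics
  upper_bounds: {1: ..., 2: ...}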
Thanks for the kind explanation..!
Because the worker captures the stats during the write process, these need to be passed to the coordinator in the control message.
In the procedure above, is there no method to filter out unnecessary fields before ingesting them into the control topic..?
You set those write properties to filter out the columns whose stats you don't care about. Then those will not be captured and not included in the control message.
Sure..!
I guess not all of the metadata related to column statistics needs to be in the control-topic record.. or am I wrong?
Only the metadata configured to be captured is put in the control message. This is why I am recommending you set those write properties, as that will reduce the size of the control message.
Sure..! (But I think the connector could write only the fields it needs for its own logic to the control topic, even if there is a lot of large, unnecessary metadata on the original message being sunk.)
Is that full metadata needed in the control topic because it is necessary for the commit by the coordinator (i.e., consuming all the records on the control topic and loading them into the Iceberg table with a single commit)? Is that right..?
Then I finally understand..! Thanks..!
Only the column stats that are configured to be saved are captured and put in the control message. Stats for other columns are not captured and are not in the control message. Thus if you reduce the number of columns with stats, you will reduce the size of the control message.
By the way, this will also improve query planning performance, as the Iceberg metadata will be smaller as well, so it is a good idea to do this for that reason also.
Thanks for the advice..!!
@bryanck
Could you give me some additional advice on selecting the proper statistics option between counts, truncate, or full for each column? (What does full mean..?)
In the official docs for Iceberg configuration, I didn't find any detailed explanation of each option.
counts will only store count stats for a column, such as null counts. full will also store the lower/upper bounds for the column, each bound being the full column value. truncate likewise will store the lower/upper bounds, but only the first n bytes of the value for the bounds.
You may want to open a ticket in the Iceberg repo so that can be better documented.
In terms of advice, if a column in a data file will contain a wide range of values, or if the column is not used in filters, then having the boundaries probably isn't going to be useful, as it won't help prune the file scan list. For truncate, try to pick a number of bytes that is as small as possible but still distinct enough to be useful for filtering.
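Putting that advice together, a minimal sketch of per-column settings (the column names are hypothetical placeholders):
'write.metadata.metrics.column.user_id' = 'counts' # wide-ranging, rarely filtered: counts only
'write.metadata.metrics.column.event_ts' = 'full' # small, heavily filtered value: full bounds
'write.metadata.metrics.column.request_uri' = 'truncate(16)' # long string: keep only the first 16 bytes of each bound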
Thanks for detailed answer!! :)
I got the error below on the Iceberg sink connector..!
It's a producer-side error related to the max.request.size option, so I guess it is related to producing the metadata record to the control topic by this sink connector. How can I configure the max.request.size option for the producer of this sink connector..? (somewhat weird..) Could it be overridden with the producer.override.max.request.size option, as with a usual producer?