Open stevenzwu opened 1 year ago
Created a new project as this is a relatively large scope overall: https://github.com/apache/iceberg/projects/27
Great design! I think we can keep adding new issues so that contributors can pick the tasks they want to work on.
@stevenzwu I was wondering about the status of this project. We have run into performance issues with the default HASH distribution mode. This project looks promising, and I saw that some progress has been made on various related tasks.
@bendevera Range distribution has been added to the main branch and will be part of the next 1.7 release. You can also see the doc here: https://iceberg.apache.org/docs/nightly/flink-writes/#range-distribution-experimental
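For readers new to the term, here is a minimal sketch of the general idea behind range distribution: split the sort-key space into contiguous ranges and send each range to one writer. This is not Iceberg's implementation; the function names, the sampling approach, and the integer keys are assumptions made purely for illustration.

```python
from bisect import bisect_right
from typing import List


def range_boundaries(sample_keys: List[int], num_writers: int) -> List[int]:
    """Pick num_writers - 1 split points so each range covers a similar share of the sample."""
    ordered = sorted(sample_keys)
    return [ordered[(i * len(ordered)) // num_writers] for i in range(1, num_writers)]


def range_writer_for(key: int, boundaries: List[int]) -> int:
    """Return the index of the contiguous key range (and so the writer) that owns this key."""
    return bisect_right(boundaries, key)


# Hypothetical sample of sort-key values observed in the stream.
sample = list(range(100))
boundaries = range_boundaries(sample, num_writers=4)

# Count how many sampled keys each writer would receive.
per_writer = [0, 0, 0, 0]
for k in sample:
    per_writer[range_writer_for(k, boundaries)] += 1
```

Because each writer owns one contiguous slice of the key space, rows with nearby sort keys end up in the same files. A single very hot key still lands on one writer in this simple scheme.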
Feature Request / Improvement
Today, the Flink Iceberg sink only supports simple keyBy hash distribution on partition columns. In practice, a keyBy shuffle on partition values doesn't work very well.
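One plausible reason, assumed here rather than spelled out in this issue, is that hashing a handful of partition values across many writers leaves some writers idle and lets a hot partition overload one writer. The sketch below simulates that with a stable CRC32 hash as a stand-in for keyBy; the workload, the function names, and the hash choice are all hypothetical and do not reflect Flink's actual key-group hashing.

```python
import zlib
from collections import Counter
from typing import List


def hash_writer_for(partition_value: str, num_writers: int) -> int:
    """Map a partition value to a writer subtask with a stable hash (stand-in for keyBy)."""
    return zlib.crc32(partition_value.encode("utf-8")) % num_writers


def rows_per_writer(partition_values: List[str], num_writers: int) -> List[int]:
    """Count how many rows each writer subtask receives."""
    counts = Counter(hash_writer_for(p, num_writers) for p in partition_values)
    return [counts.get(i, 0) for i in range(num_writers)]


# Hypothetical workload: 4 hourly partitions with skewed volume, 8 writer subtasks.
rows = (
    ["2024-01-01-00"] * 700
    + ["2024-01-01-01"] * 200
    + ["2024-01-01-02"] * 80
    + ["2024-01-01-03"] * 20
)
load = rows_per_writer(rows, num_writers=8)
busy_writers = sum(1 for n in load if n > 0)
print("rows per writer:", load)
print("busy writers:", busy_writers)
```

With only four distinct partition values, at most four of the eight writers can receive data, and whichever writer owns the hottest partition handles at least 700 of the 1000 rows.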
We can make the following shuffling enhancements in the Flink streaming writer. More details can be found in the design doc. This is an uber issue for tracking purposes. Here are the rough phases.
This is a case when `write.distribution-mode=hash` and there is a bucketing partition column. Other partition columns (like hourly partition) will be ignored regarding shuffling. The assumption is that bucketing column is where we want to distribute/cluster the rows.
This is a case when `write.distribution-mode=hash` and there is NO bucketing partition column.
This is a case when `write.distribution-mode=range` and `SortOrder` is defined for non-partition columns. Partition columns will be ignored for range shuffling as the assumption is that non-partition sort columns are what matter here.
Query engine
Flink
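The three cases described in the phases above can be summarized as a small decision function. This is only a sketch of the case selection as worded in this issue, not code from Iceberg: the `TableLayout` class, the `choose_shuffle_case` function, and the returned labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TableLayout:
    """Hypothetical, simplified view of the table settings that matter for shuffling."""
    distribution_mode: str  # value of write.distribution-mode
    # (source column, transform) pairs, e.g. ("user_id", "bucket[16]")
    partition_fields: List[Tuple[str, str]] = field(default_factory=list)
    sort_columns: List[str] = field(default_factory=list)  # columns named in the SortOrder


def choose_shuffle_case(layout: TableLayout) -> str:
    """Return which of the three cases from the issue applies, if any."""
    partition_columns = {column for column, _ in layout.partition_fields}
    has_bucket = any(t.startswith("bucket") for _, t in layout.partition_fields)

    if layout.distribution_mode == "hash":
        # Other partition columns (such as an hourly one) are ignored when a bucket exists.
        if has_bucket:
            return "hash with bucketing column"
        return "hash without bucketing column"

    if layout.distribution_mode == "range":
        # Only sort columns that are not partition columns count for range shuffling.
        if any(c not in partition_columns for c in layout.sort_columns):
            return "range on non-partition sort columns"

    return "not covered by these phases"


bucketed = TableLayout("hash", [("event_time", "hour"), ("user_id", "bucket[16]")])
hourly_only = TableLayout("hash", [("event_time", "hour")])
sorted_table = TableLayout("range", [("event_time", "hour")], ["user_id"])

for layout in (bucketed, hourly_only, sorted_table):
    print(layout.distribution_mode, "->", choose_shuffle_case(layout))
```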