Open ziudu opened 3 weeks ago
The simple index doesn't have a similar problem. I think the root cause might be how the bucket index is implemented.
Thanks for the feedback, @KnightChess and @beyond1920. Do you have some interest in investigating the culprit?
It seems that the bigger the table, the worse the data skew. I noticed this issue when joining two tables and writing to a result table:
@ziudu "hoodie.bucket.index.hash.field": "trans_code" looks like it could be an enumeration value. Can you describe the contents of this field?
trans_code is a randomly generated UUID string, e.g. cb1d7307e4e047989955ca544e175c71. Table tb_transaction_detail has 96,500,000 records with 96,500,000 distinct trans_code values, i.e. each record has a unique trans_code.
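For context, uniformly random UUID keys should spread almost evenly across buckets if the bucket id is a hash of the key modulo the bucket count. A scaled-down sketch (the hash function and key format here are stand-ins, not Hudi's actual implementation):

```python
import hashlib

NUM_BUCKETS = 800    # matches the 800 file groups in the resulting table
NUM_KEYS = 100_000   # scaled-down stand-in for the 96.5M distinct trans_code values

def bucket_of(key: str, num_buckets: int) -> int:
    # Stand-in bucket id: hash of the index key modulo the bucket count.
    # The exact hash Hudi uses may differ; this only illustrates uniformity.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h % num_buckets

counts = [0] * NUM_BUCKETS
for i in range(NUM_KEYS):
    counts[bucket_of(f"{i:032x}", NUM_BUCKETS)] += 1

print("min/bucket:", min(counts), "max/bucket:", max(counts))
```

Every bucket receives data and the counts stay close to the mean, so the skew is unlikely to come from the key values themselves.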
We found a workaround: with "hoodie.index.bucket.engine": "CONSISTENT_HASHING" there is no more data skew or spill during the write stage. Consistent hashing is slower, though: with the simple bucket index it took 17-18 minutes to join and write (even with the data skew), while with consistent hashing it took 24 minutes to join and write the same data.
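For anyone wanting to try the same workaround, the relevant write options look roughly like this. Only the bucket-engine and hash-field settings come from this thread; the remaining entries are illustrative assumptions for our table:

```python
# Hudi write options for the consistent-hashing workaround; only the
# bucket-engine and hash-field settings come from this thread, the
# other entries are illustrative assumptions.
hudi_options = {
    "hoodie.index.type": "BUCKET",
    "hoodie.index.bucket.engine": "CONSISTENT_HASHING",  # avoids the skew, ~30% slower here
    "hoodie.bucket.index.hash.field": "trans_code",
    "hoodie.datasource.write.operation": "bulk_insert",
}
```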
The parallelism during the write stage after the join is 320 (spark.sql.shuffle.partitions) for the simple bucket index, and 800 (the number of parquet files in the resulting table) for the consistent hashing bucket index.
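One hypothesis for how 576 of 800 write tasks can end up empty (a sketch of a possible mechanism only, not a claim about Hudi's actual code): if records were routed to write tasks by re-hashing the bucket id rather than using the bucket id directly as the partition number, hash collisions would leave many tasks empty while piling several buckets' worth of data onto others:

```python
import hashlib

NUM_BUCKETS = 800   # bucket / file-group count from the report
NUM_TASKS = 800     # write-stage tasks observed

def route(bucket_id: int) -> int:
    # Hypothetical routing: re-hashing the bucket id instead of using it
    # directly as the partition number (an assumption, not Hudi's code).
    h = int.from_bytes(hashlib.md5(str(bucket_id).encode()).digest()[:4], "big")
    return h % NUM_TASKS

load = [0] * NUM_TASKS
for b in range(NUM_BUCKETS):
    load[route(b)] += 1

print(load.count(0), "empty tasks; heaviest task holds", max(load), "buckets")
```

Routing by the bucket id itself (task = bucket_id) would instead give every task exactly one bucket, which matches the even 800-way distribution we see with the consistent hashing engine.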
@ziudu yes, you are right: the data skew happens when writing with bulk_insert and the simple bucket index enabled. I will submit a PR to fix it.
Describe the problem you faced
When I read a table (e.g. tb_transaction_detail) and write to another hudi table (e.g. tb_transaction_detail_bucket_index) with bulk insert and bucket index enabled, I noticed data skew during the stage "save at DatasetBulkInsertCommitActionExecutor.java:81".
tb_transaction_detail has 160 partitions, with 5 files in each partition. Data is evenly distributed, so each file is about 16.7 MB in size. The total size is about 13 GB.
When I read table tb_transaction_detail and write it to another hudi table tb_transaction_detail_bucket_index with bulk_insert and bucket index enabled, I noticed:
The stage "save at DatasetBulkInsertCommitActionExecutor.java:81" has 800 tasks. This is expected, as the input table has 800 files (deduced parallelism = 800).
![pic1](https://github.com/apache/hudi/assets/87431810/8739a6e3-9a16-42ac-af66-d46db6ff75a5)
Among those 800 tasks, only 224 tasks have data to process, while 576 tasks have nothing to do.
Note: some tasks have only 20 MB of data to process.
If I sort the tasks by "duration", I can see that some tasks have 10 times more data to process (200 MB+):
![pic4](https://github.com/apache/hudi/assets/87431810/cee0e710-0ab2-472e-8ec7-00710d95f2c6)
However, data in the resulting table "tb_transaction_detail_bucket_index" is evenly distributed, with 160 partitions, each partition has 5 files, each file is about 16.67MB.
Is it normal to have skewed data during the write stage when the bucket index is enabled? I expect all 800 tasks to have some data to process. Also, some of the tasks with the largest "shuffle read size" could spill.
To Reproduce
Steps to reproduce the behavior:
```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true") \
        .config("spark.debug.maxToStringFields", "100") \
        .enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel(logLevel="INFO")
```
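The snippet above only sets up the session; the read-and-write step it feeds into would look roughly like the following. This is a sketch: the table names come from the report, while the paths, partition column, bucket count, and save mode are assumptions.

```python
# Sketch of the bulk_insert write described in this issue; assumes `spark`
# from the session above. Paths, the partition column, and the bucket
# count are assumptions, not values from the original report.
df = spark.read.format("hudi").load("/path/to/tb_transaction_detail")

(df.write.format("hudi")
    .option("hoodie.table.name", "tb_transaction_detail_bucket_index")
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.index.type", "BUCKET")
    .option("hoodie.index.bucket.engine", "SIMPLE")
    .option("hoodie.bucket.index.hash.field", "trans_code")
    .option("hoodie.bucket.index.num.buckets", "5")  # 5 file groups per partition
    .option("hoodie.datasource.write.partitionpath.field", "partition_col")  # placeholder name
    .mode("append")
    .save("/path/to/tb_transaction_detail_bucket_index"))
```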
Environment: spark.sql.shuffle.partitions = 320
Expected behavior
All tasks during the stage "save at DatasetBulkInsertCommitActionExecutor.java:81" should have some data to process.
Environment Description