darlatrade opened this issue 10 months ago
@darlatrade As I see it, the time is going into "Doing partition and writing data". That probably means your incremental batch is touching a lot of file groups, so a lot of parquet files had to be rewritten since this is a COW table. Can you check on the Spark UI how much data got written in this stage?
@ad1happy2go Thanks for the reply
Here is the stage detail. Not sure where to look for the exact size.
I see in .hoodie that it's inserting 700KB.
You can try to open this commit file and see how many file groups are being updated as part of this commit. How many partitions do you have in your table?
The commit file has 16745 lines. The table has month-level partitions, and the last commit touched almost a full year (12) of partitions. We are maintaining 3 years (36 partitions, 12 per year). It looks like about 700 file groups were updated (found in the "fileIdAndRelativePaths" section).
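For readers following along, counting the touched file groups from the commit file can be scripted. This is a minimal sketch that assumes the `.commit` file under `.hoodie` is JSON and contains a `fileIdAndRelativePaths` map (as mentioned above); key names can differ across Hudi versions.

```python
import json

def count_updated_file_groups(commit_path: str) -> int:
    """Count distinct file groups touched by one commit.

    Assumes the commit file is JSON with a "fileIdAndRelativePaths"
    map of file group id -> relative file path, as seen in this
    thread. This is an illustrative helper, not a Hudi API.
    """
    with open(commit_path) as f:
        meta = json.load(f)
    return len(meta.get("fileIdAndRelativePaths", {}))
```

Running this against the commit described above would report roughly the 700 file groups found by inspection.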
@darlatrade Just to rule anything out: is it easy for you to try this table on the 0.14 version in a test/staging environment?
Do you have the Spark Stages UI screenshot? From there we can see how the input amplifies across the different stages.
Thanks for the quick reply @vinothchandar. We completed most of our testing on 0.10.1 and may not be able to upgrade soon, but I can at least try 0.14 for this table if that helps.
Here is the stages screenshot.
Hey @darlatrade, can you help with some more info?
If this matches your workload, and if you prefer faster write times, maybe you can try a MOR table.
"hoodie.metadata.index.bloom.filter.enable": "true",
"hoodie.metadata.index.bloom.filter.parallelism": 100,
"hoodie.metadata.index.bloom.filter.column.list": "id",
"hoodie.bloom.index.use.metadata": "true",
"hoodie.metadata.index.column.stats.enable": "true",
"hoodie.metadata.index.column.stats.column.list": "col1,col2,col3",
"hoodie.enable.data.skipping": "true"
If my understanding of your pipeline/workload is wrong, let's sync up in the Hudi OSS workspace and see what's going on.
@nsivabalan
Yes, it's a COW table.
Yes, your calculation on file groups is correct. We can consider MOR for a future upgrade, but we may not be able to switch right now.
Sure, I will remove the configs and run.
Got it. May I know what your record key comprises? I see it as "id", but is it a random id, or does it refer to some timestamp-based key? If the values are timestamp-based, we could trigger clustering on the record key, so there is a chance your updates get confined to fewer file groups per partition (but a large percentage of records within each file group) instead of updating a very small percentage of records across a large number of file groups.
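The clustering idea above can be expressed as configs. This is an illustrative sketch only (values are examples, not tuned recommendations), assuming inline clustering is acceptable for the pipeline:

```python
# Illustrative clustering configs: sorting the data by the record key
# keeps records with similar keys in the same file groups, so an
# incremental write touches fewer files. Values here are examples.
clustering_options = {
    "hoodie.clustering.inline": "true",              # run clustering as part of the write
    "hoodie.clustering.inline.max.commits": "4",     # trigger clustering every 4 commits
    "hoodie.clustering.plan.strategy.sort.columns": "id",  # sort by the record key
}
```

These would be merged into the existing writer options; whether inline or async clustering fits better depends on the write latency budget.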
Here is how "id" is derived.
from pyspark.sql.functions import concat, lit, md5

df.withColumn("id", concat("evnt_centtz", lit(""), md5(concat("key_col1", "key_col2", "key_col3", "evnt_cent_tz"))))
Sample values from table:
@nsivabalan Any inputs on this?
Yeah. As I suggested before, you may want to try our MOR table, and try using the SIMPLE index. In 0.10.1, Hudi uses the bloom index, and for random keys it might incur some unnecessary overhead.
And yes, upgrading to 0.14.0 you can leverage RLI, which should definitely improve your index lookup and write latencies.
What are the Hadoop configs to be considered to load 500GB of data in monthly partitions for RLI on 0.14?
With Apache Hudi 0.14 you can leverage RLI for faster upserts:
'hoodie.metadata.record.index.enable': 'true',
'hoodie.index.type':'RECORD_INDEX'
Sample code can be found at https://soumilshah1995.blogspot.com/2023/10/apache-hudi-014-announces.html
I am trying to initialize a new table with RLI. I need to load the history first, which has 3,210,407,531 records and 520GB of data.
The Spark context shuts down when loading this much data. Also, the number of objects is huge, as in the screenshot below.
Hoodie config:
"className": "org.apache.hudi",
"hoodie.table.name": tgt_tbl,
"hoodie.datasource.write.recordkey.field": "id",
"hoodie.datasource.write.precombine.field": "eff_fm_cent_tz",
"hoodie.datasource.write.operation": "upsert",
"hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.partitionpath.field": "year,month",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.assume_date_partitioning": "false",
"hoodie.datasource.hive_sync.table": tgt_tbl,
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.mode": "hms",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.bulkinsert.shuffle.parallelism": 700,
"hoodie.index.type": "RECORD_INDEX",
"hoodie.metadata.record.index.enable": "true",
"hoodie.metadata.enable": "true"
Also using hudi_operation = "bulk_insert" and hudi_write_mode = "overwrite".
Error:
Can you suggest what parameters need to be used to load this data? I need to load the history first before starting the deltas.
@darlatrade You need to increase hoodie.metadata.record.index.min.filegroup.count to a higher number, depending on the amount of data you have. Let us know if it helps. Thanks.
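One rough way to pick a value for that config is to divide the total record count by a target number of index entries per file group. The target below is an assumption for illustration, not a Hudi default:

```python
def estimate_record_index_filegroups(total_records: int,
                                     records_per_filegroup: int = 10_000_000) -> int:
    """Back-of-envelope sizing for hoodie.metadata.record.index.min.filegroup.count.

    records_per_filegroup is an assumed target (not a Hudi default):
    it bounds how many key -> location entries land in one index file
    group so that lookups stay fast.
    """
    # Ceiling division: one extra file group covers any remainder.
    return -(-total_records // records_per_filegroup)

# For the ~3.2B records mentioned above:
# estimate_record_index_filegroups(3_210_407_531) -> 322
```

The right target per file group depends on cluster memory and lookup latency requirements, so treat the result as a starting point to tune from.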
@darlatrade Did the suggestion work? Do you need any other help here?
RECORD_INDEX is not working with bulk_insert. How do we handle loading the initial history? It's taking forever to load. Any recommendations for loading 500GB as the initial load?
Your suggestion is a good point. Given the volume of historical data, an asynchronous indexing approach could significantly reduce the time required. What are your thoughts on incorporating asynchronous indexing to streamline the process?
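For anyone trying the async-indexing route, a hedged sketch of the relevant switch (assuming a 0.14-era release; the index itself is then built off the write path, e.g. with the HoodieIndexer utility, rather than inline during the bulk load):

```python
# Illustrative: keep the heavy bulk load light by deferring metadata
# index building, then build the record index asynchronously afterwards.
bulk_load_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.async": "true",  # build indexes off the write path
}
```

Whether this fits depends on having a window after the initial load in which the index can be built before the delta upserts start.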
maybe I should open a separate thread for discussing this
Yes that works
Upsert is very slow: it takes 10 to 15 minutes to load 678.0 KB into a Hudi COW table. I'm not sure where the time is being spent. Can someone please help me find the issue?
Environment Description
Hudi version : 0.10.1
Spark version : Spark 3.1.2
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Here is the Spark UI
Configs used:
"className": "org.apache.hudi",
"hoodie.table.name": tgt_tbl,
"hoodie.datasource.write.recordkey.field": "id",
"hoodie.datasource.write.precombine.field": "evnt_cent_tz",
"hoodie.datasource.write.operation": "upsert",
"hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.partitionpath.field": "year,month",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.assume_date_partitioning": "false",
"hoodie.datasource.hive_sync.table": tgt_tbl,
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.mode": "hms",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.upsert.shuffle.parallelism": 50,
"hoodie.delete.shuffle.parallelism": 50,
"hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
"hoodie.index.type": "BLOOM",
"hoodie.metadata.enable": "true",
"hoodie.metadata.index.bloom.filter.enable": "true",
"hoodie.metadata.index.bloom.filter.parallelism": 100,
"hoodie.metadata.index.bloom.filter.column.list": "id",
"hoodie.bloom.index.use.metadata": "true",
"hoodie.metadata.index.column.stats.enable": "true",
"hoodie.metadata.index.column.stats.column.list": "col1,col2,col3",
"hoodie.enable.data.skipping": "true"
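For context, the write path under discussion looks roughly like the sketch below. The table path and the trimmed-down options dict are illustrative stand-ins for the full configs above, not the exact production code:

```python
def write_incremental(df, hudi_options: dict, table_path: str):
    """Sketch of the upsert call under discussion (requires a pyspark
    DataFrame as `df`; `table_path` is a hypothetical S3 location)."""
    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")  # append mode performs the upsert into the existing COW table
        .save(table_path))

# A minimal subset of the configs above, as a plain dict:
hudi_options = {
    "hoodie.table.name": "tgt_tbl",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.partitionpath.field": "year,month",
}
```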