apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hudi RECORD_INDEX is too slow in the "Building workload profile" stage; why is HoodieGlobalSimpleIndex used? #10235

Open zyclove opened 7 months ago

zyclove commented 7 months ago

Describe the problem you faced

The Spark job is too slow in the following stage. Adjusting CPU, memory, and concurrency has no effect. Which stage can be optimized or skipped?

[screenshot]

Is this normal? Why is HoodieGlobalSimpleIndex still being used? [screenshot]

To Reproduce

Steps to reproduce the behavior:

  1. table config
    
    CREATE  TABLE if NOT EXISTS bi_dw_real.smart_datapoint_report_rw_clear_rt(
    id STRING COMMENT 'id',
    uuid STRING COMMENT 'log uuid',
    data_id STRING COMMENT '',
    dev_id STRING COMMENT '',
    gw_id STRING COMMENT '',
    product_id STRING COMMENT '',
    uid STRING COMMENT '',
    dp_code STRING COMMENT '',
    dp_id STRING COMMENT '',
    dp_mode STRING COMMENT '',
    dp_name STRING COMMENT '',
    dp_time STRING COMMENT '',
    dp_type STRING COMMENT '',
    dp_value STRING COMMENT '',
    gmt_modified BIGINT COMMENT 'ct time',
    dt STRING COMMENT 'time partition field'
    )
    using hudi 
    PARTITIONED BY (dt,dp_mode)
    COMMENT ''
    location '${bi_db_dir}/bi_ods_real/ods_smart_datapoint_report_rw_clear_rt'
    tblproperties (
    type = 'mor',
    primaryKey = 'id',
    preCombineField = 'gmt_modified',
    hoodie.combine.before.upsert='false',
    hoodie.metadata.record.index.enable='true',
    hoodie.datasource.write.operation='upsert',
    hoodie.metadata.table='true',
    hoodie.datasource.write.hive_style_partitioning='true',
    hoodie.metadata.record.index.min.filegroup.count ='512',
    hoodie.index.type='RECORD_INDEX',
    hoodie.compact.inline='false',
    hoodie.common.spillable.diskmap.type='ROCKS_DB',
    hoodie.datasource.write.partitionpath.field='dt,dp_mode',
    hoodie.compaction.payload.class='org.apache.hudi.common.model.PartialUpdateAvroPayload'
    )
    ;

  2. session config

    set hoodie.write.lock.zookeeper.lock_key=bi_ods_real.smart_datapoint_report_rw_clear_rt;
    set hoodie.storage.layout.type=DEFAULT;
    set hoodie.metadata.record.index.enable=true;
    set hoodie.metadata.table=true;
    set hoodie.populate.meta.fields=false;
    set hoodie.parquet.compression.codec=snappy;
    set hoodie.memory.merge.max.size=2004857600000;
    set hoodie.write.buffer.limit.bytes=419430400;
    set hoodie.index.type=RECORD_INDEX;


  3. insert into bi_dw_real.smart_datapoint_report_rw_clear_rt

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version: 0.14.0

* Spark version: 3.2.1

* Hive version: 3.1.3

* Hadoop version: 3.2.2

* Storage (HDFS/S3/GCS..): S3

* Running on Docker? (yes/no): no

zyclove commented 7 months ago

@danny0405 Why does it fall back to GLOBAL_SIMPLE?

[screenshot]

23/12/04 14:39:29 WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records

danny0405 commented 7 months ago

hoodie.metadata.table -> hoodie.metadata.enable
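
In other words, the metadata table (and with it the record index) is enabled by hoodie.metadata.enable; hoodie.metadata.table is not the key that turns it on. A minimal sketch of the corrected session settings, keeping the other set statements from the reproduction unchanged:

    -- enable the metadata table itself (this is what was missing)
    set hoodie.metadata.enable=true;
    -- build the record-level index inside the metadata table
    set hoodie.metadata.record.index.enable=true;
    -- tag incoming records using that index instead of a global simple index
    set hoodie.index.type=RECORD_INDEX;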

zyclove commented 7 months ago

@danny0405 With set hoodie.metadata.enable=true, it now uses RECORD_INDEX. But the following stage is still very slow.

[screenshot]

[screenshots]

zyclove commented 7 months ago

In SparkMetadataTableRecordIndex:

    fileGroupSize = hoodieTable.getMetadataTable().getNumFileGroupsForPartition(MetadataPartitionType.RECORD_INDEX);

Why is fileGroupSize not 512? Apart from adjusting the number of buckets in the upstream source table, is there any other way to tune it?
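
A hedged note (not confirmed in this thread): the number of file groups in the metadata table's RECORD_INDEX partition is fixed when that partition is first initialized, and getNumFileGroupsForPartition only reads back that stored value. If the index was bootstrapped before hoodie.metadata.record.index.min.filegroup.count='512' took effect, the smaller count sticks until the record index partition is rebuilt. A minimal sketch of a session where the setting is applied before the index is first built:

    -- sketch: these must be in effect before the RECORD_INDEX partition is initialized;
    -- changing them afterwards does not resize an already-built index
    set hoodie.metadata.enable=true;
    set hoodie.metadata.record.index.enable=true;
    set hoodie.metadata.record.index.min.filegroup.count=512;
    set hoodie.index.type=RECORD_INDEX;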

[screenshot]

[screenshot]