apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hudi RECORD_INDEX is too slow in the "Building workload profile" stage; why is HoodieGlobalSimpleIndex used? #10235

Open zyclove opened 7 months ago

zyclove commented 7 months ago

Describe the problem you faced

The Spark job is too slow in the following stage. Adjusting CPU, memory, and concurrency has no effect. Which stage can be optimized or skipped?

[screenshot]

Is this normal? Why is HoodieGlobalSimpleIndex still being used? [screenshot]

To Reproduce

Steps to reproduce the behavior:

  1. table config
    
    CREATE  TABLE if NOT EXISTS bi_dw_real.smart_datapoint_report_rw_clear_rt(
    id STRING COMMENT 'id',
    uuid STRING COMMENT 'log uuid',
    data_id STRING COMMENT '',
    dev_id STRING COMMENT '',
    gw_id STRING COMMENT '',
    product_id STRING COMMENT '',
    uid STRING COMMENT '',
    dp_code STRING COMMENT '',
    dp_id STRING COMMENT '',
    dp_mode STRING COMMENT '',
    dp_name STRING COMMENT '',
    dp_time STRING COMMENT '',
    dp_type STRING COMMENT '',
    dp_value STRING COMMENT '',
    gmt_modified BIGINT COMMENT 'ct time',
    dt STRING COMMENT 'time partition field'
    )
    using hudi 
    PARTITIONED BY (dt,dp_mode)
    COMMENT ''
    location '${bi_db_dir}/bi_ods_real/ods_smart_datapoint_report_rw_clear_rt'
    tblproperties (
    type = 'mor',
    primaryKey = 'id',
    preCombineField = 'gmt_modified',
    hoodie.combine.before.upsert='false',
    hoodie.metadata.record.index.enable='true',
    hoodie.datasource.write.operation='upsert',
    hoodie.metadata.table='true',
    hoodie.datasource.write.hive_style_partitioning='true',
    hoodie.metadata.record.index.min.filegroup.count ='512',
    hoodie.index.type='RECORD_INDEX',
    hoodie.compact.inline='false',
    hoodie.common.spillable.diskmap.type='ROCKS_DB',
    hoodie.datasource.write.partitionpath.field='dt,dp_mode',
    hoodie.compaction.payload.class='org.apache.hudi.common.model.PartialUpdateAvroPayload'
    )
    ;

  2. session config

    set hoodie.write.lock.zookeeper.lock_key=bi_ods_real.smart_datapoint_report_rw_clear_rt;
    set hoodie.storage.layout.type=DEFAULT;
    set hoodie.metadata.record.index.enable=true;
    set hoodie.metadata.table=true;
    set hoodie.populate.meta.fields=false;
    set hoodie.parquet.compression.codec=snappy;
    set hoodie.memory.merge.max.size=2004857600000;
    set hoodie.write.buffer.limit.bytes=419430400;
    set hoodie.index.type=RECORD_INDEX;


  3. insert into bi_dw_real.smart_datapoint_report_rw_clear_rt

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version: 0.14.0

* Spark version: 3.2.1

* Hive version: 3.1.3

* Hadoop version: 3.2.2

* Storage (HDFS/S3/GCS..): S3

* Running on Docker? (yes/no): no

zyclove commented 7 months ago

@danny0405 Why does it fall back to GLOBAL_SIMPLE?

[screenshot]

23/12/04 14:39:29 WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records

danny0405 commented 7 months ago

hoodie.metadata.table -> hoodie.metadata.enable
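
In other words, the metadata table (and with it the record index) is enabled by hoodie.metadata.enable; hoodie.metadata.table is not the key that turns it on. A minimal sketch of the corrected session settings, keeping the other set statements from the reproduction unchanged:

    -- enable the metadata table itself (this is what was missing)
    set hoodie.metadata.enable=true;
    -- build the record-level index inside the metadata table
    set hoodie.metadata.record.index.enable=true;
    -- tag incoming records using that index instead of a global simple index
    set hoodie.index.type=RECORD_INDEX;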

zyclove commented 7 months ago

@danny0405 With set hoodie.metadata.enable=true, it now uses RECORD_INDEX. But the following stage is still very slow.

[screenshot]

[screenshots]

zyclove commented 7 months ago

In SparkMetadataTableRecordIndex:

    fileGroupSize = hoodieTable.getMetadataTable().getNumFileGroupsForPartition(MetadataPartitionType.RECORD_INDEX);

Why is fileGroupSize not 512? Apart from adjusting the number of buckets in the upstream source table, is there any other way to tune it?
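
A hedged note (not confirmed in this thread): the number of file groups in the metadata table's RECORD_INDEX partition is fixed when that partition is first initialized, and getNumFileGroupsForPartition only reads back that stored value. If the index was bootstrapped before hoodie.metadata.record.index.min.filegroup.count='512' took effect, the smaller count sticks until the record index partition is rebuilt. A minimal sketch of a session where the setting is applied before the index is first built:

    -- sketch: these must be in effect before the RECORD_INDEX partition is initialized;
    -- changing them afterwards does not resize an already-built index
    set hoodie.metadata.enable=true;
    set hoodie.metadata.record.index.enable=true;
    set hoodie.metadata.record.index.min.filegroup.count=512;
    set hoodie.index.type=RECORD_INDEX;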

[screenshot]

[screenshot]