apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0
5.32k stars 2.41k forks

[SUPPORT] Schema Evolution in MR Reading of Hudi Causes Metadata List Request for Each Split #11723

Open muyihao opened 1 month ago

muyihao commented 1 month ago

Describe the problem you faced

HUDI-5000 introduced schema evolution for Hive reads of Hudi tables, which causes HoodieParquetInputFormat to create a metaClient for each split. Constructing a metaClient requires listing the Hudi table metadata directory, which puts significant pressure on the HDFS NameNode when there are a large number of splits.

If the user is certain that there will be no schema changes, this overhead is unnecessary. Although the current implementation supports controlling schema evolution via the hudi.hive.schema.evolution parameter, it does not skip schema evolution when this parameter is explicitly set to false. Consequently, the metaClient is still created, and the metadata directory is listed.

I propose that when hudi.hive.schema.evolution is explicitly set to false, the construction of the metaClient should be skipped. This way, the doEvolutionForRealtimeInputFormat method would return immediately if the internalSchemaOption is empty, avoiding unnecessary metadata directory listing.
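The proposed short-circuit can be sketched roughly as follows. This is a minimal illustration, not Hudi's code: `EvolutionGateSketch`, `resolveInternalSchema`, and the `timelineLookups` counter are hypothetical names, a plain `Map` stands in for the Hadoop `JobConf`, a `String` stands in for `InternalSchema`, and the `Supplier` stands in for whatever call ends up listing the metadata directory.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for the per-split schema evolution setup; not Hudi's actual class.
final class EvolutionGateSketch {
    // Config key as named in this issue.
    static final String SCHEMA_EVOLUTION_KEY = "hudi.hive.schema.evolution";

    // Counts how often the expensive lookup runs (a proxy for NameNode list requests).
    static final AtomicInteger timelineLookups = new AtomicInteger();

    // Returns an empty Optional without running the lookup when the flag is
    // explicitly "false"; otherwise delegates to the expensive lookup.
    static Optional<String> resolveInternalSchema(Map<String, String> conf,
                                                  Supplier<Optional<String>> timelineLookup) {
        String flag = conf.get(SCHEMA_EVOLUTION_KEY);
        if (flag != null && flag.trim().equalsIgnoreCase("false")) {
            return Optional.empty(); // explicit opt-out: skip metaClient and timeline work
        }
        timelineLookups.incrementAndGet();
        return timelineLookup.get();
    }
}
```

With the flag explicitly set to false, the lookup never runs, so a caller such as doEvolutionForRealtimeInputFormat would see an empty schema option and could return immediately. An unset flag keeps the existing behavior in this sketch.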

I am looking for suggestions or best practices on how to optimize this process. Specifically, I would like to know if there are any strategies or configurations that can reduce the number of list requests initiated during split reads while schema evolution is enabled.

danny0405 commented 1 month ago

The metaClient only lists metadata files when #getActiveTimeline is called, which is triggered by #getLatestCommitMetadataWithValidSchema, so I think your analysis is reasonable. Maybe we can just pass the Option<InternalSchema> to the constructor of SchemaEvolutionContext. Currently one reader handles one HoodieRealtimeFileSplit, so maybe we can change it to share a singleton Option<InternalSchema> for each reader.
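One way to picture the shared-schema idea is a small memoizing holder, sketched below under stated assumptions. `SharedSchemaCacheSketch` and `schemaFor` are hypothetical names, a `String` stands in for `InternalSchema`, `java.util.Optional` stands in for Hudi's `Option`, and the loader function stands in for the timeline lookup. It only helps readers that run in the same JVM.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical holder that resolves the schema once per table and shares the result.
final class SharedSchemaCacheSketch {
    // Key: table base path. Value: resolved schema, possibly empty.
    private final ConcurrentMap<String, Optional<String>> byTablePath = new ConcurrentHashMap<>();
    // Stand-in for the expensive timeline lookup.
    private final Function<String, Optional<String>> loader;

    SharedSchemaCacheSketch(Function<String, Optional<String>> loader) {
        this.loader = loader;
    }

    // The loader runs at most once per table path; later calls reuse the cached Optional.
    Optional<String> schemaFor(String tableBasePath) {
        return byTablePath.computeIfAbsent(tableBasePath, loader);
    }
}
```

The resulting Optional could then be handed to each context's constructor instead of being recomputed per split.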

muyihao commented 1 month ago

Thank you for your reply. You are right: metadata files are only listed when #getActiveTimeline is called, and constructing the metaClient only reads the hoodie.properties file. Using a singleton Option<InternalSchema> is a great idea. Specifically, maybe we can fetch the InternalSchema during split computation and place it into the JobConf, so that each reader can obtain the schema directly from the JobConf.

muyihao commented 1 month ago

And maybe place InternalSchema into JobConf only when hudi.hive.schema.evolution is true.
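A rough sketch of that handoff, with the planning side publishing the schema once and each reader only reading configuration. Everything here is illustrative: `ConfSchemaHandoffSketch` and its methods are hypothetical, `INTERNAL_SCHEMA_KEY` is an invented key, a mutable `Map` stands in for the `JobConf`, and a serialized `String` stands in for `InternalSchema`.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical planner-to-reader handoff; not Hudi's actual API.
final class ConfSchemaHandoffSketch {
    // Config key as named in this issue.
    static final String SCHEMA_EVOLUTION_KEY = "hudi.hive.schema.evolution";
    // Invented key for this sketch only.
    static final String INTERNAL_SCHEMA_KEY = "sketch.hudi.internal.schema";

    // Planning side: intended to run once per job, during split computation.
    static void publishSchema(Map<String, String> conf, Supplier<Optional<String>> timelineLookup) {
        if (!Boolean.parseBoolean(conf.get(SCHEMA_EVOLUTION_KEY))) {
            return; // evolution not enabled: no lookup, nothing published
        }
        timelineLookup.get().ifPresent(schema -> conf.put(INTERNAL_SCHEMA_KEY, schema));
    }

    // Reader side: called once per split; reads only the configuration.
    static Optional<String> readSchema(Map<String, String> conf) {
        return Optional.ofNullable(conf.get(INTERNAL_SCHEMA_KEY));
    }
}
```

In this shape the number of timeline lookups no longer depends on the number of splits. One trade-off to weigh is that the serialized schema travels with the job configuration, so a very large schema increases its size.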

danny0405 commented 1 month ago

yeah, let's give it a try, can you fire a fix for it?

muyihao commented 1 month ago

sure, I'll do it.