muyihao opened this issue 1 month ago. Status: Open
The metaClient only lists metadata files when #getActiveTimeline is called, which is triggered by #getLatestCommitMetadataWithValidSchema, so I think your analysis is reasonable. Maybe we can just pass the Option<InternalSchema> to the constructor of SchemaEvolutionContext. Currently one reader handles one HoodieRealtimeFileSplit, so maybe we can change it to share a singleton Option<InternalSchema> across the readers.
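Roughly what I have in mind, as a sketch only: an overloaded SchemaEvolutionContext constructor that accepts a pre-resolved Option<InternalSchema>, so the per-split metaClient lookup can be skipped. The extra constructor and the resolveInternalSchema placeholder are hypothetical, not existing Hudi code.

```java
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.internal.schema.InternalSchema;

// Hypothetical sketch of SchemaEvolutionContext, not the current Hudi class.
public class SchemaEvolutionContext {
  private final InputSplit split;
  private final JobConf job;
  private final Option<InternalSchema> internalSchemaOption;

  // Existing style: the context resolves the schema itself (today this builds a metaClient).
  public SchemaEvolutionContext(InputSplit split, JobConf job) {
    this(split, job, resolveInternalSchema(job));
  }

  // Proposed: reuse a schema that was resolved once, e.g. while computing splits.
  public SchemaEvolutionContext(InputSplit split, JobConf job,
                                Option<InternalSchema> internalSchemaOption) {
    this.split = split;
    this.job = job;
    this.internalSchemaOption = internalSchemaOption;
  }

  // Placeholder for the current per-split timeline lookup.
  private static Option<InternalSchema> resolveInternalSchema(JobConf job) {
    return Option.empty();
  }
}
```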
Thank you for your reply. You are right: metadata files are only listed when calling #getActiveTimeline, and constructing the metaClient only reads the hoodie.properties file. Using a singleton Option<InternalSchema> is a great idea. Specifically, maybe we can fetch the InternalSchema during split computation and place it into the JobConf, so that each reader can obtain the schema directly from the JobConf.
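Something like the sketch below for the JobConf hand-off. The helper class, the property key, and the assumption that SerDeHelper's toJson/fromJson can round-trip an InternalSchema are all for illustration; if those signatures differ, any InternalSchema (de)serializer would do.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.internal.schema.InternalSchema;
import org.apache.hudi.internal.schema.utils.SerDeHelper;

// Hypothetical helper: resolve the InternalSchema once during split computation and stash it
// in the JobConf so every record reader can read it back without building its own metaClient.
public final class InternalSchemaJobConfHelper {
  // Hypothetical property name, not an existing Hudi config.
  private static final String INTERNAL_SCHEMA_KEY = "hoodie.hive.internal.schema.json";

  private InternalSchemaJobConfHelper() {}

  // Called once while computing splits, after the schema has been resolved from the timeline.
  public static void put(JobConf job, Option<InternalSchema> schemaOption) {
    if (schemaOption.isPresent()) {
      job.set(INTERNAL_SCHEMA_KEY, SerDeHelper.toJson(schemaOption.get()));
    }
  }

  // Called by each reader; falls back to empty when nothing was stored.
  public static Option<InternalSchema> get(JobConf job) {
    String json = job.get(INTERNAL_SCHEMA_KEY);
    return json == null ? Option.empty() : SerDeHelper.fromJson(json);
  }
}
```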
And maybe place the InternalSchema into the JobConf only when hudi.hive.schema.evolution is true.
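A rough sketch of that gating during split computation, reusing the hypothetical InternalSchemaJobConfHelper from above; the default value of false and the resolveLatestInternalSchema placeholder are assumptions.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.internal.schema.InternalSchema;

// Hypothetical sketch: only resolve and propagate the schema when evolution is enabled.
public final class SchemaPropagationSketch {
  private SchemaPropagationSketch() {}

  public static void maybePropagateSchema(JobConf job) {
    // Config key taken from this discussion; the default value here is an assumption.
    if (!job.getBoolean("hudi.hive.schema.evolution", false)) {
      return; // evolution disabled: skip schema resolution and timeline listing entirely
    }
    InternalSchemaJobConfHelper.put(job, resolveLatestInternalSchema(job));
  }

  // Placeholder for the one-time timeline lookup done while computing splits.
  private static Option<InternalSchema> resolveLatestInternalSchema(JobConf job) {
    return Option.empty();
  }
}
```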
Yeah, let's give it a try. Can you file a fix for it?
Sure, I'll do it.
Describe the problem you faced

HUDI-5000 introduced schema evolution for Hive queries on Hudi tables, which causes HoodieParquetInputFormat to create a metaClient for each split. Constructing a metaClient requires listing the Hudi table's metadata directory, which puts significant pressure on the HDFS NameNode when there are a large number of splits.
If the user is certain that there will be no schema changes, this overhead is unnecessary. Although the current implementation supports controlling schema evolution via the hudi.hive.schema.evolution parameter, it does not skip schema evolution when this parameter is explicitly set to false. Consequently, the metaClient is still created, and the metadata directory is listed.
I propose that when hudi.hive.schema.evolution is explicitly set to false, the construction of the metaClient should be skipped. This way, the doEvolutionForRealtimeInputFormat method would return immediately if the internalSchemaOption is empty, avoiding unnecessary metadata directory listing.
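To make the proposal concrete, here is a minimal sketch of that early return; the shell class is only illustrative, and just the empty-option check reflects the actual proposal.

```java
import org.apache.hudi.common.util.Option;
import org.apache.hudi.internal.schema.InternalSchema;

// Illustrative shell; in real Hudi this check would live in SchemaEvolutionContext.
class SchemaEvolutionEarlyReturnSketch {
  private final Option<InternalSchema> internalSchemaOption;

  SchemaEvolutionEarlyReturnSketch(Option<InternalSchema> internalSchemaOption) {
    this.internalSchemaOption = internalSchemaOption;
  }

  void doEvolutionForRealtimeInputFormat() {
    if (!internalSchemaOption.isPresent()) {
      // hudi.hive.schema.evolution=false (or no evolved schema resolved): return immediately,
      // consistent with skipping metaClient construction and metadata listing for this split.
      return;
    }
    // ... existing schema-evolution handling for the realtime reader would continue here ...
  }
}
```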
I am looking for suggestions or best practices on how to optimize this process. Specifically, I would like to know if there are any strategies or configurations that can reduce the number of list requests initiated during split reads while schema evolution is enabled.
To Reproduce
Steps to reproduce the behavior:
1.
2.
3.
4.
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version :
Spark version :
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) :
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.