mzheng-plaid opened 2 months ago
Ok, I think the root cause is that the upgrade silently turned on schema validation; see https://github.com/apache/hudi/blob/release-0.14.1/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L822:
boolean shouldValidate = config.shouldValidateAvroSchema();
boolean allowProjection = config.shouldAllowAutoEvolutionColumnDrop();
if ((!shouldValidate && allowProjection)
    || getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()
    || StringUtils.isNullOrEmpty(config.getSchema())
) {
  // Check not required
  return;
}
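For context, a minimal sketch (not from the issue; it assumes an existing DataFrame df, and the table name and path are placeholders) of a Spark write that explicitly sets the two options this check reads, which is what should land in the skip branch on 0.14.1:

```scala
// Sketch only: assumes an existing DataFrame `df`; table name and path are placeholders.
// Per the condition above, validation is skipped when hoodie.avro.schema.validate is
// false AND hoodie.datasource.write.schema.allow.auto.evolution.column.drop is true.
df.write
  .format("hudi")
  .option("hoodie.table.name", "table_name")
  .option("hoodie.avro.schema.validate", "false")
  .option("hoodie.datasource.write.schema.allow.auto.evolution.column.drop", "true")
  .mode("append")
  .save("s3://bucket/path/table_name")
```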
Previously, in 0.12.2, the check was (https://github.com/apache/hudi/blob/release-0.12.2/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java#L749C1-L754C6):
if (!config.getAvroSchemaValidate() || getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()) {
  // Check not required
  return;
}
Questions:
1. What is the relationship between hoodie.datasource.write.schema.allow.auto.evolution.column.drop and disabling schema validation?
2. Why is schema validation silently turned on by default now?
Referencing the Slack thread for this discussion: https://apache-hudi.slack.com/archives/C4D716NPQ/p1723826030979209
@ad1happy2go hmm, setting hoodie.datasource.write.schema.allow.auto.evolution.column.drop to true still doesn't skip the schema validation check, any idea why?
We're hard blocked by this issue on upgrading and it's quite painful; let me know if you have any ideas on how to work around this.
It seems like there is a second issue: if you use the default configs for Hudi, this line (https://github.com/apache/hudi/blob/release-0.14.1/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L751):

parameters.getOrDefault(HoodieWriteConfig.AVRO_SCHEMA_VALIDATE_ENABLE.key(), "true")

will actually just ignore the default for hoodie.avro.schema.validate and silently enable schema validation... you now need to explicitly set hoodie.avro.schema.validate to false.
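To illustrate the effect of that hardcoded fallback, here is a plain-Scala illustration (not Hudi code; the parameter maps are assumed):

```scala
// Hypothetical parameter maps (not Hudi code) showing the effect of the hardcoded "true" fallback.
val unset    = Map.empty[String, String]
val explicit = Map("hoodie.avro.schema.validate" -> "false")

// With no key present, the fallback silently turns validation on:
println(unset.getOrElse("hoodie.avro.schema.validate", "true"))    // true
// Only an explicit "false" disables it; the config's own default is never consulted:
println(explicit.getOrElse("hoodie.avro.schema.validate", "true")) // false
```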
Seems like this regression was introduced in https://github.com/apache/hudi/commit/06c8fa5a62fab607d3be6e321a580d9cf13b572a#diff-8bda4b2174721fd642a543528283[…]a320c1d9e1366b27be86bd548d48aR527
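If it helps the fix discussion, a hypothetical sketch of one way to avoid the hardcoded literal would be to fall back to the config's declared default instead (assuming parameters is the writer's existing Map[String, String] of options and that ConfigProperty exposes its default via defaultValue()):

```scala
import org.apache.hudi.config.HoodieWriteConfig

// Hypothetical sketch, not the actual patch: fall back to the declared default of
// hoodie.avro.schema.validate instead of a hardcoded "true".
val shouldValidate = parameters.getOrElse(
  HoodieWriteConfig.AVRO_SCHEMA_VALIDATE_ENABLE.key(),
  HoodieWriteConfig.AVRO_SCHEMA_VALIDATE_ENABLE.defaultValue()
).toBoolean
```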
Thanks a lot @mzheng-plaid for the detailed explanation and for triaging the issue. This sounds reasonable and we should highlight it in our release docs. The default config is confusing in this case.
Created JIRA for tracking the fix - https://issues.apache.org/jira/browse/HUDI-8173
Describe the problem you faced
We're upgrading from Hudi 0.12.2 to Hudi 0.14.1 and are running into failures on all of our log ingestion jobs on:
The only diff in the schema seems to be the snippet below in all of our jobs, where meta is a record type (table_name is a placeholder for every affected table):

Oddly, the table version was bumped even though the commit failed, so we ended up having to run a tedious bulk downgrade command. That seems super surprising.
To Reproduce
Unclear, it seems like some one-time upgrade step of the table version did not run for whatever reason?
Expected behavior
Environment Description

We are running on EMR 7.2
Hudi version : 0.14.1
Spark version : 3.5.1
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) :