apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Properties file corruption caused by write failure #11835

Open · Ytimetravel opened this issue 2 months ago

Ytimetravel commented 2 months ago

Describe the problem you faced

Dear community, recently I discovered a case where a write failure can leave the hoodie.properties file corrupted, which then causes other write tasks to fail. The process in which this situation occurs is as follows:

  1. Executing the commit triggers the maybeDeleteMetadataTable process (if needed).
  2. An exception occurs while updating the table configs during that process, causing the hoodie.properties write to fail (see the stacktrace below).

File status: hoodie.properties is corrupted (len=0); the properties backup file is intact.

  3. A rollback is then triggered.
  4. Since the table version cannot be read correctly at this point, an upgrade from version 0 to 6 is triggered.

File status: hoodie.properties is corrupted (len=0); the properties backup file has been removed.

  5. An attempt is made to recreate the properties backup file.

I think that when performing recoverIfNeeded we should not only check whether the hoodie.properties file exists; we need more information to ensure that the hoodie.properties file is correct, rather than directly skipping file processing and deleting the backup file. Any suggestions?
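For illustration, here is a minimal sketch of what a stricter recoverIfNeeded-style check could look like, assuming the Hadoop FileSystem API. The helper names and the mandatory-key check are hypothetical, not the actual Hudi implementation:

```java
// Illustrative sketch only: a stricter recovery check for hoodie.properties.
// Helper names (isPropertiesFileValid, recoverIfNeeded) are hypothetical.
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class PropertiesRecoverySketch {

  // Treat hoodie.properties as valid only if it exists, is non-empty,
  // and parses with the expected mandatory keys present.
  static boolean isPropertiesFileValid(FileSystem fs, Path propsPath) throws IOException {
    if (!fs.exists(propsPath)) {
      return false;
    }
    FileStatus status = fs.getFileStatus(propsPath);
    if (status.getLen() == 0) {
      // the corruption observed in this issue: a zero-length file
      return false;
    }
    Properties props = new Properties();
    try (InputStream in = fs.open(propsPath)) {
      props.load(in);
    } catch (IOException e) {
      return false;
    }
    // mandatory keys; a full fix could also re-verify the stored checksum
    return props.containsKey("hoodie.table.name")
        && props.containsKey("hoodie.table.version");
  }

  // Restore hoodie.properties from its backup instead of deleting the backup.
  static void recoverIfNeeded(FileSystem fs, Path propsPath, Path backupPath) throws IOException {
    if (isPropertiesFileValid(fs, propsPath)) {
      // original file is healthy: safe to clean up the backup
      fs.delete(backupPath, false);
      return;
    }
    if (fs.exists(backupPath)) {
      // overwrite the corrupt/missing original from the backup copy
      FileUtil.copy(fs, backupPath, fs, propsPath, false, true, fs.getConf());
    }
  }
}
```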

Environment Description

Stacktrace

```
Caused by: org.apache.hudi.exception.HoodieException: Error updating table configs.
  at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:91)
  at org.apache.hudi.internal.HoodieDataSourceInternalWriter.commit(HoodieDataSourceInternalWriter.java:91)
  at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:76)
  ... 69 more
  Suppressed: java.lang.IllegalArgumentException: hoodie.table.name property needs to be specified
    at org.apache.hudi.common.table.HoodieTableConfig.generateChecksum(HoodieTableConfig.java:523)
    at org.apache.hudi.common.table.HoodieTableConfig.getOrderedPropertiesWithTableChecksum(HoodieTableConfig.java:321)
    at org.apache.hudi.common.table.HoodieTableConfig.storeProperties(HoodieTableConfig.java:339)
    at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:438)
    at org.apache.hudi.common.table.HoodieTableConfig.delete(HoodieTableConfig.java:481)
    at org.apache.hudi.table.upgrade.UpgradeDowngrade.run(UpgradeDowngrade.java:151)
    at org.apache.hudi.client.BaseHoodieWriteClient.tryUpgrade(BaseHoodieWriteClient.java:1399)
    at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1255)
    at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1296)
    at org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:769)
    at org.apache.hudi.internal.DataSourceInternalWriterHelper.abort(DataSourceInternalWriterHelper.java:99)
    at org.apache.hudi.internal.HoodieDataSourceInternalWriter.abort(HoodieDataSourceInternalWriter.java:96)
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:82)
    ... 69 more
Caused by: org.apache.hudi.exception.HoodieIOException: Error updating table configs.
  at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:466)
  at org.apache.hudi.common.table.HoodieTableConfig.update(HoodieTableConfig.java:475)
  at org.apache.hudi.common.table.HoodieTableConfig.setMetadataPartitionState(HoodieTableConfig.java:816)
  at org.apache.hudi.common.table.HoodieTableConfig.clearMetadataPartitions(HoodieTableConfig.java:847)
  at org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:1396)
  at org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:275)
  at org.apache.hudi.table.HoodieTable.maybeDeleteMetadataTable(HoodieTable.java:995)
  at org.apache.hudi.table.HoodieSparkTable.getMetadataWriter(HoodieSparkTable.java:116)
  at org.apache.hudi.table.HoodieTable.getMetadataWriter(HoodieTable.java:947)
  at org.apache.hudi.client.BaseHoodieWriteClient.writeTableMetadata(BaseHoodieWriteClient.java:359)
  at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:285)
  at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:236)
  at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211)
  at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:88)
  ... 71 more
Caused by: java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
  at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:3520)
  at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:3498)
  at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:3690)
  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:3625)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:80)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:115)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:80)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:115)
  at org.apache.hudi.common.fs.SizeAwareFSDataOutputStream.close(SizeAwareFSDataOutputStream.java:75)
  at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:449)
  ... 84 more
```

danny0405 commented 2 months ago

The update to the properties file should be atomic, and we already do that in HoodieTableConfig.modify, but it just throws to the writer if any exception happens; the reader would still work by reading the backup file.
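For context, the backup-based update pattern described above looks roughly like the following simplified sketch. It is an illustration of the general backup-then-rewrite idea, not the actual HoodieTableConfig.modify code:

```java
// Simplified sketch of a backup-based properties update; not the actual
// HoodieTableConfig.modify implementation.
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class BackupThenModifySketch {

  static void modify(FileSystem fs, Path propsPath, Path backupPath, Properties newProps)
      throws IOException {
    // 1. snapshot the current file so readers always have a consistent copy
    FileUtil.copy(fs, propsPath, fs, backupPath, false, true, fs.getConf());

    // 2. rewrite the original; if this step fails (e.g. an interrupted HDFS
    //    close, as in the stacktrace above), the original may be left empty,
    //    but the backup still holds the last good state for readers
    try (OutputStream out = fs.create(propsPath, true)) {
      newProps.store(out, "updated table config");
    }

    // 3. only after a successful rewrite is the backup removed
    fs.delete(backupPath, false);
  }
}
```

In this issue the failure happens between steps 2 and 3: the rewrite of the original file is interrupted and leaves it empty, while the backup stays intact until the later upgrade path deletes it.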

> we need more information to ensure that the hoodie.properties file is correct, rather than directly skipping file processing and deleting the backup file.

+1 for this, we need to strengthen the handling of properties file exceptions for the invoker.

Ytimetravel commented 2 months ago

@danny0405 My current understanding is as follows:

  1. The properties_backup is a copy of the original properties.
  2. The expected outcome is that the original properties file should be the same as properties_backup. Can we check whether the original properties file is error-free by comparing file sizes?
danny0405 commented 2 months ago

> Can we check whether the original properties file is error-free by comparing file sizes?

We have a checksum in the properties file.
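For illustration, a checksum-based validity check could look like the sketch below. The key name hoodie.table.checksum and the CRC32 scheme here are assumptions for the example; the real logic is whatever HoodieTableConfig.generateChecksum implements:

```java
// Illustration only: checksum-based validation instead of comparing file sizes.
// The key name "hoodie.table.checksum" and the CRC32 scheme are assumptions;
// the actual algorithm lives in HoodieTableConfig.generateChecksum.
import java.util.Properties;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class ChecksumValidationSketch {

  static long computeChecksum(Properties props) {
    // use a deterministic ordering so the checksum is stable across rewrites
    TreeMap<String, String> ordered = new TreeMap<>();
    props.forEach((k, v) -> {
      if (!"hoodie.table.checksum".equals(k)) {
        ordered.put(k.toString(), v.toString());
      }
    });
    CRC32 crc = new CRC32();
    ordered.forEach((k, v) -> crc.update((k + "=" + v).getBytes()));
    return crc.getValue();
  }

  // A properties file is valid only if its stored checksum matches the recomputed one.
  static boolean isValid(Properties props) {
    String stored = props.getProperty("hoodie.table.checksum");
    if (stored == null) {
      return false;
    }
    try {
      return Long.parseLong(stored) == computeChecksum(props);
    } catch (NumberFormatException e) {
      return false;
    }
  }
}
```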

Ytimetravel commented 2 months ago

@danny0405 Sounds good. Can I optimize the decision-making process here?

danny0405 commented 2 months ago

Sure, would be glad to review your fix.

ad1happy2go commented 1 month ago

@Ytimetravel Did you get a chance to work on this? Do we have a JIRA for it?

nsivabalan commented 2 weeks ago

Sorry, I am not sure I fully understand how exactly we got into the corrupt state.

From what I see, createMetaClient(true) fails. But if we chase the chain of calls, it ends up at https://github.com/apache/hudi/blob/3a57591152065ddb317c5fe67bab8163730f1e73/hudi-common/src/main/java/org/apache/hudi/common/util/ConfigUtils.java#L541, which actually accounts for reading from either the backup or the original property file.
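For reference, the read fallback mentioned there is roughly of the following shape; this is a simplified sketch, not the actual ConfigUtils code:

```java
// Simplified sketch of reading table config with a backup fallback;
// not the actual ConfigUtils/HoodieTableConfig code.
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWithFallbackSketch {

  static Properties fetchConfigs(FileSystem fs, Path propsPath, Path backupPath) throws IOException {
    try {
      return load(fs, propsPath);
    } catch (IOException e) {
      // the original is missing or unreadable: fall back to the backup.
      // In the scenario above both copies eventually go bad (original empty,
      // backup deleted by the upgrade path), so the fallback no longer helps.
      return load(fs, backupPath);
    }
  }

  private static Properties load(FileSystem fs, Path path) throws IOException {
    Properties props = new Properties();
    try (InputStream in = fs.open(path)) {
      props.load(in);
    }
    return props;
  }
}
```

Note that an empty original still parses as an empty Properties object, so a plain load-based fallback alone would not detect the zero-length corruption described in this issue.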

Can you help me understand a bit more?