apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0
5.45k stars, 2.43k forks

[HUDI-8400] apply 'write.ignore.failed' when write data failed v2 #12150

Open fhan688 opened 1 month ago

fhan688 commented 1 month ago

Change Logs

In the Flink engine, if an exception occurs while a task is writing data, the exception is swallowed and reported to StreamWriteCoordinator along with the write event. StreamWriteCoordinator then decides whether to commit despite the write failure, according to 'write.ignore.failed'.

This PR applies 'write.ignore.failed' earlier, at the point where the write failure occurs, so that the exception is thrown sooner.

For example, if the checkpoint (CP) interval of a Flink job is 15 minutes, the exception is not detected until the CP commit, which increases data latency in real-time-sensitive scenarios.
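
The difference between the two behaviors can be illustrated with a minimal Java sketch. Everything in it (the class `FailFastWriteSketch`, the `RecordWriter` interface, the `writeAll` method) is hypothetical and is not Hudi or Flink code. It only assumes that a boolean flag plays the role of 'write.ignore.failed'.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; these are not Hudi's actual classes. */
class FailFastWriteSketch {

    /** Stand-in for a per-record write that may fail. */
    interface RecordWriter {
        void write(String record) throws Exception;
    }

    /**
     * Writes records one by one.
     * If ignoreFailed is false, the first failure is rethrown immediately (fail fast).
     * If ignoreFailed is true, failures are collected and returned, so that a
     * coordinator-like component could inspect them later, e.g. at commit time.
     */
    static List<String> writeAll(List<String> records, RecordWriter writer, boolean ignoreFailed) {
        List<String> errors = new ArrayList<>();
        for (String record : records) {
            try {
                writer.write(record);
            } catch (Exception e) {
                if (!ignoreFailed) {
                    throw new IllegalStateException("Write failed for record: " + record, e);
                }
                errors.add(record + ": " + e.getMessage());
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Simulated writer that rejects records starting with "bad".
        RecordWriter writer = r -> {
            if (r.startsWith("bad")) {
                throw new Exception("simulated write error");
            }
        };
        List<String> input = List.of("a", "bad-1", "b");

        // Deferred handling: the failure is only collected.
        System.out.println("collected errors: " + writeAll(input, writer, true));

        // Fail-fast handling: the failure surfaces at write time.
        try {
            writeAll(input, writer, false);
        } catch (IllegalStateException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

With the flag set to false, the failure surfaces at write time instead of waiting for whatever later step inspects the collected errors, which is the latency gap described above.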

Impact

module: hudi-client, hudi-flink-datasource

Risk level (write none, low, medium or high below)

low

Documentation Update

None

Contributor's checklist

fhan688 commented 1 month ago

The previous PR was reverted: https://github.com/apache/hudi/pull/12136. I am reopening it here, since more discussion may be needed. @danny0405

danny0405 commented 4 weeks ago

We should clarify these items:

  1. Should we promote the write.ignore.failed option to a common write config for all engines? Previously each engine had its own options and behavior.
  2. Should we throw the exception in the write handles or in the driver (after the write statuses are collected)?
  3. Should this option default to false or true?

fhan688 commented 4 weeks ago

  1. I agree. write.ignore.failed is currently a config in FlinkOptions; this PR promotes it to HoodieWriteConfig in the hudi-client-common module under the name 'hoodie.write.ignore.failed'.
  2. I think failing fast is the better choice. In a job with heavy production traffic, a delay of several minutes means a huge number of records must be reprocessed on restore.
  3. false, for the sake of data quality.

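
As a rough sketch of points 1 and 3, the option could be resolved from generic properties as shown below. The key 'hoodie.write.ignore.failed' and the default of false are the ones proposed in this thread. The class `IgnoreFailedConfigSketch` and its `ignoreFailed` helper are hypothetical and do not reflect the actual HoodieWriteConfig API.

```java
import java.util.Properties;

/** Hypothetical helper; not part of HoodieWriteConfig. */
class IgnoreFailedConfigSketch {

    // Key name proposed in this PR for the engine-agnostic option.
    static final String IGNORE_FAILED_KEY = "hoodie.write.ignore.failed";

    // Default suggested in the discussion: do not ignore failed writes.
    static final boolean IGNORE_FAILED_DEFAULT = false;

    /** Returns the configured value, falling back to the default when the key is unset. */
    static boolean ignoreFailed(Properties props) {
        String raw = props.getProperty(IGNORE_FAILED_KEY, String.valueOf(IGNORE_FAILED_DEFAULT));
        return Boolean.parseBoolean(raw);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("unset: " + ignoreFailed(props));

        props.setProperty(IGNORE_FAILED_KEY, "true");
        System.out.println("explicit: " + ignoreFailed(props));
    }
}
```

Keeping the default at false means a job only tolerates write failures when the user opts in explicitly.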

danny0405 commented 3 weeks ago

@fhan688 Let's file a JIRA issue around this and move the discussion there.

fhan688 commented 3 weeks ago

OK. https://issues.apache.org/jira/browse/HUDI-8400

danny0405 commented 3 weeks ago

Sorry, I meant a GH issue, which makes it easier to communicate.

fhan688 commented 3 weeks ago

Thanks. https://github.com/apache/hudi/issues/12187