apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[HUDI-8501] Improve SizeAwareDataInputStream to implement idempotent #12231

Open usberkeley opened 1 week ago

usberkeley commented 1 week ago

Change Logs

  1. Improve SizeAwareDataInputStream so that its operations are idempotent
  2. Improve SizeAwareDataInputStream#skipBytes to avoid potential errors in the skipped-byte count
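
For context on item 2: `java.io.DataInputStream#skipBytes(int)` returns the number of bytes actually skipped, which may be smaller than the number requested, and it does not throw `EOFException` when the input is too short. A wrapper that adds the requested count to a running total can therefore over-count. The sketch below only demonstrates that JDK behaviour; the class `SkipCountDemo` and its helper are hypothetical and are not taken from the Hudi code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class for illustration; not the Hudi implementation.
class SkipCountDemo {

  // Returns how many bytes DataInputStream#skipBytes actually skipped.
  static int skippedBytes(byte[] data, int requested) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      return in.skipBytes(requested);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] data = new byte[4];
    int requested = 10;
    int skipped = skippedBytes(data, requested);
    // Only 4 bytes are available, so fewer bytes are skipped than requested.
    System.out.println("requested=" + requested + ", skipped=" + skipped);
  }
}
```

Counting the returned value rather than the requested value keeps the running total in step with the underlying stream.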

Impact

Enhance robustness

Risk level (write none, low, medium or high below)

none

Documentation Update

none

usberkeley commented 1 week ago

@hudi-bot run azure

danny0405 commented 2 days ago

Can a corrupt SizeAwareDataInputStream be reused in other code?

usberkeley commented 2 days ago

> Can a corrupt SizeAwareDataInputStream be reused in other code?

A corrupt SizeAwareDataInputStream cannot be used in other code because it lacks robustness. Under boundary conditions, skipBytes may go out of bounds, and APIs like readInt are non-idempotent, posing potential risks.
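
The thread does not show the code under discussion, so the following is only a sketch of one way a size-tracking wrapper could guard against both concerns. It assumes the wrapper keeps a running byte count around a `DataInputStream`; the class `CountingDataInput` and all of its members are hypothetical. The counter is updated only after the underlying call returns, so a failed `readInt` leaves it unchanged, and `skipBytes` adds the number of bytes actually skipped.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical wrapper for illustration; not the Hudi SizeAwareDataInputStream.
class CountingDataInput {
  private final DataInputStream in;
  private int bytesRead = 0;

  CountingDataInput(DataInputStream in) {
    this.in = in;
  }

  int readInt() throws IOException {
    int value = in.readInt();    // throws EOFException on short input
    bytesRead += Integer.BYTES;  // reached only if the read succeeded
    return value;
  }

  int skipBytes(int n) throws IOException {
    int skipped = in.skipBytes(n); // may be less than n
    bytesRead += skipped;          // count what was really skipped
    return skipped;
  }

  int getBytesRead() {
    return bytesRead;
  }

  // Helper: attempts one readInt and reports the resulting counter value.
  static int countAfterReadInt(byte[] data) {
    CountingDataInput c =
        new CountingDataInput(new DataInputStream(new ByteArrayInputStream(data)));
    try {
      c.readInt();
    } catch (IOException e) {
      // Expected when fewer than four bytes are available.
    }
    return c.getBytesRead();
  }

  public static void main(String[] args) {
    System.out.println(countAfterReadInt(new byte[2])); // read fails, counter stays 0
    System.out.println(countAfterReadInt(new byte[4])); // read succeeds, counter is 4
  }
}
```

Note that the underlying stream may still have consumed some bytes when a read fails; the sketch only keeps the counter from recording a read that never completed.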

danny0405 commented 2 days ago

> A corrupt SizeAwareDataInputStream cannot be used in other code

I mean, is there any opportunity for that to happen?

usberkeley commented 1 day ago

> > A corrupt SizeAwareDataInputStream cannot be used in other code
>
> I mean, is there any opportunity for that to happen?

Oh, I see now. The current call chain doesn't trigger it; this PR is just a code optimization.