apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[HUDI-8503] Fix Trino failure when reading corrupted block at end of log file #12237

Open usberkeley opened 1 week ago

usberkeley commented 1 week ago

Change Logs

Background

When a corrupted block appears at the end of a log file, the Trino reader (LogScanner) fails while reading it. To check for corruption, Hudi calls InputStream#seek to jump to the end of the LogBlock. However, Trino's TrinoInputStream#seek does not necessarily throw an EOFException when seeking beyond the end of the file. Some file system implementations, such as AzureInputStream#seek, throw a generic IOException instead.

Ref: trino-filesystem-azure AzureInputStream#seek

    @Override
    public void seek(long newPosition)
            throws IOException
    {
        ensureOpen();
        if (newPosition < 0) {
            throw new IOException("Negative seek offset");
        }
        if (newPosition > fileSize) {
            throw new IOException("Cannot seek to %s. File size is %s: %s".formatted(newPosition, fileSize, location));
        }
        nextPosition = newPosition;
    }
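
To make the failure mode concrete, here is a minimal, self-contained sketch. It assumes (as the background above implies) that the corruption check treats only EOFException as the "block is truncated" signal. `BoundedStream` and `probe` are hypothetical names invented for this illustration; they are not Hudi or Trino classes.

```java
import java.io.EOFException;
import java.io.IOException;

final class SeekPastEndDemo {
    /** Hypothetical stand-in for a stream whose seek() rejects out-of-range positions with a plain IOException. */
    static final class BoundedStream {
        private final long fileSize;
        private long position;

        BoundedStream(long fileSize) {
            this.fileSize = fileSize;
        }

        void seek(long newPosition) throws IOException {
            if (newPosition < 0) {
                throw new IOException("Negative seek offset");
            }
            if (newPosition > fileSize) {
                throw new IOException("Cannot seek to " + newPosition + ". File size is " + fileSize);
            }
            position = newPosition;
        }
    }

    /** Hypothetical corruption probe that recognizes only EOFException as a truncated block. */
    static String probe(BoundedStream in, long blockEnd) {
        try {
            in.seek(blockEnd);
            return "ok";
        } catch (EOFException e) {
            // The branch the check relies on; never taken for BoundedStream.
            return "corrupt";
        } catch (IOException e) {
            // A plain IOException is not an EOFException, so it lands here instead.
            return "unexpected failure";
        }
    }
}
```

With a 200-byte file, `probe(new BoundedStream(200), 250)` takes the generic IOException branch rather than the EOFException branch, so a truncated block surfaces as a read failure instead of being recognized as corruption.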

Solution

Since we cannot control how the query side handles seeking beyond the end of the file, the check should not depend on which exception is thrown. Instead, compare the end position of the LogBlock with the end position of the file to determine whether the file holds enough bytes to read the LogBlock.
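
The comparison can be sketched as follows. This is an illustrative helper, not the code in this PR: the class name `LogBlockBoundsCheck`, the method `blockFitsInFile`, and its parameters are assumptions made for the example, and the caller is assumed to already know the block's start offset, its declared size, and the file length.

```java
final class LogBlockBoundsCheck {
    private LogBlockBoundsCheck() {
    }

    /**
     * Returns true when a block that starts at blockStart and spans blockSize bytes
     * lies entirely within a file of fileSize bytes. No seek is performed, so the
     * result does not depend on how any particular stream reports a seek past the end.
     */
    static boolean blockFitsInFile(long blockStart, long blockSize, long fileSize) {
        if (blockStart < 0 || blockSize < 0 || fileSize < 0) {
            return false;
        }
        // Written as a subtraction so that blockStart + blockSize cannot overflow
        // when a corrupted header declares a huge block size.
        return blockStart <= fileSize && blockSize <= fileSize - blockStart;
    }

    public static void main(String[] args) {
        // Block ends at offset 150 in a 200-byte file: it fits.
        System.out.println(blockFitsInFile(100, 50, 200));
        // Block would end at offset 250 in a 200-byte file: treat it as corrupted.
        System.out.println(blockFitsInFile(100, 150, 200));
    }
}
```

A block that fails this check can be flagged as corrupted before any seek is attempted, which keeps the behavior the same across file system implementations.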

Impact

none

Risk level (write none, low medium or high below)

none

Documentation Update

none

Contributor's checklist

hudi-bot commented 2 days ago

CI report:

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build