When a corrupted block appears at the end of a Log file, the Trino Reader (LogScanner) fails to read the file. To check for corruption, Hudi uses InputStream#seek to jump to the expected end of the LogBlock and relies on an EOFException to signal that the block runs past the end of the file. However, implementations of Trino's TrinoInputStream#seek do not necessarily throw an EOFException when seeking beyond the end of the file. Some file systems throw a plain IOException instead, as AzureInputStream#seek does.
Ref: trino-filesystem-azure AzureInputStream#seek
    @Override
    public void seek(long newPosition)
            throws IOException
    {
        ensureOpen();
        if (newPosition < 0) {
            throw new IOException("Negative seek offset");
        }
        if (newPosition > fileSize) {
            throw new IOException("Cannot seek to %s. File size is %s: %s".formatted(newPosition, fileSize, location));
        }
        nextPosition = newPosition;
    }
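The mismatch can be illustrated with a small, self-contained sketch. The class, interface, and method names below are hypothetical and are not taken from Hudi or Trino. The sketch only shows that a handler written for EOFException does not match a plain IOException, even though both report the same condition.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical demo class, not Hudi or Trino code.
final class SeekExceptionDemo {
    private SeekExceptionDemo() {}

    // Minimal stand-in for a seekable input stream (assumption for illustration).
    interface Seekable {
        void seek(long position) throws IOException;
    }

    // Reports how a seek-based corruption probe would classify the outcome.
    static String probe(Seekable stream, long position) {
        try {
            stream.seek(position);
            return "ok";
        } catch (EOFException e) {
            // The only case a seek-based check recognizes as a truncated block.
            return "corrupt";
        } catch (IOException e) {
            // A plain IOException, as thrown by AzureInputStream#seek, lands here.
            return "failed";
        }
    }
}
```

A stream that throws EOFException is classified as "corrupt", while one that throws a plain IOException for the same out-of-range position is classified as "failed", which corresponds to the read failure described above.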
Solution
Since we cannot control how the query side handles exceptions when seeking beyond the end of the file, it is recommended not to rely on the exception type at all. Instead, by comparing the end position of the LogBlock with the end position of the file, we can determine whether the file contains enough bytes to read the LogBlock.
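A minimal sketch of this comparison follows. It is not the actual change in this PR: the class name, method name, and parameters are hypothetical, and it assumes the block's start position, its declared size, and the file size are already known as long values.

```java
// Hypothetical helper for illustration, not Hudi's actual API.
final class LogBlockBoundsCheck {
    private LogBlockBoundsCheck() {}

    /**
     * Returns true when a block that starts at blockStartPos and declares
     * blockSize bytes would extend past the end of a file of fileSize bytes.
     */
    static boolean extendsPastEndOfFile(long blockStartPos, long blockSize, long fileSize) {
        if (blockStartPos < 0 || fileSize < 0) {
            throw new IllegalArgumentException("position and file size must be non-negative");
        }
        if (blockSize < 0) {
            // A negative size can only come from a damaged header, so treat it as out of bounds.
            return true;
        }
        // Written as a subtraction so that a very large blockSize read from a
        // corrupted header cannot overflow blockStartPos + blockSize.
        return blockStartPos > fileSize || blockSize > fileSize - blockStartPos;
    }
}
```

With this shape, a block that ends exactly at the end of the file is treated as readable, and no seek is needed, so the result does not depend on which exception a particular file system throws.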
Change Logs
Detect a corrupted LogBlock at the end of a Log file by comparing the end position of the LogBlock with the end position of the file, instead of relying on InputStream#seek to throw an EOFException.
Impact
none
Risk level (write none, low medium or high below)
none
Documentation Update
none
Contributor's checklist