hellishfire opened 2 months ago
@hellishfire, how would users not notice the corruption when the related exception is thrown? My change is about auto-closing: it ensures nothing is written automatically (at close) after an exception has happened. Users of the API should not continue writing data after an exception.
We found that some users still try to write after an exception, then successfully close the file, only to find out later that the file cannot be opened for reading because the file footer is missing. Perhaps it would be better to deny any further writes, to make it fool-proof?
@gszadovszky On another note, in InternalParquetRecordWriter.java:
```java
public void write(T value) throws IOException, InterruptedException {
  try {
    writeSupport.write(value);
    ++recordCount;
    checkBlockSizeReached();
  } catch (Throwable t) {
    aborted = true;
    throw t;
  }
}
```
Should we narrow the try-catch scope to exclude exceptions thrown by writeSupport (i.e. illegal input values rejected by writeSupport), which are irrelevant to the actual file I/O?
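For concreteness, a rough sketch of what the narrowed scope might look like (illustrative only; it reuses the fields from the snippet above and assumes a failed writeSupport.write leaves the file untouched, which is precisely the doubt raised in the reply below):

```java
public void write(T value) throws IOException, InterruptedException {
  // Assumption: an exception here means the record was rejected before any
  // file I/O happened, so the writer does not need to be marked as aborted.
  writeSupport.write(value);
  try {
    ++recordCount;
    // checkBlockSizeReached() may flush a row group, i.e. perform real file
    // I/O, so a failure here can leave the file in an inconsistent state.
    checkBlockSizeReached();
  } catch (Throwable t) {
    aborted = true;
    throw t;
  }
}
```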
> We found that some users still try to write after an exception, then successfully close the file, only to find out later that the file cannot be opened for reading because the file footer is missing. Perhaps it would be better to deny any further writes, to make it fool-proof?
I don't think parquet-java is fool-proof in any way. And how do you think we should handle the case of continued writing with a writer that is already "aborted"? Throw yet another exception? I am not against it, but I cannot see the point. Also, whatever we add must not significantly impact write performance.
> Should we narrow the try-catch scope to exclude exceptions thrown by writeSupport (i.e. illegal input values rejected by writeSupport), which are irrelevant to the actual file I/O?
It would make sense, but I am not sure we can differentiate the potential exceptions with 100% certainty, or tell whether the file writing was already affected at the time of the throw.
Feel free to put up a PR if you would like to work on this. I'm happy to review.
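For reference, a fail-fast guard could be as small as one boolean check at the top of write, which should not measurably affect write performance. A minimal sketch (hypothetical, not the current parquet-java code; it reuses the fields from the snippet above):

```java
public void write(T value) throws IOException, InterruptedException {
  if (aborted) {
    // Fail fast: records accepted after an abort would never be flushed on
    // close, so taking them silently produces an unreadable file.
    throw new IOException("Cannot write: this writer was aborted by a previous exception");
  }
  try {
    writeSupport.write(value);
    ++recordCount;
    checkBlockSizeReached();
  } catch (Throwable t) {
    aborted = true;
    throw t;
  }
}
```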
OK. Would it be possible for you to elaborate a bit on why the aborted state was introduced in the first place? For instance, which use case triggered the bug that led you to fix it? It was not clearly explained in your commit message, so I am somewhat lacking the context for that commit.
@hellishfire, my bad, I should have explained in more detail. In many cases, Parquet writers are used in a try-with-resources construct. That invokes close in every case, including when exceptions are thrown during the writes. close invokes flush by default, which actually does some writing. This may cause issues such as exceptions thrown while closing, which is not always a problem in itself. But if you use, for example, direct memory via the ByteBufferAllocator interface, that memory needs to be released; if the release fails, it may lead to memory leaks.
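For illustration, a minimal sketch of that pattern using the example Group writer from parquet-hadoop (the schema, path, and records are assumed to be provided by the caller):

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;

class WriteExample {
  // try-with-resources guarantees close() runs even when write() throws.
  // close() normally flushes buffered data (i.e. performs real file writes);
  // after the auto-closing change, an aborted writer skips that flush and
  // only releases resources such as buffers from a ByteBufferAllocator.
  static void writeAll(Path path, MessageType schema, Iterable<Group> records) throws IOException {
    try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
        .withType(schema)
        .build()) {
      for (Group record : records) {
        writer.write(record); // an exception here still propagates, but only after close() has run
      }
    }
  }
}
```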
Describe the bug, including details regarding any error messages, version, and platform.
After PARQUET-2437, InternalParquetRecordWriter switches to the aborted state if write throws, but it still accepts further writes, even though those writes will not be flushed during close, leaving the file incomplete. Users will not notice the corrupted file until they actually try to read it.
Should we check for the aborted state and reject further writes to avoid such surprises? @gszadovszky
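To make the surprise concrete, a minimal sketch of the sequence described above (badRecord and goodRecord are hypothetical placeholders; the writer setup is elided):

```java
try {
  writer.write(badRecord); // throws; the writer silently enters the aborted state
} catch (Exception e) {
  // the application swallows the error and carries on
}
writer.write(goodRecord);  // still accepted, but will never be flushed
writer.close();            // completes without error, yet writes no footer,
                           // so the file only fails when someone tries to read it
```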
Component(s)
No response