James Turton: Bump?
Aaron Blake Niskode-Dossett / @dossett: @dzamo I'm not a committer on this project, but if you'd like to submit a fix for this I'd be happy to review it (for whatever that would be worth).
I looked at the code and have a couple of thoughts:
The hadoop-common library does (de)compressor pooling and has a DoNotPool annotation used by, yep, [BuiltInGzipDecompressor](https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zlib/BuiltInGzipDecompressor.java#L35).
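For reference, this is roughly how that pooling looks from the caller's side. This is a minimal sketch assuming org.apache.hadoop.io.compress.CodecPool and GzipCodec; the comments reflect my reading of how @DoNotPool is handled, not something this issue verifies:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecPoolExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // Borrow a decompressor. Without native zlib, GzipCodec hands back a
    // BuiltInGzipDecompressor, whose class carries the @DoNotPool annotation.
    Decompressor decompressor = CodecPool.getDecompressor(codec);
    try {
      // ... use the decompressor on this thread only ...
    } finally {
      // Return it when done. CodecPool checks for @DoNotPool and declines to
      // cache such instances, so they are never handed to a second borrower.
      CodecPool.returnDecompressor(decompressor);
    }
  }
}
```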
Timothy Miller / @theosib-amazon: Does the resolution of DRILL-8139 mean that PARQUET-2126 is also resolved? Or are there more steps that are required?
James Turton: @theosib-amazon DRILL-8139 was just a workaround for this library bug in application code. @dossett's comment contains some good insights about how to fix the underlying problem. I've got a long to-do list at the moment but I'll try to get back here eventually if a maintainer doesn't take care of this first.
Timothy Miller / @theosib-amazon: This bug isn't affecting me. My employer has tasked me with improving the performance of Trino on Parquet sources, and I've found that to do that, I have to make improvements to ParquetMR. Well, while I have to learn a lot about ParquetMR anyway, I might as well help fix some bugs in the process. I've somewhat randomly picked PARQUET-2069 to work on, but it would be even better to work on what the maintainers think are the most important problems. Let me know.
James Turton: @theosib-amazon I sent an email to the dev mailing list three days ago in the hopes that a maintainer would come past here. Perhaps one still will, not sure.
Timothy Miller / @theosib-amazon: Alright. You have a point. If the maintainers want me to delete that stuff, they can let me know, and I'll go ahead and do it.
Parth Chandra / @parthchandra: FWIW, I just submitted a PR to implement async I/O for the Parquet file reader, and in my testing with gzip I did not hit this corruption issue. That doesn't mean the issue isn't there, and I'm sure my PR will increase the likelihood of hitting it. This fix is highly appreciated.
Timothy Miller / @theosib-amazon: Thanks, Steve. I really like your suggestion. I think we should hold off on this PR for now because I'm pretty sure the double-pooling is a problem. So I'll look into this and maybe implement something completely different.
The code for returning Compressor objects to the caller goes to some lengths to achieve thread safety, including keeping Codec objects in an Apache Commons pool that has thread-safe borrow semantics. This is all undone by the BytesCompressor and BytesDecompressor maps in org.apache.parquet.hadoop.CodecFactory, which end up caching single compressor and decompressor instances due to the code in CodecFactory.getCompressor and CodecFactory.getDecompressor. When the caller runs multiple threads, those threads end up sharing compressor and decompressor instances.
For compressors based on Xerial Snappy this bug has no effect because that library is itself thread safe. But when BuiltInGzipCompressor from Hadoop is selected for the CompressionCodecName.GZIP case, serious problems ensue. That class is not thread safe and sharing one instance of it between threads produces both silent data corruption and JVM crashes.
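A simplified sketch of the sharing described above; the class and method names here are placeholders chosen for illustration, not the actual parquet-mr code:

```java
import java.util.HashMap;
import java.util.Map;

// Placeholder types for illustration only; the real logic lives in
// org.apache.parquet.hadoop.CodecFactory and the Hadoop codec classes.
class SharedCompressorCache {

  interface Compressor {
    byte[] compress(byte[] input); // stateful and not thread safe, like BuiltInGzipCompressor
  }

  private final Map<String, Compressor> compressors = new HashMap<>();

  // The first caller creates the compressor; every later caller asking for the
  // same codec, regardless of thread, receives that same instance.
  Compressor getCompressor(String codecName) {
    return compressors.computeIfAbsent(codecName, name -> newCompressor(name));
  }

  private Compressor newCompressor(String codecName) {
    // Would wrap the underlying Hadoop codec; elided in this sketch.
    throw new UnsupportedOperationException("sketch only");
  }
}
```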
To fix this situation, parquet-mr should stop caching single compressor and decompressor instances.
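One possible shape for such a fix, shown purely as an illustration of the idea rather than the change that was ultimately made: give each thread its own cached instances (or create a fresh instance per request), so that no (de)compressor is ever visible to two threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Same placeholder Compressor idea as in the sketch above; illustration only.
class PerThreadCompressorCache {

  interface Compressor {
    byte[] compress(byte[] input);
  }

  // Each thread gets its own map, so no compressor instance is ever shared
  // between threads.
  private final ThreadLocal<Map<String, Compressor>> compressorsByThread =
      ThreadLocal.withInitial(ConcurrentHashMap::new);

  Compressor getCompressor(String codecName) {
    return compressorsByThread.get().computeIfAbsent(codecName, this::newCompressor);
  }

  private Compressor newCompressor(String codecName) {
    // Would wrap the underlying Hadoop codec; elided in this sketch.
    throw new UnsupportedOperationException("sketch only");
  }
}
```

The trade-off is that instances are no longer reused across threads, which is effectively what the sentence above asks for.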
Reporter: James Turton
Related issues:
Note: This issue was originally created as PARQUET-2126. Please see the migration documentation for further details.