James Turton: Bump?
Aaron Blake Niskode-Dossett / @dossett: @dzamo I'm not a committer on this project, but if you'd like to submit a fix for this I'd be happy to review it (for whatever that would be worth).
I looked at the code and have a couple of thoughts:
The hadoop-common library does (de)compressor pooling and has a DoNotPool annotation used by, yep, [BuiltInGzipDecompressor](https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zlib/BuiltInGzipDecompressor.java#L35).
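For reference, this is roughly how that pooling looks from the caller's side. This is a minimal sketch assuming org.apache.hadoop.io.compress.CodecPool and GzipCodec; the comments reflect my reading of how @DoNotPool is handled, not something this issue verifies:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecPoolExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // Borrow a decompressor. Without native zlib, GzipCodec hands back a
    // BuiltInGzipDecompressor, whose class carries the @DoNotPool annotation.
    Decompressor decompressor = CodecPool.getDecompressor(codec);
    try {
      // ... use the decompressor on this thread only ...
    } finally {
      // Return it when done. CodecPool checks for @DoNotPool and declines to
      // cache such instances, so they are never handed to a second borrower.
      CodecPool.returnDecompressor(decompressor);
    }
  }
}
```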
Timothy Miller / @theosib-amazon: Does the resolution of DRILL-8139 mean that PARQUET-2126 is also resolved? Or are there more steps that are required?
James Turton: @theosib-amazon DRILL-8139 was just a workaround for this library bug in application code. @dossett's comment contains some good insights about how to fix the underlying problem. I've got a long to-do list at the moment but I'll try to get back here eventually if a maintainer doesn't take care of this first.
Timothy Miller / @theosib-amazon: This bug isn't affecting me. My employer has tasked me with improving the performance of Trino on Parquet sources, and I've found that to do that, I have to make improvements to ParquetMR. Well, while I have to learn a lot about ParquetMR anyway, I might as well help fix some bugs in the process. I've somewhat randomly picked PARQUET-2069 to work on, but it would be even better to work on what the maintainers think are the most important problems. Let me know.
James Turton: @theosib-amazon I sent an email to the dev mailing list three days ago in the hopes that a maintainer would come past here. Perhaps one still will, not sure.
Timothy Miller / @theosib-amazon: Alright. You have a point. If the maintainers want me to delete that stuff, they can let me know, and I'll go ahead and do it.
Parth Chandra / @parthchandra: FWIW, I just submitted a PR to implement async I/O for the Parquet file reader, and in my testing with gzip I did not hit this corruption issue. That doesn't mean the issue isn't there, and I'm sure my PR will increase the likelihood of hitting it. This fix is highly appreciated.
Timothy Miller / @theosib-amazon: Thanks, Steve. I really like your suggestion. I think we should hold off on this PR for now because I'm pretty sure the double-pooling is a problem. So I'll look into this and maybe implement something completely different.
The code for returning Compressor objects to the caller goes to some lengths to achieve thread safety, including keeping Codec objects in an Apache Commons pool that has thread-safe borrow semantics. This is all undone by the BytesCompressor and BytesDecompressor maps in org.apache.parquet.hadoop.CodecFactory, which end up caching single compressor and decompressor instances due to the code in CodecFactory.getCompressor and CodecFactory.getDecompressor. When the caller runs multiple threads, those threads end up sharing compressor and decompressor instances.
For compressors based on Xerial Snappy this bug has no effect because that library is itself thread safe. But when BuiltInGzipCompressor from Hadoop is selected for the CompressionCodecName.GZIP case, serious problems ensue. That class is not thread safe and sharing one instance of it between threads produces both silent data corruption and JVM crashes.
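A simplified sketch of the sharing described above; the class and method names here are placeholders chosen for illustration, not the actual parquet-mr code:

```java
import java.util.HashMap;
import java.util.Map;

// Placeholder types for illustration only; the real logic lives in
// org.apache.parquet.hadoop.CodecFactory and the Hadoop codec classes.
class SharedCompressorCache {

  interface Compressor {
    byte[] compress(byte[] input); // stateful and not thread safe, like BuiltInGzipCompressor
  }

  private final Map<String, Compressor> compressors = new HashMap<>();

  // The first caller creates the compressor; every later caller asking for the
  // same codec, regardless of thread, receives that same instance.
  Compressor getCompressor(String codecName) {
    return compressors.computeIfAbsent(codecName, name -> newCompressor(name));
  }

  private Compressor newCompressor(String codecName) {
    // Would wrap the underlying Hadoop codec; elided in this sketch.
    throw new UnsupportedOperationException("sketch only");
  }
}
```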
To fix this situation, parquet-mr should stop caching single compressor and decompressor instances.
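One possible shape for such a fix, shown purely as an illustration of the idea rather than the change that was ultimately made: give each thread its own cached instances (or create a fresh instance per request), so that no (de)compressor is ever visible to two threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Same placeholder Compressor idea as in the sketch above; illustration only.
class PerThreadCompressorCache {

  interface Compressor {
    byte[] compress(byte[] input);
  }

  // Each thread gets its own map, so no compressor instance is ever shared
  // between threads.
  private final ThreadLocal<Map<String, Compressor>> compressorsByThread =
      ThreadLocal.withInitial(ConcurrentHashMap::new);

  Compressor getCompressor(String codecName) {
    return compressorsByThread.get().computeIfAbsent(codecName, this::newCompressor);
  }

  private Compressor newCompressor(String codecName) {
    // Would wrap the underlying Hadoop codec; elided in this sketch.
    throw new UnsupportedOperationException("sketch only");
  }
}
```

The trade-off is that instances are no longer reused across threads, which is effectively what the sentence above asks for.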
Reporter: James Turton
Related issues:
Note: This issue was originally created as PARQUET-2126. Please see the migration documentation for further details.