benwatson528 / intellij-avro-parquet-plugin

A Tool Window plugin for IntelliJ that displays Avro and Parquet files and their schemas in JSON.
Apache License 2.0

Plugin crashes when trying to read compressed Parquet file #96

Closed — jnareb closed this issue 2 years ago

jnareb commented 2 years ago

When trying to analyze a Parquet file with the Avro and Parquet Viewer (3.0.0) plugin in PyCharm, the plugin crashes if the Parquet file was created with engine="fastparquet" and any compression, whether compression="gzip" or compression="brotli".

For example:

import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 2), columns=["A", "B"], index=arrays)
df.to_parquet('test.parquet', engine='fastparquet', compression='brotli')

I would expect the plugin to report that it cannot handle compressed Parquet files rather than crash while trying to parse them:

Stack trace:

```
Unable to process file test.parquet
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file uk.co.hadoopathome.intellij.viewer.fileformat.LocalInputFile@3287bbef
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at uk.co.hadoopathome.intellij.viewer.fileformat.ParquetFileReader.getRecords(ParquetFileReader.java:99)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:193)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:184)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
    at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:243)
    at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:96)
    at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:212)
    at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:201)
    at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:42)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1519)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
    at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:956)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
    ... 11 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.BrotliCodec
    at com.intellij.util.lang.UrlClassLoader.findClass(UrlClassLoader.java:229)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:237)
    ... 22 more
```

benwatson528 commented 2 years ago

Thanks for the code that lets me generate a sample file, much appreciated. I can get gzip files to work on 3.0.0 of the plugin (Windows 11) - the gzip codec is bundled with Hadoop and so should work. The same is true for Snappy and ZSTD.

I'm having issues with Brotli which I'm looking into. It's not bundled with Hadoop and so I have to pull it from elsewhere.

benwatson528 commented 2 years ago

Please can you uninstall the plugin and re-install from this zip to test for me? Brotli needs a native library to be installed, so I want to see if it works for you. https://drive.google.com/file/d/1Ts4oTgOUZg-trcaHP6VlzH1XcWaOGtvz/view?usp=sharing. Local zip installation is done by clicking the cog on the Plugins page (see picture). If it works for you then I'll release it properly.

[image: installing a plugin from disk via the cog on the Plugins page]

jnareb commented 2 years ago

Unfortunately the problem persists (the plugin crashes instead of loading the file, or at least reporting that it cannot load it), but the details of the stack trace have changed.

Now the problem seems to be `java.lang.UnsatisfiedLinkError: Couldn't load native library 'brotli'`.

Stack trace:

```
Unable to process file test.parquet
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file uk.co.hadoopathome.intellij.viewer.fileformat.LocalInputFile@31169618
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at uk.co.hadoopathome.intellij.viewer.fileformat.ParquetFileReader.getRecords(ParquetFileReader.java:99)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:193)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:184)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:137)
    at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:239)
    at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:96)
    at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:212)
    at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:201)
    at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:42)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1519)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
    at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:956)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
    ... 11 more
Caused by: java.lang.reflect.InvocationTargetException
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
    ... 23 more
Caused by: java.lang.UnsatisfiedLinkError: Couldn't load native library 'brotli'. [LoaderResult: os.name="Windows 10", os.arch="amd64", os.version="10.0", java.vm.name="OpenJDK 64-Bit Server VM", java.vm.version="11.0.14.1+1-b1751.46", java.vm.vendor="JetBrains s.r.o.", alreadyLoaded="null", loadedFromSystemLibraryPath="false", nativeLibName="brotli.dll", temporaryLibFile="C:\Users\jnare\AppData\Local\Temp\brotli1400927330175966707\brotli.dll", libNameWithinClasspath="/lib/win32-x86-amd64/brotli.dll", usedThisClassloader="false", usedSystemClassloader="false", java.library.path="[...]"]
    at org.meteogroup.jbrotli.libloader.BrotliLibraryLoader.loadBrotli(BrotliLibraryLoader.java:35)
    at org.apache.hadoop.io.compress.BrotliCodec.<init>(BrotliCodec.java:40)
    ... 28 more
```

benwatson528 commented 2 years ago

That's the same as I'm seeing. The plugin contains the Brotli compression class now, but it needs to find the Brotli native library on the host machine to be able to process the file. There are a few libraries out there but none have been updated in the last few years. I'm going to keep looking.
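For anyone debugging this locally: you can quickly check whether a native brotli library is visible to the loader on your machine at all. A stdlib-only probe (a sketch; the exact library name is platform-dependent, and `brotlienc`/`brotlidec` are the split encoder/decoder libraries some distributions ship instead of a single `brotli`):

```python
import ctypes.util

# Ask the platform's loader whether a shared library with this name is on the
# default search path; find_library returns the resolved filename, or None.
for name in ("brotli", "brotlienc", "brotlidec"):
    path = ctypes.util.find_library(name)
    print(f"{name}: {path or 'not found'}")
```

If all three print "not found", a `java.lang.UnsatisfiedLinkError` from any JVM codec that relies on a system-wide brotli library is expected.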

benwatson528 commented 2 years ago

It looks like this is an issue with Windows, combined with the brotli-codec repo no longer accepting PRs - see https://martin-grigorov.medium.com/javas-serviceloader-api-using-native-libraries-nok-8ad7307e6d07 and https://github.com/rdblue/brotli-codec/pull/2.

I'm not sure there's much I can do here unless Brotli support is explicitly added in Hadoop, which has been in progress for 6 years now - https://issues.apache.org/jira/browse/HADOOP-13126. You could always install IntelliJ inside a WSL instance, although I appreciate that's not ideal.

jnareb commented 2 years ago

Well, I would appreciate the plugin not crashing at least.

I wonder where the fastparquet Python module (that is used to create the Parquet file) is finding the brotli library to use...

benwatson528 commented 2 years ago

That's because https://pypi.org/project/brotlipy/ supports Windows.
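To make that concrete: wheels like brotlipy bundle the compiled Brotli code inside the installed Python package itself, so no system-wide native library is needed. A rough way to see which compiled artifacts a package ships (a sketch; `brotli` is the module name brotlipy installs under, and the suffix list is an assumption covering the common platforms):

```python
import importlib.util
import pathlib

NATIVE_SUFFIXES = {".so", ".pyd", ".dll", ".dylib"}

def bundled_binaries(module_name):
    """List compiled artifacts shipped inside an installed package, or None if absent."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None  # module not installed
    locations = spec.submodule_search_locations
    if not locations:
        # single-file module: check the file itself
        origin = pathlib.Path(spec.origin or "")
        return [origin.name] if origin.suffix in NATIVE_SUFFIXES else []
    root = pathlib.Path(list(locations)[0])
    return sorted(p.name for p in root.rglob("*") if p.suffix in NATIVE_SUFFIXES)
```

On a machine with brotlipy installed, `bundled_binaries("brotli")` should show the compiled extension that fastparquet ends up using; a pure-Python package like `json` returns an empty list.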