Closed. jnareb closed this issue 2 years ago.
Thanks for the code that lets me generate a sample file, much appreciated. I can get gzip files to work on 3.0.0 of the plugin (Windows 11): the gzip codec is bundled with Hadoop, so it should work. The same is true for Snappy and ZSTD.
I'm having issues with Brotli which I'm looking into. It's not bundled with Hadoop and so I have to pull it from elsewhere.
Could you please uninstall the plugin and reinstall it from this zip to test for me? Brotli requires a native library to be installed, so I want to see whether it works for you: https://drive.google.com/file/d/1Ts4oTgOUZg-trcaHP6VlzH1XcWaOGtvz/view?usp=sharing. Local zip installation is done by clicking the cog on the plugins page (see picture). If it works for you, I'll release it properly.
Unfortunately the problem persists (the plugin crashes instead of loading the file, or at least reporting that it cannot load it), but the details of the stack trace have changed. Now the problem appears to be `java.lang.UnsatisfiedLinkError: Couldn't load native library 'brotli'`.
```
Unable to process file test.parquet
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file uk.co.hadoopathome.intellij.viewer.fileformat.LocalInputFile@31169618
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at uk.co.hadoopathome.intellij.viewer.fileformat.ParquetFileReader.getRecords(ParquetFileReader.java:99)
at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:193)
at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:184)
at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:137)
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:239)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:96)
```
That's the same as I'm seeing. The plugin contains the Brotli compression class now, but it needs to find the Brotli native library on the host machine to be able to process the file. There are a few libraries out there but none have been updated in the last few years. I'm going to keep looking.
It looks like this is an issue with Windows, and the brotli-codec repo is no longer accepting PRs; see https://martin-grigorov.medium.com/javas-serviceloader-api-using-native-libraries-nok-8ad7307e6d07 and https://github.com/rdblue/brotli-codec/pull/2.
I'm not sure there's much I can do here unless Brotli support is explicitly added in Hadoop, which has been in progress for 6 years now - https://issues.apache.org/jira/browse/HADOOP-13126. You could always install IntelliJ inside a WSL instance, although I appreciate that's not ideal.
Well, I would appreciate the plugin not crashing at least.
I wonder where the fastparquet Python module (that is used to create the Parquet file) is finding the brotli library to use...
That's because https://pypi.org/project/brotlipy/ supports Windows.
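A quick sanity check from Python that the native Brotli bindings are present and usable (a sketch assuming the brotlipy package mentioned above, which installs as the `brotli` module; the payload is illustrative):

```python
import brotli  # provided by the brotlipy wheel (or the Brotli package)

payload = b"hello parquet " * 100

# Round-trip through the native library to confirm it loads and works.
compressed = brotli.compress(payload)
assert brotli.decompress(compressed) == payload
assert len(compressed) < len(payload)  # repetitive data compresses well
```

If this round-trip succeeds, fastparquet has a working Brotli codec available on the Python side, which is why it can write `compression="brotli"` files on Windows.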
When trying to analyze a Parquet file with the Avro and Parquet Viewer (3.0.0) plugin in PyCharm, the plugin crashes if the Parquet file was created with `engine="fastparquet"` and any compression, whether `compression="gzip"` or `compression="brotli"`. For example:
I would expect the plugin to report that it cannot handle such compressed Parquet files, rather than crash when trying to parse them:
Stack trace:
```
Unable to process file test.parquet
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file uk.co.hadoopathome.intellij.viewer.fileformat.LocalInputFile@3287bbef
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at uk.co.hadoopathome.intellij.viewer.fileformat.ParquetFileReader.getRecords(ParquetFileReader.java:99)
at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:193)
at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:184)
at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:243)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:96)
at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:212)
at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:201)
at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:42)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1519)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:956)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
... 11 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.BrotliCodec
at com.intellij.util.lang.UrlClassLoader.findClass(UrlClassLoader.java:229)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:237)
... 22 more
```