benwatson528 / intellij-avro-parquet-plugin

A Tool Window plugin for IntelliJ that displays Avro and Parquet files and their schemas in JSON.
Apache License 2.0

Read parquet file which contains array field #21

Closed sananguliyev closed 4 years ago

sananguliyev commented 4 years ago

Hi Ben,

The plugin throws an error when I try to view a Parquet file with an array field. You can download the Parquet file if you need to reproduce the case.

File: https://www.transfernow.net/grbC2h032020
IDE: GoLand 2019.3
Plugin version: 1.1.1
Error:

Unable to process file

org.apache.avro.SchemaParseException: Illegal character in: parquet-go-root
    at org.apache.avro.Schema.validateName(Schema.java:1532)
    at org.apache.avro.Schema.access$400(Schema.java:87)
    at org.apache.avro.Schema$Name.<init>(Schema.java:675)
    at org.apache.avro.Schema.createRecord(Schema.java:212)
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:270)
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:248)
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
    at uk.co.hadoopathome.intellij.viewer.fileformat.ParquetFileReader.getRecords(ParquetFileReader.java:39)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:173)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:164)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
benwatson528 commented 4 years ago

Hi Sanan,

The problem isn't with lists as I have successful unit tests and sample files containing lists. It looks like this Parquet file was made with parquet-go, and that the Parquet implementation in that project creates a file that isn't compatible with the org.apache.parquet:parquet-avro:1.11.0 library that this plugin uses to read Parquet files.

Avro has some strict rules for schema naming, and the parquet-go library seems to introduce a field called parquet-go-root that breaks these rules. Sadly I can't fix this issue, but you may want to raise an issue with the parquet-go library, or check the way that this file is created.
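To illustrate the naming rule in question: the Avro specification requires a name to start with a letter or underscore and to contain only letters, digits, and underscores after that. The snippet below is a minimal sketch of that rule as a regular expression. It is not the code in `org.apache.avro.Schema.validateName`, and the class and method names (`AvroNameCheck`, `isValidAvroName`) are hypothetical, chosen only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a standalone approximation of the Avro name rule,
// not the actual validation code from the Avro library.
class AvroNameCheck {
    // Avro spec: first character [A-Za-z_], remaining characters [A-Za-z0-9_].
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns true if the given string satisfies the Avro name rule.
    static boolean isValidAvroName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The hyphens make this name illegal under the rule above.
        System.out.println(isValidAvroName("parquet-go-root")); // false
        // The same name with underscores instead of hyphens passes.
        System.out.println(isValidAvroName("parquet_go_root")); // true
    }
}
```

Under this rule, `parquet-go-root` fails only because of the `-` characters, which matches the "Illegal character in: parquet-go-root" message in the stack trace.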

sananguliyev commented 4 years ago

Thank you very much for the quick reply and this amazing plugin.

P.S. I will check the differences but it works fine when I create a parquet file without a list. Most probably it's not because of the parquet-go-root field.

benwatson528 commented 4 years ago

No problem, thanks for giving a sample file and stack trace, that makes everything a lot easier. Let me know if you ever get anywhere with this or if I can help.

benwatson528 commented 4 years ago

Just saw your edit - maybe it's something to do with how parquet-go specifically handles lists?

sananguliyev commented 4 years ago

I still do not know why the old Parquet files I created without a list field work, but the main problem is the `-` character. The library org.apache.parquet:parquet-avro:1.11.0 treats it as an illegal character. I do not know whether this is a Parquet rule, but I will make a pull request to the Go library if it is.

Anyway, thanks again for noticing my edit and replying :)