benwatson528 / intellij-avro-parquet-plugin

A Tool Window plugin for IntelliJ that displays Avro and Parquet files and their schemas in JSON.
Apache License 2.0

Plugin crashes when loading parquet files with columns containing dots in their names #79

Closed (vladiliescu closed this issue 3 years ago)

vladiliescu commented 3 years ago

Loading a Parquet file with column names that contain dots, such as temperature.value, will crash the plugin.

benwatson528 commented 3 years ago

Does the plugin actually crash? I'd expect it to just display an error message and then return to its starting state.

This is because dots are invalid characters in Avro names, and this plugin uses org.apache.parquet:parquet-avro to read Parquet files. See https://avro.apache.org/docs/current/spec.html#names for the full naming rules (which include an explanation as to why dots aren't used). I tend to stick to underscores as separators.
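To make the naming rule concrete, here is a minimal sketch of the check described in the Avro spec: a name must start with a letter or underscore and may then contain only letters, digits, and underscores. The class AvroNameCheck and its methods isValidAvroName and sanitize are hypothetical names used for illustration; they are not part of the plugin or of the Avro library.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: this class is hypothetical and not part of the plugin or Avro.
final class AvroNameCheck {
    // Avro spec: a name starts with [A-Za-z_] and continues with [A-Za-z0-9_].
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns true if the column name would pass the Avro naming rule.
    static boolean isValidAvroName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    // Replaces every disallowed character with an underscore and
    // prefixes an underscore if the result is empty or starts with a digit.
    static String sanitize(String name) {
        String cleaned = name.replaceAll("[^A-Za-z0-9_]", "_");
        if (cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0))) {
            cleaned = "_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(isValidAvroName("temperature.value")); // prints "false"
        System.out.println(sanitize("temperature.value"));        // prints "temperature_value"
    }
}
```

Note that a rename like this can produce collisions (for example, temperature.value and temperature_value would map to the same name), so it is worth checking for duplicates before rewriting a file.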

vladiliescu commented 3 years ago

Not sure what counts as a crash, to be honest, but I do get an Error prompt plus an IDE Fatal Errors bubble with a full stack trace (included below).

Unable to process file /<edited>/data_all.parquet

org.apache.avro.SchemaParseException: Illegal character in: temperature.value
    at org.apache.avro.Schema.validateName(Schema.java:1566)
    at org.apache.avro.Schema.access$400(Schema.java:91)
    at org.apache.avro.Schema$Field.<init>(Schema.java:546)
    at org.apache.avro.Schema$Field.<init>(Schema.java:585)
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:280)
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:264)
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:134)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:185)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
    at uk.co.hadoopathome.intellij.viewer.fileformat.ParquetFileReader.getRecords(ParquetFileReader.java:99)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:193)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:184)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

Yup, another workaround is to simply remove the dots from the column names in the files. I just didn't expect this error to occur, since this seems to be a valid Parquet file.

benwatson528 commented 3 years ago

Thanks for the error details. I'm afraid I can't fix this, as it happens inside the library that the plugin uses to parse the files. I'll look into making the errors more palatable.
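As a rough sketch of what a more palatable error could look like, the snippet below walks an exception's cause chain and turns the "Illegal character in: ..." message seen in the stack trace into a short explanation. The class FriendlyError and its describe method are hypothetical, and this is an assumption about one possible approach, not the plugin's actual error handling.

```java
// Illustrative sketch only: this class is hypothetical and not the plugin's implementation.
final class FriendlyError {
    // Message prefix taken from the SchemaParseException shown in the stack trace above.
    private static final String PREFIX = "Illegal character in: ";

    // Finds the root cause and builds a short, user-facing message from it.
    static String describe(Throwable error) {
        Throwable root = error;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        String message = root.getMessage();
        if (message != null && message.startsWith(PREFIX)) {
            String column = message.substring(PREFIX.length());
            return "Column name '" + column
                + "' is not a valid Avro name, so the file cannot be displayed.";
        }
        return "Unable to process file: "
            + (message != null ? message : root.getClass().getName());
    }

    public static void main(String[] args) {
        // Simulated failure; a real one would come from the Parquet reader.
        Throwable simulated = new RuntimeException("Unable to process file",
            new IllegalArgumentException("Illegal character in: temperature.value"));
        System.out.println(describe(simulated));
    }
}
```

Matching on the message text is fragile, since the wording could change between library versions; catching the specific exception type would be sturdier where the library is on the classpath.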