benwatson528 / intellij-avro-parquet-plugin

A Tool Window plugin for IntelliJ that displays Avro and Parquet files and their schemas in JSON.
Apache License 2.0
43 stars, 9 forks

Unable to process file produced by parquet4s #46

Closed gygabyte closed 3 years ago

gygabyte commented 3 years ago

I am getting this error when processing a file produced by the Scala library parquet4s. In this case it was written using Snappy compression, but the same issue occurs for an uncompressed file.

Unable to process file  <filename>.snappy.parquet

org.apache.avro.SchemaParseException: Illegal character in: parquet4s-schema
    at org.apache.avro.Schema.validateName(Schema.java:1530)
    at org.apache.avro.Schema.access$400(Schema.java:87)
    at org.apache.avro.Schema$Name.<init>(Schema.java:673)
    at org.apache.avro.Schema.createRecord(Schema.java:212)
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:270)
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:248)
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
    at uk.co.hadoopathome.intellij.viewer.fileformat.ParquetFileReader.getRecords(ParquetFileReader.java:44)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:180)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:171)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

Thanks,

Pedro

benwatson528 commented 3 years ago

Hi Pedro,

This is because I use org.apache.parquet.avro.AvroParquetReader to read files, and hyphens (-) are invalid characters in Avro names (record and field names).

I'm afraid that this isn't something that I can change, but replacing hyphens with underscores in your column names would resolve this.
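To illustrate the rule, here is a minimal, self-contained sketch of the naming constraint as the Avro specification describes it: a name starts with a letter or underscore and then contains only letters, digits, or underscores. The class AvroNameCheck and its methods isValidName and sanitize are hypothetical names made up for this example. They are not part of Avro, parquet-avro, or this plugin, and Avro's own validation code may differ in detail.

```java
import java.util.regex.Pattern;

// Illustrative only: mirrors the naming rule from the Avro specification,
// not the actual code in org.apache.avro.Schema.
final class AvroNameCheck {
    // A name starts with a letter or underscore, followed by letters, digits, or underscores.
    private static final Pattern VALID_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isValidName(String name) {
        return name != null && VALID_NAME.matcher(name).matches();
    }

    // Hypothetical helper: swap every disallowed character for an underscore.
    static String sanitize(String name) {
        String cleaned = name.replaceAll("[^A-Za-z0-9_]", "_");
        // A leading digit (or an empty name) is also disallowed, so prefix an underscore.
        boolean needsPrefix = cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0));
        return needsPrefix ? "_" + cleaned : cleaned;
    }

    public static void main(String[] args) {
        String name = "parquet4s-schema"; // the name reported in the stack trace above
        System.out.println(isValidName(name));           // false
        System.out.println(sanitize(name));              // parquet4s_schema
        System.out.println(isValidName(sanitize(name))); // true
    }
}
```

The sanitize step only shows what "replace hyphens with underscores" looks like for a single name; the rename itself would have to happen wherever the file is written, since the reader rejects the name before any records are returned.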

Thanks,

Ben

gygabyte commented 3 years ago

Hi Ben,

The column names don't have hyphens, only underscores.

04-45.parquet.zip

benwatson528 commented 3 years ago

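The hyphen comes from the schema name parquet4s-schema (the name shown in the stack trace), not from your column names. See https://github.com/mjakubowski84/parquet4s/issues/131 - if you update to parquet4s 1.1.0 then this issue will be resolved.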

I hope this helps.

Ben

gygabyte commented 3 years ago

OK, sorry for the disturbance :) Thanks a lot.

benwatson528 commented 3 years ago

Not at all, happy to help. It's surprising how inconsistent and weird all of the Parquet and Avro libraries are. When I made this plugin I thought it would be 5 lines of code and I'd never have to worry about it again!