benwatson528 / intellij-avro-parquet-plugin

A Tool Window plugin for IntelliJ that displays Avro and Parquet files and their schemas in JSON.
Apache License 2.0
43 stars 9 forks

Unable to process file #43

Closed sananguliyev closed 3 years ago

sananguliyev commented 4 years ago

Hi,

I'm not sure whether the problem is in my Parquet file or in the plugin. The file reads fine with Presto, and a quick check shows it even works on this site: http://parquet-viewer-online.com/

A sample file and the exception are included here: 405580a3-a4b9-4602-b576-84e6b20f2c3a.parquet.zip

Thanks in advance.

Unable to process file

com.eclipsesource.json.ParseException: Expected ',' or '}' at 1:21
    at com.eclipsesource.json.JsonParser.error(JsonParser.java:490)
    at com.eclipsesource.json.JsonParser.expected(JsonParser.java:486)
    at com.eclipsesource.json.JsonParser.readObject(JsonParser.java:251)
    at com.eclipsesource.json.JsonParser.readValue(JsonParser.java:177)
    at com.eclipsesource.json.JsonParser.parse(JsonParser.java:152)
    at com.eclipsesource.json.JsonParser.parse(JsonParser.java:91)
    at com.eclipsesource.json.Json.parse(Json.java:295)
    at com.github.wnameless.json.flattener.JsonFlattener.<init>(JsonFlattener.java:155)
    at com.github.wnameless.json.flattener.JsonFlattener.flatten(JsonFlattener.java:100)
    at uk.co.hadoopathome.intellij.viewer.table.TableFormatter.<init>(TableFormatter.java:22)
    at uk.co.hadoopathome.intellij.viewer.table.JTableHandler.updateTable(JTableHandler.java:28)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:181)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:171)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
benwatson528 commented 4 years ago

Hi Sanan,

Thanks for adding a sample file, it really helps me.

I use the latest version of the parquet-avro Java library to read data. The issue is that it outputs the date field (received_at) without any surrounding quotes, which in turn breaks the JSON library I use to parse the data.

The received_at field is defined in the schema as:

"name" : "received_at",
"type" : {
    "type" : "long",
    "logicalType" : "timestamp-millis"
}

and I add support for this Logical Type with:

genericData.addLogicalTypeConversion(new TimeConversions.TimeMillisConversion());

I'm not sure why this library would output invalid JSON.
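
To illustrate the failure described above, here is a minimal sketch using only the JDK. It assumes the record text is built by appending each value's toString() as-is; `renderNaively` and `endOfDigits` are hypothetical helpers, not part of parquet-avro or the plugin. A `java.time.Instant` rendered this way starts with digits, so a JSON parser would presumably read `2020` as a number and then stop at the `-`, which is consistent with the `Expected ',' or '}'` message in the report.

```java
import java.time.Instant;

class UnquotedTimestampDemo {
    // Appends the value's toString() as-is, the way a naive record-to-text dump would
    // (an assumption for illustration, not the library's actual code).
    static String renderNaively(String field, Object value) {
        return "{\"" + field + "\": " + value + "}";
    }

    // Returns the index of the first character after the run of digits starting at 'start'.
    static int endOfDigits(String text, int start) {
        int i = start;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // Arbitrary example timestamp, not taken from the sample file.
        String text = renderNaively("received_at", Instant.ofEpochMilli(1598733360000L));
        System.out.println(text);

        int valueStart = text.indexOf(": ") + 2;
        int stop = endOfDigits(text, valueStart);
        // A JSON parser can consume the leading digits as a number,
        // but the character that follows is neither ',' nor '}'.
        System.out.println("number token: " + text.substring(valueStart, stop));
        System.out.println("next char:    " + text.charAt(stop));
    }
}
```

Wrapping the rendered timestamp in double quotes would make the same text valid JSON, with the timestamp as a string.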

I will dig into it but I suspect I'll have to raise an issue on parquet-avro. In the meantime I will look into having the plugin not load the table pane if invalid JSON is found.
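
A rough sketch of that fallback idea, using only the JDK: try to build the table, and if the parser rejects the text, report "no table" so the caller can still show the raw view. It assumes the parser signals bad input with an unchecked exception; `tryBuildTable` and `toyFieldCount` are hypothetical names, and `toyFieldCount` is only a toy stand-in for the real flattening step.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Pattern;

class TableFallbackDemo {
    // Matches a value that starts with digits followed by '-', e.g. an unquoted timestamp.
    private static final Pattern UNQUOTED_DATE = Pattern.compile(": \\d+-");

    // Runs the table builder; any unchecked failure (such as a parse error) yields empty.
    static <T> Optional<T> tryBuildTable(String recordsJson, Function<String, T> builder) {
        try {
            return Optional.of(builder.apply(recordsJson));
        } catch (RuntimeException e) {
            return Optional.empty();
        }
    }

    // Toy stand-in for a strict JSON step: rejects unquoted dates, otherwise counts fields.
    static Integer toyFieldCount(String json) {
        if (UNQUOTED_DATE.matcher(json).find()) {
            throw new IllegalArgumentException("Expected ',' or '}'");
        }
        int count = 0;
        for (int i = json.indexOf("\": "); i >= 0; i = json.indexOf("\": ", i + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String bad = "{\"received_at\": 2020-08-29T20:36:00Z}";
        String good = "{\"received_at\": \"2020-08-29T20:36:00Z\"}";
        // An empty result means: skip the table pane, keep the raw data and schema views.
        System.out.println(tryBuildTable(bad, TableFallbackDemo::toyFieldCount).isPresent());
        System.out.println(tryBuildTable(good, TableFallbackDemo::toyFieldCount).isPresent());
    }
}
```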

benwatson528 commented 4 years ago

It seems like the answer is for me to decode data into JSON rather than using toString(). https://issues.apache.org/jira/browse/AVRO-2343. You'd expect this to be trivial but nothing is ever easy in Avro and Parquet. I'll take a decent stab at it this weekend.
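
The general idea of encoding each value according to its type, rather than relying on toString(), can be sketched roughly as follows. This uses a plain Map as a stand-in for a record and is not the Avro encoder API; `TypeAwareJsonDemo` and its methods are hypothetical names, and the escaping covers only quotes and backslashes.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class TypeAwareJsonDemo {
    // Wraps a string in double quotes, escaping backslashes and quotes only (sketch-level escaping).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Numbers and booleans are written bare; everything else, including timestamps, is quoted.
    // Note: NaN and infinite doubles are not handled here.
    static String encodeValue(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        return quote(value.toString());
    }

    // Encodes a flat record field by field, preserving insertion order.
    static String encode(Map<String, Object> record) {
        StringJoiner joiner = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            joiner.add(quote(entry.getKey()) + ": " + encodeValue(entry.getValue()));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 7L);
        record.put("received_at", Instant.ofEpochMilli(1598733360000L));
        System.out.println(encode(record));
    }
}
```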

sananguliyev commented 4 years ago

Thanks for your investigation. I could still continue with int64, but I'm not sure what the side effects will be.

benwatson528 commented 4 years ago

I've asked Stack Overflow - https://stackoverflow.com/questions/63655421/writing-parquet-avro-genericrecord-to-json-while-maintaining-logicaltypes and I'll ask the Parquet mailing list if that doesn't get anywhere.

benwatson528 commented 4 years ago

I've uploaded a new version of the plugin that will at least let you view the raw data and schemas for affected files. The update will be available in the IntelliJ Marketplace within a couple of business days, or if you're very eager you can download it here.

sananguliyev commented 3 years ago

Thank you very much @benwatson528