benwatson528 / intellij-avro-parquet-plugin

A Tool Window plugin for IntelliJ that displays Avro and Parquet files and their schemas in JSON.
Apache License 2.0

plugin crashes when trying to open a .parquet file #42

Closed vdemarcus closed 4 years ago

vdemarcus commented 4 years ago

Thanks for your work on this plugin! It is really useful when working with Parquet files.

Unfortunately, when trying to open a specific Parquet file with the plugin in PyCharm, I get the following stack trace:

2020-08-11 16:51:48,112 [7196977] ERROR - ij.viewer.FileViewerToolWindow - Unable to process file
com.eclipsesource.json.ParseException: Expected ',' or '}' at 1:149
    at com.eclipsesource.json.JsonParser.error(JsonParser.java:490)
    at com.eclipsesource.json.JsonParser.expected(JsonParser.java:486)
    at com.eclipsesource.json.JsonParser.readObject(JsonParser.java:251)
    at com.eclipsesource.json.JsonParser.readValue(JsonParser.java:177)
    at com.eclipsesource.json.JsonParser.parse(JsonParser.java:152)
    at com.eclipsesource.json.JsonParser.parse(JsonParser.java:91)
    at com.eclipsesource.json.Json.parse(Json.java:295)
    at com.github.wnameless.json.flattener.JsonFlattener.<init>(JsonFlattener.java:155)
    at com.github.wnameless.json.flattener.JsonFlattener.flatten(JsonFlattener.java:100)
    at uk.co.hadoopathome.intellij.viewer.table.TableFormatter.<init>(TableFormatter.java:22)
    at uk.co.hadoopathome.intellij.viewer.table.JTableHandler.updateTable(JTableHandler.java:28)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:181)
    at uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:171)
    at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

Please see the attached Parquet file, which I am unable to open: data.zip

Most likely there is something wrong with the file, but I am able to open it in Python with the following code:

from pyarrow import parquet

res = parquet.read_table("data.parquet").to_pandas()
print(res)

Thanks in advance for your help.

benwatson528 commented 4 years ago

Hi, thanks for the detailed ticket, it always makes it much easier for me to investigate issues. I'm currently in the middle of moving house so I'll look at it this weekend.

Ben

benwatson528 commented 4 years ago

I have had some time to start looking at this - the records are successfully read by the Parquet reader, but the JSON flattening library that I use doesn't like the dates, which contain characters but aren't surrounded by quotes:

[Image: screenshot of the affected date values]

I'm looking into what I can do about this.
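To illustrate the kind of failure described above, here is a minimal sketch using Python's standard `json` module rather than the Java libraries in the stack trace. The record, its field names, and its values are made up for illustration; the actual contents of data.parquet are not reproduced here. It shows a strict JSON parser rejecting an unquoted date and accepting the same value once it is quoted.

```python
import json

# Hypothetical records: field names and values are assumptions for illustration.
# In the first one the date is not wrapped in quotes, so it is not valid JSON.
unquoted = '{"id": 1, "created": 2020-08-11}'
quoted = '{"id": 1, "created": "2020-08-11"}'


def try_parse(text):
    """Return (parsed_value, None) on success or (None, error) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        return None, err


value, error = try_parse(unquoted)
if error is not None:
    # lineno and colno give a line:column position, comparable to the
    # "1:149" reported in the stack trace.
    print(f"parse failed at {error.lineno}:{error.colno}: {error.msg}")

value_ok, error_ok = try_parse(quoted)
print(value_ok)
```

The parser reads `2020` as a complete number and then expects a comma or closing brace, which is the same class of complaint as the `Expected ',' or '}'` message above.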

benwatson528 commented 4 years ago

I have raised a question on Stack Overflow, but I would be interested to know how the date fields in this data were created in the first place. I have existing tests containing dates that are successfully parsed.

benwatson528 commented 4 years ago

I'm going to close this as I can't debug further without any information about how the data was generated.

I've confirmed that the invalid JSON is given to me by the parquet-avro library, so your best bet will be raising an issue there.
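For comparison, here is a rough sketch of what a serializer needs to do for date values to survive a strict JSON parser. It uses Python's standard library only and is not how parquet-avro or this plugin works; the record and the `to_json_safe` helper are hypothetical names introduced for this example.

```python
import json
from datetime import date, datetime


def to_json_safe(value):
    """Fallback for json.dumps: render dates as quoted ISO 8601 strings."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Hypothetical record standing in for one row read from a Parquet file.
record = {"id": 1, "created": date(2020, 8, 11)}

# json.dumps calls to_json_safe for any object it cannot encode itself.
text = json.dumps(record, default=to_json_safe)
print(text)

# The quoted date now parses without error.
round_tripped = json.loads(text)
```

Because the date is emitted as a string, the output can be parsed again, whereas a bare `2020-08-11` cannot.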

Let me know if I can help further.