[CPP] Unable to read files written by parquet-cpp from parquet-tools

apache / parquet-java

[CPP] Unable to read files written by parquet-cpp from parquet-tools #2033

Open asfimport opened 7 years ago

asfimport commented 7 years ago

I could not read files written by parquet-cpp from parquet-tools and Hive. Setting field ids in the schema metadata seems to be the problem. We should make setting the field_id optional.

Reporter: Deepak Majeti / @majetideepak

Original Issue Attachments:

parquet_cpp_example.parquet

Note: This issue was originally created as PARQUET-838. Please see the migration documentation for further details.

asfimport commented 7 years ago

Deepak Majeti / @majetideepak: @wesm, @xhochy can you verify this issue on your side? Thanks!

asfimport commented 7 years ago

Wes McKinney / @wesm: What versions of parquet-tools and Hive? I'm looking into it.

asfimport commented 7 years ago

Deepak Majeti / @majetideepak: I tested with parquet-tools-1.9.0 and Hive 1.2

asfimport commented 7 years ago

Wes McKinney / @wesm: I'm able to read files written by parquet-cpp with the cat command in parquet-tools 1.5.0 and 1.9.0. Any way to reproduce?

asfimport commented 7 years ago

Deepak Majeti / @majetideepak: Can you cat this file?

asfimport commented 7 years ago

Deepak Majeti / @majetideepak: I get the following error. It goes away if I do not set the field_id in the SchemaElement.

$ java -jar parquet-tools-1.9.0.jar cat parquet_cpp_example.parquet

Could not read footer: java.lang.RuntimeException: shaded.parquet.org.codehaus.jackson.map.JsonMappingException: No serializer found for class org.apache.parquet.schema.Type$ID and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS) ) (through reference chain: org.apache.parquet.hadoop.metadata.ParquetMetadata["fileMetaData"]->org.apache.parquet.hadoop.metadata.FileMetaData["schema"]->org.apache.parquet.schema.MessageType["fields"]->java.util.ArrayList[0]->org.apache.parquet.schema.PrimitiveType["id"])

asfimport commented 7 years ago

Deepak Majeti / @majetideepak: I don't think field IDs are implemented in parquet-mr either. A grep of the codebase does NOT show them being set.

asfimport commented 7 years ago

Wes McKinney / @wesm: I don't get an error in my environment, but the output is nothing useful:

$ java -jar target/parquet-tools-1.9.0.jar cat parquet_cpp_example.parquet 
org/apache/hadoop/fs/Path

I'm OK with nixing the field_id field in parquet-cpp to make this go away. Do you want to do that, or I can quickly write a patch, too?

asfimport commented 7 years ago

Deepak Majeti / @majetideepak: Nixing sounds good. If you are at it, please write a patch. Thanks!

asfimport commented 7 years ago

Wes McKinney / @wesm: I included this in my patch for PARQUET-842: https://github.com/apache/parquet-cpp/pull/226

asfimport commented 7 years ago

Uwe Korn / @xhochy: This problem came up recently on the mailing list, and @julienledem suggested:

This looks like a bug in parquet-tools when printing the schema to the console. Possibly adding a @JsonValue annotation to intValue() [1] in Type would fix it. [1] https://github.com/apache/parquet-mr/blob/89e0607cf6470dda1a6a47b46abf37468df4e50f/parquet-column/src/main/java/org/apache/parquet/schema/Type.java#L48

That suggests this is really a parquet-tools problem rather than a parquet-cpp one. Still, the Impala problem with these fields persists.
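
For readers unfamiliar with the failure mode, here is a minimal, dependency-free sketch of why a bean-style serializer can choke on a class shaped like `Type.ID`, and why marking one method as "the value" would help. It is not Jackson and not parquet-mr code: `ValueMethod`, `PlainId`, `AnnotatedId`, `SchemaField`, and the toy `serialize` method are all hypothetical stand-ins, and the assumption that an unset field ID shows up as `null` is an inference from the report that the error goes away when `field_id` is not set. The actual suggestion above is to put Jackson's `@JsonValue` on `intValue()`.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

final class FieldIdSketch {
    // Toy bean serializer (NOT Jackson): a method marked @ValueMethod wins,
    // otherwise zero-argument getXxx() methods become JSON properties.
    // Strings are not escaped; this is only meant for the simple demo values.
    static String serialize(Object o) {
        if (o == null) return "null";
        if (o instanceof Number) return o.toString();
        if (o instanceof String) return "\"" + o + "\"";
        for (Method m : o.getClass().getDeclaredMethods()) {
            if (m.isAnnotationPresent(ValueMethod.class)) return serialize(call(m, o));
        }
        Map<String, String> props = new TreeMap<>(); // sorted keys, stable output
        for (Method m : o.getClass().getDeclaredMethods()) {
            String n = m.getName();
            if (n.startsWith("get") && n.length() > 3 && m.getParameterCount() == 0) {
                String key = Character.toLowerCase(n.charAt(3)) + n.substring(4);
                props.put(key, serialize(call(m, o)));
            }
        }
        if (props.isEmpty()) {
            // Same situation as "no properties discovered to create BeanSerializer".
            throw new IllegalStateException("No serializer found for " + o.getClass().getName());
        }
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 1) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":").append(e.getValue());
        }
        return sb.append('}').toString();
    }

    private static Object call(Method m, Object target) {
        try {
            return m.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // No field ID set: the id property is null, so nothing needs a serializer.
        System.out.println(serialize(new SchemaField("a", null)));
        // Field ID set, accessor marked as the value: serialized as a plain number.
        System.out.println(serialize(new SchemaField("a", new AnnotatedId(7))));
        // Field ID set, accessor unmarked: no getters found, so serialization fails.
        try {
            serialize(new SchemaField("a", new PlainId(7)));
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}

// Hypothetical stand-in for Jackson's @JsonValue; not the real annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ValueMethod {}

// Shaped like an ID holder: one accessor, but not named like a bean getter.
final class PlainId {
    private final int id;
    PlainId(int id) { this.id = id; }
    public int intValue() { return id; }
}

// Same shape, with the accessor marked as the serialized value.
final class AnnotatedId {
    private final int id;
    AnnotatedId(int id) { this.id = id; }
    @ValueMethod
    public int intValue() { return id; }
}

// Hypothetical schema field; getId() returns null when no field ID was written.
final class SchemaField {
    private final String name;
    private final Object id;
    SchemaField(String name, Object id) { this.name = name; this.id = id; }
    public String getName() { return name; }
    public Object getId() { return id; }
}
```

In this sketch the file without field IDs and the file with an annotated ID class both serialize, while the unannotated ID class triggers the failure. That matches the observation earlier in the thread that the error disappears when `field_id` is left unset, and it is consistent with treating this as a printing problem in parquet-tools rather than a problem with the file itself.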

asfimport commented 7 years ago

Wes McKinney / @wesm: I'll create a separate JIRA about debugging the Impala issue