apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.55k stars 1.39k forks source link

Handle INT96 more gracefully in parquet-avro #2496

Closed asfimport closed 4 years ago

asfimport commented 4 years ago

The parquet-avro library does not support INT96 columns (PARQUET-323), and any attempt to process a file containing such a column results in:


throw new IllegalArgumentException("INT96 not implemented and is deprecated");

INT96 is still used in many legacy datasets, and so it would be useful to be able to process Parquet files containing these records, even if the INT96 values themselves aren't rendered.

The same functionality has already been re-added into parquet-pig (PARQUET-1133).

Reporter: Ben Watson

Note: This issue was originally created as PARQUET-1870. Please see the migration documentation for further details.

asfimport commented 4 years ago

Ben Watson: For backstory, I maintain an Avro and Parquet Viewer IntelliJ plugin that allows Avro and Parquet files to be displayed visually, and a repeated complaint is that it's not possible to view files containing INT96 columns.

I have been able to solve this by replacing AvroSchemaConverter#308:


public Schema convertINT96(PrimitiveTypeName primitiveTypeName) {
  throw new IllegalArgumentException("INT96 not implemented and is deprecated");
}

with


public Schema convertINT96(PrimitiveTypeName primitiveTypeName) {
  return Schema.create(Schema.Type.BYTES);
}

This results in gibberish being printed, but at least the files are displayed.

I'm happy to raise a PR for this, but first want to check that this is an acceptable solution and that no one else has any better ideas.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: We've already had a couple of discussions related to INT96. The main problem is that INT96 is not only deprecated but never specified properly. We don't want to encourage anyone to use it so we don't want to implement any additional support for this type.

Meanwhile, I understand your problem. One way to workaround this issue is to not use parquet-avro but another binding (e.g. parquet-thrift, parquet-protobuf etc.) that handles INT96. Though, I am not sure which one would handle it properly. If you want to fix this in parquet-avro, it would be better to hide this new behavior behind a flag. (You may check the related parts of this PR about introducing a new config.)

asfimport commented 4 years ago

Ben Watson: Thanks for the useful information Gabor. I'm not sure I want to add as much code as there is in that UUID PR - that's quite a lot of code for an unsupported, deprecated field.

I will take a look at other libraries, but I like parquet-avro because it lets me share code across my Avro and Parquet implementations, and deals with the JSON conversion nicely. I've had trouble finding an implementation that doesn't rely heavily on Hadoop Path objects or Spark, both of which cause a lot of problems within my plugin.

I have enabled INT96 support in my plugin by compiling parquet-avro myself with the one change I mentioned above, and I'm happy to stick to that solution going forwards. Thanks again.