Closed asfimport closed 4 months ago
Deepak Majeti / @majetideepak: The Parquet format is being extended with many new features such as indexes, correct statistics, etc. Having compatibility across various writers (parquet-mr, parquet-cpp, Impala, etc.) is very important for the community to trust/depend on the Parquet file format. We should discuss this Jira in our next sync and start working towards improving the compatibility.
Related discussion is here: https://github.com/apache/parquet-format/issues/441
Also, it seems the parquet-testing repository already contains example Parquet files written with a variety of features, so maybe that is enough to close this issue.
cc @julienledem and @wgtmac
Agreed, the purpose of the parquet-testing repo is exactly interoperability testing.
I think it would be nice to have a reference from the README to the parquet-testing repository. I've created a PR here: https://github.com/apache/parquet-format/pull/442
We should build a corpus of Parquet files that client implementations can use for validation. In addition to the input files, it should contain a description or a verbatim copy of the data in each file, so that readers can validate their results.
As a starting point we can look at the old parquet-compatibility repo and Impala's test data, in particular the Parquet files it contains.
Impala also has a tool to generate Parquet files from JSON files: https://github.com/apache/incubator-impala/blob/master/testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java
Arrow has a similar tool: https://github.com/apache/arrow/blob/master/integration/integration_test.py
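To make the corpus idea concrete, here is a minimal sketch of how a client implementation might check itself against such a corpus. It assumes a layout that is not specified anywhere in this issue: each `*.parquet` file has a sibling `*.json` file holding a verbatim copy of its rows as a JSON array of objects. The names `validate_corpus` and `read_rows` are hypothetical; `read_rows` stands in for whatever reader is under test.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def validate_corpus(
    corpus_dir: str,
    read_rows: Callable[[Path], Iterable[dict]],
) -> List[Tuple[str, str]]:
    """Compare each Parquet file's decoded rows with its expected-data copy.

    Assumed layout (hypothetical): foo.parquet sits next to foo.json, where
    foo.json is a JSON array of row objects. `read_rows` is supplied by the
    implementation under test and returns the rows it decoded from a file.
    Returns a list of (file name, problem) pairs; an empty list means every
    file matched.
    """
    failures = []
    for parquet_path in sorted(Path(corpus_dir).glob("*.parquet")):
        expected_path = parquet_path.with_suffix(".json")
        if not expected_path.exists():
            failures.append((parquet_path.name, "missing expected-data file"))
            continue
        expected = json.loads(expected_path.read_text(encoding="utf-8"))
        actual = list(read_rows(parquet_path))
        if actual != expected:
            failures.append(
                (parquet_path.name, "decoded rows differ from expected data")
            )
    return failures
```

Keeping the reader as a callable means the same corpus and the same check could be reused by any implementation that can expose its decoded rows as plain dictionaries. A pyarrow-based reader, for example, could probably pass `lambda p: pq.read_table(str(p)).to_pylist()` with `import pyarrow.parquet as pq`.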
Reporter: Lars Volker / @lekv
Note: This issue was originally created as PARQUET-1118. Please see the migration documentation for further details.