apache / parquet-format

Build a corpus of Parquet files that client implementations can use for validation #273

Closed (asfimport closed this issue 4 months ago)

asfimport commented 7 years ago

We should build a corpus of Parquet files that client implementations can use for validation. In addition to the input files, it should contain a description or a verbatim copy of the data in each file, so that readers can validate their results.
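A minimal sketch of how such a corpus could be checked, assuming each Parquet file sits next to a JSON file holding the expected rows. The `.expected.json` naming, the `validate_corpus` helper, and the `read_rows` callback are hypothetical names chosen for illustration, not an agreed layout or an existing API. A client implementation would plug its own reader in as `read_rows`.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, object]


def validate_corpus(
    corpus_dir: Path, read_rows: Callable[[Path], List[Row]]
) -> Dict[str, str]:
    """Compare a reader's output with the expected rows stored beside each file.

    Assumed (hypothetical) layout: foo.parquet is described by
    foo.expected.json, a JSON array of row objects.
    """
    results: Dict[str, str] = {}
    for data_file in sorted(corpus_dir.rglob("*.parquet")):
        key = data_file.relative_to(corpus_dir).as_posix()
        expected_file = data_file.with_name(data_file.stem + ".expected.json")
        if not expected_file.exists():
            results[key] = "missing expected data"
            continue
        expected = json.loads(expected_file.read_text())
        try:
            # read_rows is supplied by the client implementation under test.
            actual = read_rows(data_file)
        except Exception as exc:
            results[key] = f"reader error: {exc}"
            continue
        results[key] = "ok" if actual == expected else "mismatch"
    return results
```

Storing the expected rows as plain JSON keeps the check independent of any one Parquet library, at the cost of needing a convention for types that JSON cannot express directly (decimals, timestamps, binary).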

As a starting point we can look at the old parquet-compatibility repo and Impala's test data, in particular the Parquet files it contains.

$ find testdata | grep -i parq
testdata/workloads/tpch/queries/insert_parquet.test
testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
testdata/workloads/functional-query/queries/QueryTest/parquet-zero-rows.test
testdata/workloads/functional-query/queries/QueryTest/insert_parquet_invalid_codec.test
testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test
testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
testdata/workloads/functional-query/queries/QueryTest/parquet-abort-on-error.test
testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet.test
testdata/workloads/functional-query/queries/QueryTest/parquet.test
testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet-nested.test
testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test
testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
testdata/max_nesting_depth/int_map/file.parq
testdata/max_nesting_depth/struct/file.parq
testdata/max_nesting_depth/struct_map/file.parq
testdata/max_nesting_depth/int_array/file.parq
testdata/max_nesting_depth/struct_array/file.parq
testdata/parquet_nested_types_encodings
testdata/parquet_nested_types_encodings/README
testdata/parquet_nested_types_encodings/UnannotatedListOfGroups.parquet
testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet
testdata/parquet_nested_types_encodings/UnannotatedListOfPrimitives.parquet
testdata/parquet_nested_types_encodings/AmbiguousList.json
testdata/parquet_nested_types_encodings/AvroPrimitiveInList.parquet
testdata/parquet_nested_types_encodings/ThriftPrimitiveInList.parquet
testdata/parquet_nested_types_encodings/bad-avro.parquet
testdata/parquet_nested_types_encodings/AmbiguousList.avsc
testdata/parquet_nested_types_encodings/SingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/ThriftSingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/AvroSingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet
testdata/parquet_nested_types_encodings/bad-thrift.parquet
testdata/ComplexTypesTbl/nonnullable.parq
testdata/ComplexTypesTbl/nullable.parq
testdata/bad_parquet_data
testdata/bad_parquet_data/README
testdata/bad_parquet_data/dict-encoded-out-of-bounds.parq
testdata/bad_parquet_data/plain-encoded-negative-len.parq
testdata/bad_parquet_data/plain-encoded-out-of-bounds.parq
testdata/bad_parquet_data/dict-encoded-negative-len.parq
testdata/parquet_schema_resolution
testdata/parquet_schema_resolution/README
testdata/parquet_schema_resolution/switched_map.json
testdata/parquet_schema_resolution/switched_map.avsc
testdata/parquet_schema_resolution/switched_map.parq
testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java
testdata/LineItemMultiBlock/lineitem_one_row_group.parquet
testdata/LineItemMultiBlock/lineitem_sixblocks.parquet
testdata/data/zero_rows_zero_row_groups.parquet
testdata/data/chars-formats.parquet
testdata/data/multiple_rowgroups.parquet
testdata/data/bad_parquet_data.parquet
testdata/data/bad_metadata_len.parquet
testdata/data/huge_num_rows.parquet
testdata/data/bad_compressed_size.parquet
testdata/data/zero_rows_one_row_group.parquet
testdata/data/bad_rle_repeat_count.parquet
testdata/data/bad_column_metadata.parquet
testdata/data/alltypesagg_hive_13_1.parquet
testdata/data/bad_dict_page_offset.parquet
testdata/data/bad_rle_literal_count.parquet
testdata/data/bad_magic_number.parquet
testdata/data/repeated_values.parquet
testdata/data/schemas/malformed_decimal_tiny.parquet
testdata/data/schemas/alltypestiny.parquet
testdata/data/schemas/nested/modern_nested.parquet
testdata/data/schemas/nested/legacy_nested.parquet
testdata/data/schemas/enum/enum.parquet
testdata/data/schemas/decimal.parquet
testdata/data/schemas/zipcode_incomes.parquet
testdata/data/repeated_root_schema.parquet
testdata/data/long_page_header.parquet
testdata/data/deprecated_statistics.parquet
testdata/data/kite_required_fields.parquet
testdata/data/out_of_range_timestamp.parquet
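The listing above mixes `.test` query files, READMEs, and directories with the actual data files. As a rough sketch, the data files could be separated out by extension and given a quick sanity check on the `PAR1` magic bytes that open and close a Parquet file with an unencrypted footer. The helper names `parquet_data_files` and `has_parquet_magic` are made up for this example.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Extensions used for data files in the listing above.
DATA_SUFFIXES = {".parquet", ".parq"}
PARQUET_MAGIC = b"PAR1"


def parquet_data_files(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose extension marks them as Parquet data files."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in DATA_SUFFIXES]


def has_parquet_magic(data: bytes) -> bool:
    """Loose check: the bytes start and end with the 4-byte magic.

    Files with an encrypted footer use a different magic and fail this check,
    as would a deliberately broken file such as bad_magic_number.parquet.
    """
    return (
        len(data) >= 2 * len(PARQUET_MAGIC)
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


listing = [
    "testdata/workloads/functional-query/queries/QueryTest/parquet.test",
    "testdata/parquet_nested_types_encodings/README",
    "testdata/ComplexTypesTbl/nullable.parq",
    "testdata/data/bad_magic_number.parquet",
]
for path in parquet_data_files(listing):
    print(path)
```

Passing the magic check says nothing about whether the rest of the file is valid, so it is only useful for sorting a corpus into "expected to be readable" and "expected to fail early".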

Impala also has a tool to generate Parquet files from JSON files: https://github.com/apache/incubator-impala/blob/master/testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java

Arrow has a similar tool: https://github.com/apache/arrow/blob/master/integration/integration_test.py

Reporter: Lars Volker / @lekv

Note: This issue was originally created as PARQUET-1118. Please see the migration documentation for further details.

asfimport commented 5 years ago

Deepak Majeti / @majetideepak: The Parquet format is being extended with many new features such as indexes, correct statistics, etc. Having compatibility across various writers (parquet-mr, parquet-cpp, Impala, etc.) is very important for the community to trust/depend on the Parquet file format. We should discuss this Jira in our next sync and start working towards improving the compatibility.

alamb commented 4 months ago

Related discussion is here: https://github.com/apache/parquet-format/issues/441

Also, it seems that the parquet-testing repository already contains example Parquet files written with a variety of features, so maybe that is enough to close this issue.

cc @julienledem and @wgtmac

wgtmac commented 4 months ago

Agreed, the purpose of the parquet-testing repo is exactly interoperability testing.

Fokko commented 4 months ago

I think it would be nice to have a reference from the README to the parquet-testing repository. I've created a PR here: https://github.com/apache/parquet-format/pull/442