
Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

Parquet compatibility / integration testing #441

Open alamb opened 2 days ago

alamb commented 2 days ago

See related mailing list discussion: https://lists.apache.org/thread/kd3k4q691lp5c4q3r767zb8jltrm9z33

Background

In https://github.com/apache/parquet-site/pull/34 we are adding an "implementation status" matrix for different parquet implementations, to help people understand the feature sets supported by the various parquet implementations across the ecosystem.

As we work to fill out this matrix for various parquet implementations, the question arises: what, precisely, does it mean to "support a particular Parquet feature"?

One way to provide a precise definition is to provide a way to automate the check for each feature.

Prior Art

parquet-testing

The parquet-testing repository contains example parquet files written with various different features.

The README.md file contains brief descriptions of the contents of these files, but there is no machine readable description of the data contained within those files.
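As a sketch of what such a machine-readable description could look like (this is purely hypothetical; parquet-testing defines no such schema today, only the textual README.md), a per-file metadata entry might be a small JSON document an implementation could load and check its own decoded view against:

```python
import json

# Hypothetical machine-readable description of one test file's contents.
# Field names and values are illustrative only, not an existing
# parquet-testing convention.
metadata = {
    "file": "alltypes_plain.parquet",
    "features": ["PLAIN encoding", "no compression"],
    "num_rows": 8,
    "columns": [
        {"name": "id", "type": "INT32"},
        {"name": "bool_col", "type": "BOOLEAN"},
    ],
}

# A reader implementation could load this description and assert that
# the row count, column names, and types it decodes all match.
parsed = json.loads(json.dumps(metadata))
assert parsed["num_rows"] == 8
```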

Apache Arrow

Apache Arrow has a similar feature chart: https://arrow.apache.org/docs/status.html


Part of maintaining this chart is a comprehensive integration suite which programmatically checks whether data created by one implementation of Arrow can be read by the others.

The suite is implemented using a single integration tool called archery, maintained by the Arrow project in the apache/arrow-testing github repo. Each implementation of Arrow implements a driver program that accepts inputs / generates outputs in a known format, and archery then orchestrates running those driver programs.

There are also a number of known "gold files" here, which contain JSON representations of the data stored in gold master Arrow files.

Note that Arrow is somewhat different from Parquet in that most of the Arrow implementations are maintained by the Apache Arrow project itself. In comparison, I believe most of the Parquet implementations are maintained by projects / teams other than Apache Parquet.

Options

Here are some ideas of what a Parquet compatibility test might look like:

Option 1: integration harness similar to archery

In this case, an integration harness similar to archery would handle automatically verifying the different implementations. This harness could orchestrate workflows such as reading gold parquet files, as well as writing parquet data with one implementation, reading it with another, and verifying their compatibility.

Pros:

Cons:

Option 2: Add golden files to parquet-testing

In this option, we would:

  1. Add golden files to the parquet-testing repo (e.g. JSON formatted) corresponding to each existing .parquet file
  2. Document the format of the golden files
  3. To test supporting a feature on write, an implementation could verify that the .parquet file it produces, when read back, yields the same .golden file again

Each implementation could then check compatibility by creating its own driver program.

This approach has a (very) rough prototype here: https://github.com/apache/arrow-rs/pull/5956

parquet-testing
|- data
|  |- README.md   # textual description of the contents of each file
|  |- all_types.plain.parquet
|  |- all_types.plain.parquet.json # JSON file with expected contents of all_types.plain.parquet
...
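A minimal per-implementation driver for this option could decode a .parquet file and diff the result against the checked-in golden file. A sketch in Python, with the parquet-reading step stubbed out (a real driver would call the implementation under test there; the file and golden names are just the illustrative ones from the tree above):

```python
import json

def read_parquet_as_rows(path):
    # Placeholder: a real driver would decode `path` with the parquet
    # implementation under test and return its rows as plain values.
    return [{"id": 1, "flag": True}, {"id": 2, "flag": False}]

def check_against_golden(parquet_path, golden_json):
    """Return True if the decoded rows match the golden contents."""
    actual = read_parquet_as_rows(parquet_path)
    expected = json.loads(golden_json)
    return actual == expected

# Contents a hypothetical all_types.plain.parquet.json might hold.
golden_json = '[{"id": 1, "flag": true}, {"id": 2, "flag": false}]'
ok = check_against_golden("all_types.plain.parquet", golden_json)
```

The write-path check in step 3 above is the same comparison run in reverse: write a file, read it back, and verify the golden file is reproduced.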

Pros:

Cons

Option 3: Add golden files and files written by other implementations to parquet-testing

@pitrou suggested what I think is an extension of option 2 on https://github.com/apache/arrow-rs/pull/5956#issuecomment-2191142596

My alternative proposal would be a directory tree with pre-generated integration files, something like:

parquet-integration
|- all_types.plain.uncompressed
|  |- README.md   # textual description of this integration scenario
|  |- parquet-java_1.0.pq  # file generated by parquet-java 1.0 for said scenario
|  |- parquet-java_2.5.pq  # file generated by parquet-java 2.5
|  |- parquet-cpp_16.0.1.pq  # file generated by parquet-cpp 16.0.1
|- all_types.dictionary.uncompressed
| ...

... which allows us to have many different scenarios without the scaling problem of having all implementations run within the same CI job.

The textual README.md could of course be supplemented by a machine-readable JSON format if there's a reasonable way to cover all expected variations with it.

I think this mechanism would allow for cross-implementation integration testing without requiring a unified harness
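A consumer of this layout could simply walk the directory tree, enumerate each scenario and the writer files it contains, and then attempt to read every file with its own implementation. A small sketch of the enumeration step (the tree built here is a stand-in matching the proposed layout, not real data):

```python
import os
import tempfile

def list_scenarios(root):
    """Map each scenario directory to the writer files it contains.

    Follows the proposed layout: one directory per scenario, each
    holding files named like `<implementation>_<version>.pq`.
    """
    scenarios = {}
    for scenario in sorted(os.listdir(root)):
        path = os.path.join(root, scenario)
        if os.path.isdir(path):
            scenarios[scenario] = sorted(
                f for f in os.listdir(path) if f.endswith(".pq")
            )
    return scenarios

# Build a toy tree matching the proposed layout.
root = tempfile.mkdtemp()
scen = os.path.join(root, "all_types.plain.uncompressed")
os.makedirs(scen)
for name in ["parquet-java_1.0.pq", "parquet-cpp_16.0.1.pq"]:
    open(os.path.join(scen, name), "w").close()

found = list_scenarios(root)
```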

pitrou commented 2 days ago

Note that Arrow is somewhat different than Parquet in that most of the Arrow implementations are maintained by the Apache Arrow project itself. In comparison, I believe most of the Parquet implementations are maintained by projects / teams other than Apache Parquet.

Right, and also there might be some closed-source Parquet implementations significant enough to participate in integration testing? Cueing in more informed people @julienledem @wesm @gszadovszky .

There's also at least one GPU implementation in cuDF, which would complicate integration CI if all implementations had to be run together: https://docs.rapids.ai/api/cudf/stable/user_guide/10min/#reading-writing-parquet-files

alkis commented 2 days ago

Option 4: carpenter

The data in parquet-testing is a collection of PLAIN_ENCODED, uncompressed parquet files, plus the carpenter binary.

carpenter is responsible for invoking drivers from different implementations to test them against each other. carpenter also has the capability of diffing two parquet files for equivalence, but only when they are PLAIN_ENCODED and uncompressed.

Drivers read and write parquet files.

Reading is specified by a subset of columns to read and, optionally, push-down filters. This allows column/row slicing of the original data. Writing is specified by a potentially explicit selection of codecs per column.

Drivers are filters: parquet in, parquet out, where the output can contain less than or equal data compared to the input, and can potentially be encoded differently.

carpenter's job is to generate slices of the input parquet files, generate "interesting" parquet files as output (encoded with different encodings), and read those with all other drivers to produce canonical-form (PLAIN_ENCODED, uncompressed) parquet outputs. The outputs must all be the same.
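The cross-check loop described above can be sketched as follows. The two driver functions here are stand-ins for what would be external programs (parquet in, parquet out); the sketch only shows carpenter's orchestration logic, not real encoding or decoding:

```python
def driver_reencode(rows, codec):
    # Stand-in for "write with an interesting encoding/codec": the
    # rows pass through unchanged, tagged with the codec requested.
    return {"codec": codec, "rows": list(rows)}

def driver_canonicalize(encoded):
    # Stand-in for "read and re-emit as PLAIN_ENCODED, uncompressed".
    return list(encoded["rows"])

def cross_check(rows, codecs):
    """Every re-encoded file, read back to canonical form by another
    driver, must reproduce the original rows exactly."""
    for codec in codecs:
        encoded = driver_reencode(rows, codec)
        if driver_canonicalize(encoded) != rows:
            return False
    return True

rows = [{"id": 1}, {"id": 2}]
result = cross_check(rows, ["SNAPPY", "ZSTD", "UNCOMPRESSED"])
```

In a real harness the outer loop would also iterate over every (writer, reader) driver pair, which is where the cost of the approach comes from.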

Pros:

Cons:

Extra ideas on top of the above:

pitrou commented 2 days ago

@alkis

Cons:

  • carpenter has a bit of complexity - it needs to be able to decode a subset of parquet to verify equivalence
  • drivers need to be able to slice the data on both columns/rows to enable extensive testing

etseidl commented 2 days ago

I'd personally prefer option 2. I agree with @pitrou that a centralized CI system that can cross-validate all implementations will be very hard (and expensive) to realize. Self-reporting for the purposes of the compatibility matrix is easiest for all, and cheats will be found out soon enough 😄.

Second best would be option 3, but I'm curious how often an implementation would be expected to provide files? The full set for each release, or just one for the earliest release that supports the feature? The former could become quite unwieldy given the quick release cycles of some implementations.

alkis commented 2 days ago

Need to host all important implementations under a single CI job (including closed-source ones? including GPU ones?).

This is a good point. Does it apply to all options? If other options solve this by running some drivers manually/internally (outside of official CI), the same solution can apply here too.

Need to fight to keep the CI execution times reasonable despite the inherent O(n^2) execution strategy (with n being the number of implementations under test). In Apache Arrow, we're down to ~25 minutes after a lot of grunt work (and when we are lucky with ccache etc.), but at times the execution time went up to ~1 hour. With Apache Parquet having more possible variations than Arrow, I'm skeptical this is a good approach.

I am not sure how Arrow does it. Could it be that this can be done better? Does archery track when drivers haven't changed and avoid re-testing/re-generating outputs? If driver binaries do not change very often (and they don't from my experience) this seems like it can avoid a ton of work.

Need to devise a way to specify all variations programmatically for the drivers to obey the carpenter... at the end, you probably either need some JSON representation anyway, or an adhoc replacement for it.

We need to specify what drivers should output anyway, so that work must be done in any option that has a driver. What I am avoiding with my proposal is a way to describe row data - instead I suggest we do that with parquet itself.

wesm commented 2 days ago

Right, and also there might be some closed-source Parquet implementations significant enough to participate in integration testing? Cueing in more informed people @julienledem @wesm @gszadovszky .

My guess is that both Snowflake and Databricks (for their fork of Spark) have made their own implementations, so those would be the two most mainstream platforms that we would want to try to get to participate in an integration testing matrix.

Some open source projects (e.g. Apache Impala -- though I'm not sure how many Impala users are out there anymore) have Parquet implementations within them, and it may be feasible to create a Docker-based setup to turn them into an integration test target (e.g. Ibis has a docker-compose configuration for running tests against Impala https://github.com/ibis-project/ibis/blob/main/compose.yaml).

alamb commented 2 days ago

I am pretty sure DuckDB has their own parquet implementation too that would likely be good to get represented: https://github.com/duckdb/duckdb/blob/main/extension/parquet/parquet_reader.cpp

pitrou commented 2 days ago

Second best would be option 3, but I'm curious how often an implementation would be expected to provide files? The full set for each release, or just one for the earliest release that support the feature? The former could become quite unwieldy given the quick release cycles of some implementations.

I agree that mandating new files for each release isn't reasonable, and we might leave this up to each implementation. For example, they might decide to upload new files if important changes were made in the writer implementation that would yield significantly different binary output.

pitrou commented 2 days ago

Need to host all important implementations under a single CI job (including closed-source ones? including GPU ones?).

This is a good point. Does it apply to all options? If other options solve this by running some drivers manually/internally (outside of official CI), same solution can apply here too.

Well, option 3 is based on files being uploaded to a specific repo (or directory tree), so there's no need for implementations to run alongside each other in the same CI job.

I am not sure how Arrow does it. Could it be that this can be done better? Does archery track when drivers haven't changed and avoid re-testing/re-generating outputs? If driver binaries do not change very often (and they don't from my experience) this seems like it can avoid a ton of work.

No, it's based on a Docker setup and it's purely stateless (apart from the optional storage of compilation caching data). I agree there are certainly ways to make things more optimized, but each optimization adds a layer of complexity and fragility, especially if it involves implementations that are maintained independently from each other, and by different teams.

Also, it is useful for the worst case to remain reasonably short. And since the Docker setup is stateless, anyone can replicate the integration run locally.

gszadovszky commented 2 days ago

Dremio also has its own (closed-source) reader but uses parquet-java for the write path. So Dremio would highly benefit from the generated golden files, but currently it does not make sense for it to provide additional ones.

emkornfield commented 1 day ago

IMO, I like both options 1 and 2. I think for 2, the core framework can be owned by the Parquet community, with documentation on how to integrate custom readers (similar to archery - and @alkis, I like the name carpenter). For CI purposes I think having the tool run between C++/Java would be useful.