Open alamb opened 5 months ago
Note that Arrow is somewhat different than Parquet in that most of the Arrow implementations are maintained by the Apache Arrow project itself. In comparison, I believe most of the Parquet implementations are maintained by projects / teams other than Apache Parquet.
Right, and also there might be some closed-source Parquet implementations significant enough to participate in integration testing? Cueing in more informed people @julienledem @wesm @gszadovszky .
There's also at least one GPU implementation in cuDF, which would complicate integration CI if all implementations had to be run together: https://docs.rapids.ai/api/cudf/stable/user_guide/10min/#reading-writing-parquet-files
carpenter
The data in parquet-testing are a collection of PLAIN_ENCODED uncompressed parquet files and the carpenter binary.
The carpenter is responsible for invoking drivers from different implementations to test them against each other. carpenter also has the capability of diffing two parquet files for equivalence, but only when both are PLAIN_ENCODED and uncompressed.
Drivers read and write parquet files.
Reading is specified by a subset of columns to read and, optionally, push-down filters. This allows column/row slicing of the original data. Writing is specified by a potentially explicit selection of codecs per column.
Drivers are filters: parquet in, parquet out, where the output can contain the same data as the input or less, and can potentially be encoded differently.
carpenter's job is to generate slices of the input parquet files, generate "interesting" parquet files as output (encoded with different encodings), and read those with all other drivers to generate the canonical form (PLAIN_ENCODED uncompressed) parquet outputs. The outputs must all be the same.
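The cross-check described above could be sketched roughly as follows, using only the Python standard library. Everything here is a hypothetical illustration: the `Driver` class, `cross_check`, and the toy drivers are invented names, real drivers would be separate binaries that read and write Parquet, and sorted JSON bytes stand in for the PLAIN_ENCODED uncompressed canonical form.

```python
import json
import zlib
from typing import Callable, List, Tuple

Rows = List[dict]

class Driver:
    """Hypothetical driver: writes rows in some encoding, reads any encoding."""
    def __init__(self, name: str,
                 write: Callable[[Rows], bytes],
                 read: Callable[[bytes], Rows]):
        self.name, self.write, self.read = name, write, read

def canonical(rows: Rows) -> bytes:
    """Stand-in for the PLAIN_ENCODED, uncompressed canonical form."""
    return json.dumps(rows, sort_keys=True).encode()

def cross_check(rows: Rows, drivers: List[Driver]) -> List[Tuple[str, str]]:
    """Return (writer, reader) pairs whose canonical output differs from the input."""
    expected = canonical(rows)
    failures = []
    for writer in drivers:              # every driver produces an "interesting" file
        encoded = writer.write(rows)
        for reader in drivers:          # every driver must decode it identically
            if canonical(reader.read(encoded)) != expected:
                failures.append((writer.name, reader.name))
    return failures

def decode(data: bytes) -> Rows:
    """Toy format: a one-byte tag (P = plain, Z = zlib) followed by JSON."""
    tag, body = data[:1], data[1:]
    if tag == b"Z":
        body = zlib.decompress(body)
    return json.loads(body)

plain = Driver("plain", lambda r: b"P" + json.dumps(r).encode(), decode)
zipped = Driver("zipped",
                lambda r: b"Z" + zlib.compress(json.dumps(r).encode()), decode)
# A deliberately buggy reader that drops the last row of compressed files.
buggy = Driver("buggy", plain.write,
               lambda d: decode(d)[:-1] if d[:1] == b"Z" else decode(d))

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
print(cross_check(rows, [plain, zipped, buggy]))
```

The nested loop over writers and readers is where the pairwise cost of this design comes from, and the list of failing pairs is the raw material for a per-driver compatibility matrix.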
Pros:
Cons:
- carpenter has a bit of complexity - it needs to be able to decode a subset of parquet to verify equivalence
Extra ideas on top of the above:
- carpenter can bisect the cols of the parquet to find which column is not decodable. This can make bug detection automatic. It can even generate a matrix across all drivers
- carpenter can be a benchmarking harness as well, by invoking the drivers with push down or partial col selection
@alkis
Cons:
- carpenter has a bit of complexity - it needs to be able to decode a subset of parquet to verify equivalence
- drivers need to be able to slice the data on both cols/rows to make extensive testing possible
Need to fight to keep the CI execution times reasonable despite the inherent O(n^2) execution strategy (with n being the number of implementations under test). In Apache Arrow, we're down to ~25 minutes after a lot of grunt work (and when we are lucky with ccache etc.), but at times the execution time went up to ~1 hour. With Apache Parquet having more possible variations than Arrow, I'm skeptical this is a good approach.
I'd personally prefer option 2. I agree with @pitrou that a centralized CI system that can cross validate all implementations will be very hard (and expensive) to realize. Self reporting for the purposes of the compatibility matrix is easiest for all, and cheats will be found out soon enough 😄.
Second best would be option 3, but I'm curious how often an implementation would be expected to provide files? The full set for each release, or just one for the earliest release that supports the feature? The former could become quite unwieldy given the quick release cycles of some implementations.
Need to host all important implementations under a single CI job (including closed-source ones? including GPU ones?).
This is a good point. Does it apply to all options? If other options solve this by running some drivers manually/internally (outside of official CI), same solution can apply here too.
Need to fight to keep the CI execution times reasonable despite the inherent O(n^2) execution strategy (with n being the number of implementations under test). In Apache Arrow, we're down to ~25 minutes after a lot of grunt work (and when we are lucky with ccache etc.), but at times the execution time went up to ~1 hour. With Apache Parquet having more possible variations than Arrow, I'm skeptical this is a good approach.
I am not sure how Arrow does it. Could it be that this can be done better? Does archery track when drivers haven't changed and avoid re-testing/re-generating outputs? If driver binaries do not change very often (and they don't from my experience) this seems like it can avoid a ton of work.
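To make the question concrete, here is one way "skip work when the driver hasn't changed" could look. This is a hypothetical sketch, not a description of how archery works: the `ResultCache` class and its cache key (hash of the driver binary plus hash of the input file) are assumptions for illustration, and the driver invocation is replaced by a plain Python function.

```python
import hashlib
from typing import Callable, Dict, Tuple

def digest(data: bytes) -> str:
    """Content hash used to detect whether a binary or input file changed."""
    return hashlib.sha256(data).hexdigest()

class ResultCache:
    """Hypothetical cache keyed on (driver binary hash, input file hash)."""
    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], bytes] = {}
        self.runs = 0  # how many times a driver was actually invoked

    def run(self, driver_binary: bytes, input_file: bytes,
            invoke: Callable[[bytes], bytes]) -> bytes:
        key = (digest(driver_binary), digest(input_file))
        if key not in self._store:
            self.runs += 1
            self._store[key] = invoke(input_file)
        return self._store[key]

cache = ResultCache()
driver_v1 = b"driver build 1"
upper = lambda data: data.upper()  # stand-in for running the real driver

cache.run(driver_v1, b"file-a", upper)
cache.run(driver_v1, b"file-a", upper)          # unchanged: served from cache
cache.run(b"driver build 2", b"file-a", upper)  # new binary: re-run
print(cache.runs)
```

The trade-off is that the cache itself becomes state that has to be stored, shared, and invalidated correctly across independently maintained implementations.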
Need to devise a way to specify all variations programmatically for the drivers to obey the carpenter... at the end, you probably either need some JSON representation anyway, or an adhoc replacement for it.
We need to specify what to output from drivers anyway so that work must be done in any option with a driver. What I am avoiding with my proposal is a way to describe rows data - instead I suggest we do that with parquet itself.
Right, and also there might be some closed-source Parquet implementations significant enough to participate in integration testing? Cueing in more informed people @julienledem @wesm @gszadovszky .
My guess is that both Snowflake and Databricks (for their fork of Spark) have made their own implementations, so those would be the two most mainstream platforms that we would want to try to get to participate in an integration testing matrix.
Some open source projects (e.g. Apache Impala -- though I'm not sure how many Impala users are out there anymore) have Parquet implementations within them, and it may be feasible to create a Docker-based setup to turn them into an integration test target (e.g. Ibis has a docker-compose configuration for running tests against Impala https://github.com/ibis-project/ibis/blob/main/compose.yaml).
I am pretty sure DuckDB has their own parquet implementation too that would likely be good to get represented: https://github.com/duckdb/duckdb/blob/main/extension/parquet/parquet_reader.cpp
Second best would be option 3, but I'm curious how often an implementation would be expected to provide files? The full set for each release, or just one for the earliest release that supports the feature? The former could become quite unwieldy given the quick release cycles of some implementations.
I agree that mandating new files for each release isn't reasonable, and we might leave this up to each implementation. For example, they might decide to upload new files if important changes were made in the writer implementation that would yield significantly different binary output.
Need to host all important implementations under a single CI job (including closed-source ones? including GPU ones?).
This is a good point. Does it apply to all options? If other options solve this by running some drivers manually/internally (outside of official CI), same solution can apply here too.
Well, option 3 is based on files being uploaded to a specific repo (or directory tree), so there's no need for implementations to run alongside each other in the same CI job.
I am not sure how Arrow does it. Could it be that this can be done better? Does archery track when drivers haven't changed and avoid re-testing/re-generating outputs? If driver binaries do not change very often (and they don't from my experience) this seems like it can avoid a ton of work.
No, it's based on a Docker setup and it's purely stateless (apart from the optional storage of compilation caching data). I agree there are certainly ways to make things more optimized, but each optimization adds a layer of complexity and fragility, especially if it involves implementations that are maintained independently from each other, and by different teams.
Also, it is useful for the worst case to remain reasonably short. For example, using a Docker stateless setup anyone can replicate the integration locally.
Dremio also has its own (closed source) reader but it uses parquet-java for the write path. So, Dremio would highly benefit from the generated golden files, but currently it does not make sense to provide additional ones.
IMO, I like both options 1 and 2. I think for 2, the core framework can be owned by the parquet community with documentation on how to integrate custom readers (similar to archery, and @alkis I like the name carpenter). For CI purposes I think having the tool run between C++/Java would be useful.
See related mailing list discussion: https://lists.apache.org/thread/kd3k4q691lp5c4q3r767zb8jltrm9z33
Background
In https://github.com/apache/parquet-site/pull/34 we are adding an "implementation status" matrix for different parquet implementations, to help people understand the supported feature sets of various parquet implementations across the ecosystem.
As we work to fill out this matrix for various parquet implementations, the question arises what does "supports a particular Parquet feature" mean, precisely?
One way to provide a precise definition is to provide a way to automate the check for each feature.
Prior Art
parquet-testing
The parquet-testing repository contains example parquet files written with various different features.
The README.md file contains brief descriptions of the contents of these files, but there is no machine readable description of the data contained within those files.
Apache Arrow
Apache Arrow has a similar feature chart: https://arrow.apache.org/docs/status.html
Part of maintaining this chart is a comprehensive integration suite which programmatically checks if data created by one implementation of Arrow can be read by others.
The suite is implemented using a single integration tool called archery, maintained by the Arrow project in the apache/arrow-testing github repo. Each implementation of Arrow implements a driver program that accepts inputs / generates outputs in a known format, and archery then orchestrates running that driver program.
There are also a number of known "gold files" here which contain JSON representations of data stored in gold master arrow files.
Note that Arrow is somewhat different than Parquet in that most of the Arrow implementations are maintained by the Apache Arrow project itself. In comparison, I believe most of the Parquet implementations are maintained by projects / teams other than Apache Parquet.
Options
Here are some ideas of what a Parquet compatibility test might look like
Option 1: integration harness similar to archery
In this case, an integration harness similar to archery would handle automatically verifying different implementations. This harness could orchestrate workflows such as reading gold parquet files, as well as writing parquet data with one implementation, reading it with another, and verifying their compatibility.
Pros:
Cons:
Option 2: Add golden files to parquet-testing
In this option, we would add golden files to the parquet-testing repo (e.g. JSON formatted) corresponding to each existing .parquet file.
Each implementation could then check compatibility by creating their own driver program.
This approach has a (very) rough prototype here: https://github.com/apache/arrow-rs/pull/5956
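As a rough illustration of what such a per-implementation driver might do, the sketch below compares decoded column values against a golden JSON document. The golden layout (a top-level `"columns"` object mapping column names to value lists) and the `compare_to_golden` function are assumptions made up for this example; they are not the format used by the prototype linked above, and the Parquet decoding step is replaced by a hand-written dict.

```python
import json
from typing import Dict, List

def compare_to_golden(decoded: Dict[str, list], golden_json: str) -> List[str]:
    """Return human-readable mismatches between decoded data and the golden file."""
    golden = json.loads(golden_json)["columns"]
    problems = []
    for column, expected in golden.items():
        if column not in decoded:
            problems.append(f"missing column: {column}")
        elif decoded[column] != expected:
            problems.append(f"value mismatch in column: {column}")
    for column in decoded:
        if column not in golden:
            problems.append(f"unexpected column: {column}")
    return problems

# Hypothetical golden file content for a two-column .parquet file.
golden_json = '{"columns": {"id": [1, 2, 3], "name": ["a", null, "c"]}}'

# Stand-in for what an implementation's reader returned for that file.
decoded = {"id": [1, 2, 3], "name": ["a", None, "c"]}
print(compare_to_golden(decoded, golden_json))
```

An empty result would let an implementation self-report "reads this feature correctly" for the corresponding file, without running alongside any other implementation.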
Pros:
Cons
Option 3: Add golden files and files written by other implementations to parquet-testing
@pitrou suggested what I think is an extension of option 2 on https://github.com/apache/arrow-rs/pull/5956#issuecomment-2191142596
I think this mechanism would allow for cross-implementation integration testing without requiring a unified harness