apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0
1.81k stars 432 forks source link

PARQUET-2492: Add binary protocol extensions #254

Closed alkis closed 2 months ago

alkis commented 5 months ago

Specify a backwards/forward compatible way to extend any Thrift struct in Parquet.

ref Parquet binary protocol extensions

Jira

Commits

Documentation

alkis commented 5 months ago

I marked this ready for review. I have tested offline that this method works as expected and has virtually no impact in parse speed of the original FileMetaData thrift message.

How can we move this forward?

pitrou commented 3 months ago

Hi, can we perhaps find another term than "extension" for this feature? The reason is that we may also want to add support for extension types to Parquet (where "extension types" fits naturally with the corresponding feature in Apache Arrow), which might breed confusion when talking about "Parquet extensions". https://lists.apache.org/thread/lcrjoymddllxf8wvq33ddm5fd0tk9mh7

(but perhaps I'm being overly cautious here, what do you think?)

alkis commented 3 months ago

Hi, can we perhaps find another term than "extension" for this feature? The reason is that we may also want to add support for extension types to Parquet (where "extension types" fits naturally with the corresponding feature in Apache Arrow), which might breed confusion when talking about "Parquet extensions". https://lists.apache.org/thread/lcrjoymddllxf8wvq33ddm5fd0tk9mh7

(but perhaps I'm being overly cautious here, what do you think?)

It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions?

For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?

pitrou commented 3 months ago

For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?

The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet.

(personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))

pitrou commented 3 months ago

It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions?

"Thrift protocol escapes"? "Thrift binary appends"?

alkis commented 3 months ago

The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet.

(personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))

Makes sense. Should we name these extensions type-extensions or third-party-types or external-types? Should we name the extensions in this PR format-extensions (not sure if this is specific enough though).

pitrou commented 3 months ago

The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does.

Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac

alkis commented 3 months ago

@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it?

Bonus is that if/when we add flatbuffers the extension points for flatbuffers will also fall under "Metadata Extensions".

wgtmac commented 3 months ago

The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does.

Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac

Agreed. I think we can simply follow the rule of Arrow's extension type. A naive proposal would be adding a custom key-value metadata field to struct SchemaElement. WDYT?

pitrou commented 3 months ago

@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it?

Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.

alkis commented 3 months ago

Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.

Qualified as "Binary Protocol Extensions".

alkis commented 2 months ago

Friendly ping. Since there are no other comments can we merge this?

pitrou commented 2 months ago

Doesn't this require a vote before merging?

alkis commented 2 months ago

Doesn't this require a vote before merging?

Started one here: https://lists.apache.org/thread/x3472kldrq5kjnld9ztj1jozz25f40hg

wgtmac commented 2 months ago
[INFO] Rat check: Summary over all files. Unapproved: 1, unknown: 1, generated: 0, approved: 28 licenses.
Warning:  Files with unapproved licenses:
  BinaryProtocolExtensions.md

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.651 s
[INFO] Finished at: 2024-08-22T15:38:07Z
[INFO] ------------------------------------------------------------------------
Error:  Failed to execute goal org.apache.rat:apache-rat-plugin:0.16.1:check (default) on project parquet-format: Too many files with unapproved license: 1 See RAT report in: /home/runner/work/parquet-format/parquet-format/target/rat.txt -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Error: Process completed with exit code 1.

It seems that license header should be added before merging.

alkis commented 2 months ago

It seems that license header should be added before merging.

Added.

wgtmac commented 2 months ago

As the vote has passed, I will merge it if no objection or feedback received before Sep 12.

alkis commented 2 months ago

@wgtmac are we merging this?

wgtmac commented 2 months ago

@wgtmac are we merging this?

I want to make sure @pitrou is happy with the change.

alkis commented 2 months ago

I want to make sure @pitrou is happy with the change.

He had one suggestion which I accepted.