Closed alkis closed 2 months ago
I marked this ready for review. I have tested offline that this method works as expected and has virtually no impact in parse speed of the original FileMetaData
thrift message.
How can we move this forward?
Hi, can we perhaps find another term than "extension" for this feature? The reason is that we may also want to add support for extension types to Parquet (where "extension types" fits naturally with the corresponding feature in Apache Arrow), which might breed confusion when talking about "Parquet extensions". https://lists.apache.org/thread/lcrjoymddllxf8wvq33ddm5fd0tk9mh7
(but perhaps I'm being overly cautious here, what do you think?)
Hi, can we perhaps find another term than "extension" for this feature? The reason is that we may also want to add support for extension types to Parquet (where "extension types" fits naturally with the corresponding feature in Apache Arrow), which might breed confusion when talking about "Parquet extensions". https://lists.apache.org/thread/lcrjoymddllxf8wvq33ddm5fd0tk9mh7
(but perhaps I'm being overly cautious here, what do you think?)
It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions?
For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?
For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?
The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet.
(personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))
It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions?
"Thrift protocol escapes"? "Thrift binary appends"?
The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet.
(personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))
Makes sense. Should we name these extensions type-extensions or third-party-types or external-types? Should we name the extensions in this PR format-extensions (not sure if this is specific enough though).
The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does.
Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac
@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it?
Bonus is that if/when we add flatbuffers the extension points for flatbuffers will also fall under "Metadata Extensions".
The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does.
Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac
Agreed. I think we can simply follow the rule of Arrow's extension type. A naive proposal would be adding a custom key-value metadata field to struct SchemaElement
. WDYT?
@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it?
Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.
Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.
Qualified as "Binary Protocol Extensions".
Friendly ping. Since there are no other comments can we merge this?
Doesn't this require a vote before merging?
Doesn't this require a vote before merging?
Started one here: https://lists.apache.org/thread/x3472kldrq5kjnld9ztj1jozz25f40hg
[INFO] Rat check: Summary over all files. Unapproved: 1, unknown: 1, generated: 0, approved: 28 licenses.
Warning: Files with unapproved licenses:
BinaryProtocolExtensions.md
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11.651 s
[INFO] Finished at: 2024-08-22T15:38:07Z
[INFO] ------------------------------------------------------------------------
Error: Failed to execute goal org.apache.rat:apache-rat-plugin:0.16.1:check (default) on project parquet-format: Too many files with unapproved license: 1 See RAT report in: /home/runner/work/parquet-format/parquet-format/target/rat.txt -> [Help 1]
Error:
Error: To see the full stack trace of the errors, re-run Maven with the -e switch.
Error: Re-run Maven using the -X switch to enable full debug logging.
Error:
Error: For more information about the errors and possible solutions, please read the following articles:
Error: [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Error: Process completed with exit code 1.
It seems that license header should be added before merging.
It seems that license header should be added before merging.
Added.
As the vote has passed, I will merge it if no objection or feedback received before Sep 12.
@wgtmac are we merging this?
@wgtmac are we merging this?
I want to make sure @pitrou is happy with the change.
I want to make sure @pitrou is happy with the change.
He had one suggestion which I accepted.
Specify a backwards/forward compatible way to extend any Thrift struct in Parquet.
ref Parquet binary protocol extensions
Jira
Commits
Documentation