apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering #197

Closed emkornfield closed 10 months ago

emkornfield commented 11 months ago

Thanks @etseidl for verifying two implementations!

As they are still in the PoC state, I think the manual verification is sufficient, and I would prefer to delay the interoperability work of adding Parquet files with SizeStatistics to the parquet-testing repo. We can add test files after the implementations are formally reviewed and merged.

WDYT? @emkornfield

I think this is sufficient for now. @wgtmac, if we think the Java PR is close enough, I can start a vote for approval on this PR, unless you want to do it.

wgtmac commented 11 months ago

I think it is a good time to start a vote. @emkornfield

I will take care of the release process once the proposal passes the vote :)

emkornfield commented 11 months ago

Started the vote on the mailing list.

wgtmac commented 10 months ago

Could you help with the vote on the mailing list: https://lists.apache.org/thread/wgobz41mfldbhqpg9q4mdwypghg2cxg2? Help is needed from the PMC members. @ggershinsky @gszadovszky @julienledem @rdblue @wesm @xhochy

emkornfield commented 10 months ago

The vote passed. Is the standard way of merging here using a squash commit?

Fokko commented 10 months ago

@emkornfield I don't think there is a formal agreement on this in the Parquet project. Squashing is my personal preference, and that's what the pre-GitHub merge script does as well: https://github.com/apache/parquet-format/blob/6f3f909ef410852713cc3865aa72e65eb21f9323/dev/merge_parquet_pr.py#L139