apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0
1.81k stars 431 forks source link

[Format] Encoding spec incorrect for dictionary fallback #404

Open asfimport opened 1 year ago

asfimport commented 1 year ago

The spec for DICTIONARY_ENCODING states that:

If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding.

https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8

However, the parquet-mr implementation was deliberately changed to a different fallback mechanism in https://issues.apache.org/jira/browse/PARQUET-52

I'm assuming the parquet-mr implementation is authoritative here. But then the spec is incorrect and should be fixed to reflect expected behavior.

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as PARQUET-2221. Please see the migration documentation for further details.

asfimport commented 1 year ago

Antoine Pitrou / @pitrou: cc @julienledem @piyushnarang @rdblue @isnotinvain

asfimport commented 1 year ago

Gang Wu / @wgtmac: IMHO, the specs is authoritative to the reader implementation to correctly read Parquet files created by different writers. But it is writer implementer's choice to fallback to any standard encoding. This is what the video coding standard does (e.g. H.264/AVC and H.265/HEVC).

What's more, the writer implementation can even rewrite the dictionary page and dictionary-encoded data pages to fallback encoding if fallback happens and discard the dictionary-encoded pages, just like what Apache ORC does. Mixing dictionary encoding and non-dictionary encoding in the same column chunk makes the implementation of features like reading dictionary and predicate pushdown much complicated.

cc [~shangx@uber.com]

asfimport commented 12 months ago

Micah Kornfield / @emkornfield: I agree with @wgtmac here.  I think we should probably have language like we've done in previous cases like "for maximum compatibility" but then say any mix of page encodings is valid as long as the ordering is valid.

 

In terms of mixing dictionary encodings with others, it does make things a little bit harder but I don't think we should make it not-allowed (but point out the potential benefits of unified encoding).