Open asfimport opened 1 year ago
Antoine Pitrou / @pitrou: cc @julienledem @piyushnarang @rdblue @isnotinvain
Gang Wu / @wgtmac: IMHO, the specs is authoritative to the reader implementation to correctly read Parquet files created by different writers. But it is writer implementer's choice to fallback to any standard encoding. This is what the video coding standard does (e.g. H.264/AVC and H.265/HEVC).
What's more, the writer implementation can even rewrite the dictionary page and dictionary-encoded data pages to fallback encoding if fallback happens and discard the dictionary-encoded pages, just like what Apache ORC does. Mixing dictionary encoding and non-dictionary encoding in the same column chunk makes the implementation of features like reading dictionary and predicate pushdown much complicated.
cc [~shangx@uber.com]
Micah Kornfield / @emkornfield: I agree with @wgtmac here. I think we should probably have language like we've done in previous cases like "for maximum compatibility" but then say any mix of page encodings is valid as long as the ordering is valid.
In terms of mixing dictionary encodings with others, it does make things a little bit harder but I don't think we should make it not-allowed (but point out the potential benefits of unified encoding).
The spec for DICTIONARY_ENCODING states that:
https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
However, the parquet-mr implementation was deliberately changed to a different fallback mechanism in https://issues.apache.org/jira/browse/PARQUET-52
I'm assuming the parquet-mr implementation is authoritative here. But then the spec is incorrect and should be fixed to reflect expected behavior.
Reporter: Antoine Pitrou / @pitrou
Note: This issue was originally created as PARQUET-2221. Please see the migration documentation for further details.