Open asfimport opened 6 years ago
Gabor Szadovszky / @gszadovszky: @vrozov,
In 1.10.0 we introduced new min/max statistics. See [PARQUET-1025] for details. The code part you are trying to fix is related to the old values only. The problem is that while the INT32 and INT64 representations of DECIMAL worked just fine (signed comparison of the whole 32/64bit numbers) the FIXED_LEN_BYTE_ARRAY and the BINARY representation did not (signed comparison of the bytes one-by-one lexicographically).
So, we can differentiate the two types of representations when checking the sorting order and write the old statistics for the INT32 and INT64 representations of DECIMAL. But, do we want to bother improving the writing of the old statistics values which are already deprecated? The new values support (almost) all the types properly.
Vlad Rozov / @vrozov: I agree that it is not worth fixing the write code path for the old DECIMAL statistics if the new min/max is available. IMO, it is still necessary to fix the code path for reading DECIMAL statistics old values when new values are not available assuming that old values are correct.
Gabor Szadovszky / @gszadovszky: Currently, we are not reading the old statistics values for DECIMAL even in cases when the new ones are not filled. (We are reading/writing the old values only if sortOrder(PrimitiveType) is SIGNED) Do you have a unit test or scenario where you can prove it does not work correctly?
Vlad Rozov / @vrozov: Should statistics be available for DECIMAL if it is encoded as INT32/INT64? For FIXED_LEN_BYTE_ARRAY and BINARY, was BigInteger/BigDecimal used for comparison when old statistics were written or it was compared as byte[]?
Gabor Szadovszky / @gszadovszky: For the INT32/INT64 we can use the old values as the originally implement comparison was the default signed comparison for the java int and long types. It should work fine, so if you like, you can implement it. This would be more an improvement than a critical bugfix, though. For the old values of Binary types (FIXED, BINARY and INT96) we had been using signed lexicographical comparison byte-by-byte which is not correct for either strings (where at least unsigned lexicographical comparison would be required) or decimal (where a BigInteger like comparison would be required). So, we shall not use nor write the old statistics values for binary types. (If you are interested in the new comparison logic, see the class PrimitiveComparator for details.)
Vlad Rozov / @vrozov: As statistics for DECIMAL encoded as INT32/INT64 were available prior to PARQUET-686 fix, I would still consider this as a bug, but I do agree that it is not critical. I'll try to put together a fix.
According to parquet format specification columns annotated as DECIMAL should use SIGNED comparator and statistics should be available. The sort order returned by
org.apache.parquet.format.converter.ParquetMetadataConverter
for DECIMAL isSortOrder.UNKNOWN
which contradicts the specification and makes statistics for DECIMAL types unavailable.Reporter: Vlad Rozov / @vrozov Assignee: Vlad Rozov / @vrozov
Related issues:
PRs and other links:
Note: This issue was originally created as PARQUET-1322. Please see the migration documentation for further details.