apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.55k stars 1.39k forks source link

Statistics is not available for DECIMAL types #2184

Open asfimport opened 6 years ago

asfimport commented 6 years ago

According to parquet format specification columns annotated as DECIMAL should use SIGNED comparator and statistics should be available. The sort order returned by org.apache.parquet.format.converter.ParquetMetadataConverter for DECIMAL is SortOrder.UNKNOWN which contradicts the specification and makes statistics for DECIMAL types unavailable.

Reporter: Vlad Rozov / @vrozov Assignee: Vlad Rozov / @vrozov

Related issues:

Note: This issue was originally created as PARQUET-1322. Please see the migration documentation for further details.

asfimport commented 6 years ago

Gabor Szadovszky / @gszadovszky: @vrozov,

In 1.10.0 we introduced new min/max statistics. See [PARQUET-1025] for details. The code part you are trying to fix is related to the old values only. The problem is that while the INT32 and INT64 representations of DECIMAL worked just fine (signed comparison of the whole 32/64bit numbers) the FIXED_LEN_BYTE_ARRAY and the BINARY representation did not (signed comparison of the bytes one-by-one lexicographically).

So, we can differentiate the two types of representations when checking the sorting order and write the old statistics for the INT32 and INT64 representations of DECIMAL. But, do we want to bother improving the writing of the old statistics values which are already deprecated? The new values support (almost) all the types properly.

asfimport commented 6 years ago

Vlad Rozov / @vrozov: I agree that it is not worth fixing the write code path for the old DECIMAL statistics if the new min/max is available. IMO, it is still necessary to fix the code path for reading DECIMAL statistics old values when new values are not available assuming that old values are correct.

asfimport commented 6 years ago

Gabor Szadovszky / @gszadovszky: Currently, we are not reading the old statistics values for DECIMAL even in cases when the new ones are not filled. (We are reading/writing the old values only if sortOrder(PrimitiveType) is SIGNED) Do you have a unit test or scenario where you can prove it does not work correctly?

asfimport commented 6 years ago

Vlad Rozov / @vrozov: Should statistics be available for DECIMAL if it is encoded as INT32/INT64? For FIXED_LEN_BYTE_ARRAY and BINARY, was BigInteger/BigDecimal used for comparison when old statistics were written or it was compared as byte[]?

asfimport commented 6 years ago

Gabor Szadovszky / @gszadovszky: For the INT32/INT64 we can use the old values as the originally implement comparison was the default signed comparison for the java int and long types. It should work fine, so if you like, you can implement it. This would be more an improvement than a critical bugfix, though. For the old values of Binary types (FIXED, BINARY and INT96) we had been using signed lexicographical comparison byte-by-byte which is not correct for either strings (where at least unsigned lexicographical comparison would be required) or decimal (where a BigInteger like comparison would be required). So, we shall not use nor write the old statistics values for binary types. (If you are interested in the new comparison logic, see the class PrimitiveComparator for details.)

asfimport commented 6 years ago

Vlad Rozov / @vrozov: As statistics for DECIMAL encoded as INT32/INT64 were available prior to PARQUET-686 fix, I would still consider this as a bug, but I do agree that it is not critical. I'll try to put together a fix.