apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0
665 stars 477 forks source link

ORC-1725: [C++] Fix BYTE statistics on platforms with unsigned 'char' #1947

Closed al13n321 closed 1 month ago

al13n321 commented 1 month ago

What changes were proposed in this pull request?

Calculate min/max statistics for columns of type BYTE using signed char instead of char type (C++).

Why are the changes needed?

char can be signed or unsigned, depending on target platform and compiler flags, but the statistics in the ORC file should always treat numbers as signed. Specifically, it was behaving incorrectly on ARM because char is unsigned there.

How was this patch tested?

Tested as part of ClickHouse here: https://github.com/ClickHouse/ClickHouse/pull/64563 . ~Let me know if I should add a unit test or something (though presumably there are already unit tests for this, they just don't run on ARM in CI?)~ Added a small test in TestWriter.cc reproducing the problem.

Was this patch authored or co-authored using generative AI tooling?

No

wgtmac commented 1 month ago

Thanks for the fix! Could you please use the original PR template?

wgtmac commented 1 month ago

cc @ffacs

al13n321 commented 1 month ago

Could you please use the original PR template?

Edited.

ffacs commented 1 month ago

Thank you for making the PR @al13n321. It seems there are no tests to cover the case. Could you please add a little unit test?

al13n321 commented 1 month ago

Added a minimal test. It fails without the fix as expected:

/home/ec2-user/ClickHouse/contrib/orc/c++/test/TestWriter.cc:458: Failure
Expected equality of these values:
  int_stats->getMinimum()
    Which is: 0
  -128
/home/ec2-user/ClickHouse/contrib/orc/c++/test/TestWriter.cc:459: Failure
Expected equality of these values:
  int_stats->getMaximum()
    Which is: 255
  127
/home/ec2-user/ClickHouse/contrib/orc/c++/test/TestWriter.cc:461: Failure
Expected equality of these values:
  int_stats->getSum()
    Which is: 8355585
  sum
    Which is: -32767

Would be good to properly test statistics, with all data types, row index statistics, random values, random stripe and row index sizes, etc, but that's outside the scope of this PR.