apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

ORC-1553: Row grouping reads should be skipped when the statistics are written without any values for the SArg column #1692

Closed guiyanakuang closed 10 months ago

guiyanakuang commented 10 months ago

What changes were proposed in this pull request?

This PR fixes an issue where column statistics were evaluated incorrectly when no values had been written for the SArg column, which prevented the reader from skipping the affected row groups.

Why are the changes needed?

The fix corrects how such statistics are evaluated, so row groups that don't need to be read can be skipped, which improves read performance.
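
To illustrate the idea, here is a minimal sketch of the kind of check involved. It is not the code from this patch: `RowGroupSkipSketch`, `ColumnStats`, and `mightContainEquals` are hypothetical names, and the statistics are reduced to a value count plus an optional min/max. The assumption is that a row group whose statistics record zero non-null values cannot satisfy an equality predicate, so it can be skipped instead of being treated as "unknown, must read".

```java
public class RowGroupSkipSketch {

    // Hypothetical stand-in for the statistics kept per row group and column.
    public static final class ColumnStats {
        final long numberOfValues; // count of non-null values written
        final boolean hasNull;     // whether any null was written
        final Long min;            // null when no min/max was recorded
        final Long max;

        public ColumnStats(long numberOfValues, boolean hasNull, Long min, Long max) {
            this.numberOfValues = numberOfValues;
            this.hasNull = hasNull;
            this.min = min;
            this.max = max;
        }
    }

    // Returns true when the row group may contain a row where column == literal.
    public static boolean mightContainEquals(ColumnStats stats, long literal) {
        if (stats.numberOfValues == 0) {
            // No non-null values were written, so an equality predicate
            // cannot match any row: the row group can be skipped.
            return false;
        }
        if (stats.min == null || stats.max == null) {
            // Values exist but no range was recorded: we cannot rule anything out.
            return true;
        }
        // Usual min/max range check.
        return stats.min <= literal && literal <= stats.max;
    }

    public static void main(String[] args) {
        ColumnStats noValues = new ColumnStats(0, true, null, null);
        ColumnStats withRange = new ColumnStats(100, false, 10L, 20L);

        System.out.println(mightContainEquals(noValues, 15));  // skip this row group
        System.out.println(mightContainEquals(withRange, 15)); // must read
        System.out.println(mightContainEquals(withRange, 42)); // skip this row group
    }
}
```

The sketch covers only equality; a predicate such as `IS NULL` would need the opposite answer for a row group that contains nothing but nulls.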

How was this patch tested?

Unit tests have been added to validate the changes.

guiyanakuang commented 10 months ago

@neopaf Can you test if this PR works for your data?

guiyanakuang commented 10 months ago

cc @dongjoon-hyun @wgtmac

dongjoon-hyun commented 10 months ago

Do you still have concerns, @wgtmac ?

dongjoon-hyun commented 10 months ago

Let me merge this to be considered as a part of Apache ORC 1.9.x and 2.0.0. We can revert this if there are any issues during the release cycles~