apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0
689 stars 483 forks source link

ORC-1580: Change default DataBuffer constructor to use reserve instead of resize #1739

Closed XinyuZeng closed 9 months ago

XinyuZeng commented 10 months ago

What changes were proposed in this pull request?

See https://github.com/apache/orc/issues/1430 - default DataBuffer constructor now uses reserve.

I also add a zeroOut function to manually memset the buffer. For example, MapColumnWrite::add may use uninitialized offsets value (at uint64_t elemOffset = static_cast<uint64_t>(offsets[0]);). i.e., not memset the offsets buffer will result in segfault because the code implicitly assume the value should be 0. Therefore I still zeroOut the buffer for Map, List and Union, even though List and Union pass the unit tests without memset.

Why are the changes needed?

Reduce overhead in ColumnVectorBatch construction and compression/decompression buffer construction

How was this patch tested?

Correctness via make test-out. I further use orc-scan utility, setting batchSize to 1 million and then scan a file with 1 million rows and 20 columns. After this PR, the time reduces from 2.9s to 2.2s. The compression/decompression buffer overhead reported in #1430 can be verified by the reporter if necessary.

Was this patch authored or co-authored using generative AI tooling?

No

wgtmac commented 9 months ago

cc @stiga-huang @ffacs @coderex2522

ffacs commented 9 months ago

+1, LGTM

XinyuZeng commented 9 months ago

@XinyuZeng , could you revise the PR title and description? For example, PR title seems to become inconsistent from the AS-IS code.

Done

dongjoon-hyun commented 9 months ago

Merged to main/2.0 for Apache ORC 2.0.0.