Closed XinyuZeng closed 9 months ago
cc @stiga-huang @ffacs @coderex2522
+1, LGTM
@XinyuZeng , could you revise the PR title and description? For example, PR title seems to become inconsistent from the AS-IS code.
Done
Merged to main/2.0 for Apache ORC 2.0.0.
What changes were proposed in this pull request?
See https://github.com/apache/orc/issues/1430 - default DataBuffer constructor now uses
reserve
.I also add a
zeroOut
function to manually memset the buffer. For example,MapColumnWrite::add
may use uninitializedoffsets
value (atuint64_t elemOffset = static_cast<uint64_t>(offsets[0]);
). i.e., not memset theoffsets
buffer will result in segfault because the code implicitly assume the value should be 0. Therefore I stillzeroOut
the buffer for Map, List and Union, even though List and Union pass the unit tests without memset.Why are the changes needed?
Reduce overhead in
ColumnVectorBatch
construction and compression/decompression buffer constructionHow was this patch tested?
Correctness via
make test-out
. I further useorc-scan
utility, settingbatchSize
to 1 million and then scan a file with 1 million rows and 20 columns. After this PR, the time reduces from 2.9s to 2.2s. The compression/decompression buffer overhead reported in #1430 can be verified by the reporter if necessary.Was this patch authored or co-authored using generative AI tooling?
No