apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

ORC-1631: Support `summary` output in `sizes` command #1816

Closed cxzl25 closed 6 months ago

cxzl25 commented 6 months ago

What changes were proposed in this pull request?

Add a `summary` option to the `sizes` command that reports the total number of files, the total file size, and the total number of rows.

Why are the changes needed?

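When the `sizes` command reports the size of each field, it shows only each field's percentage of the total and its average bytes per row. It does not show the overall totals, so the absolute size behind those figures cannot be recovered.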

How was this patch tested?

Tested locally:

java -jar orc-tools-2.1.0-SNAPSHOT-uber.jar sizes -h
usage: sizes
 -h,--help              Print help message
 -i,--ignoreExtension   Ignore ORC file extension
 -s,--summary           Summarize the number of files, file sizes, and
                        file rows
java -jar orc-tools-2.1.0-SNAPSHOT-uber.jar sizes -s
Total Files: 5
Total Sizes: 4803687270
Total Rows: 39820045
Percent  Bytes/Row  Name
  26.41  31.86
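
As a rough illustration of why the totals are useful, the sketch below recovers an absolute column size from the sample output above. It assumes that `Percent` is the column's share of `Total Sizes` and that `Bytes/Row` is the column's bytes divided by `Total Rows`; the helper names are hypothetical and are not part of orc-tools.

```python
# Totals copied from the sample `sizes -s` output above.
total_files = 5
total_bytes = 4_803_687_270
total_rows = 39_820_045


def column_bytes_from_percent(percent: float, total_bytes: int) -> float:
    """Absolute bytes for a column, assuming Percent is its share of the total size."""
    return percent / 100.0 * total_bytes


def bytes_per_row(column_bytes: float, total_rows: int) -> float:
    """Average bytes per row for a column, assuming it is column bytes / total rows."""
    return column_bytes / total_rows


# The one column row shown in the sample output: 26.41 percent, 31.86 bytes/row.
col_bytes = column_bytes_from_percent(26.41, total_bytes)

print(f"column bytes (approx): {col_bytes:.0f}")
print(f"bytes/row: {bytes_per_row(col_bytes, total_rows):.2f}")  # 31.86, matching the sample
print(f"average file size: {total_bytes / total_files:.0f} bytes")
```

Under these assumptions the reported 26.41 percent and 31.86 bytes/row agree with each other once the totals are known, which is the cross-check the summary lines make possible.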

Was this patch authored or co-authored using generative AI tooling?

No