apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0
689 stars 483 forks source link

ORC-1578: Fix `SparkBenchmark` on `sales` data according to SPARK-40918 #1734

Closed dongjoon-hyun closed 10 months ago

dongjoon-hyun commented 10 months ago

What changes were proposed in this pull request?

This PR aims to fix SparkBenchmark according to the requirement of SPARK-40918.

Note that this fixes the synthetic benchmark on Sales data. For the other real-life dataset (github and taxi), we will revisit.

Why are the changes needed?

  1. Generate Sales data

    $ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 1000000
  2. Run Spark Benchmark

    
    $ java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -d sales -f orc
    # Run complete. Total time: 00:10:45

Benchmark (compression) (dataset) (format) Mode Cnt Score Error Units SparkBenchmark.fullRead gz sales orc avgt 5 686792.235 ± 4398.971 us/op SparkBenchmark.fullRead:bytesPerRecord gz sales orc avgt 5 0.192 # SparkBenchmark.fullRead:ops gz sales orc avgt 5 40.000 # SparkBenchmark.fullRead:perRecord gz sales orc avgt 5 0.687 ± 0.004 us/op SparkBenchmark.fullRead:records gz sales orc avgt 5 5000000.000 # SparkBenchmark.fullRead snappy sales orc avgt 5 286166.380 ± 19864.429 us/op SparkBenchmark.fullRead:bytesPerRecord snappy sales orc avgt 5 0.201 # SparkBenchmark.fullRead:ops snappy sales orc avgt 5 40.000 # SparkBenchmark.fullRead:perRecord snappy sales orc avgt 5 0.286 ± 0.020 us/op SparkBenchmark.fullRead:records snappy sales orc avgt 5 5000000.000 # SparkBenchmark.fullRead zstd sales orc avgt 5 384394.233 ± 10057.315 us/op SparkBenchmark.fullRead:bytesPerRecord zstd sales orc avgt 5 0.192 # SparkBenchmark.fullRead:ops zstd sales orc avgt 5 40.000 # SparkBenchmark.fullRead:perRecord zstd sales orc avgt 5 0.384 ± 0.010 us/op SparkBenchmark.fullRead:records zstd sales orc avgt 5 5000000.000 # SparkBenchmark.partialRead gz sales orc avgt 5 41683.914 ± 4046.077 us/op SparkBenchmark.partialRead:bytesPerRecord gz sales orc avgt 5 0.192 # SparkBenchmark.partialRead:ops gz sales orc avgt 5 40.000 # SparkBenchmark.partialRead:perRecord gz sales orc avgt 5 0.042 ± 0.004 us/op SparkBenchmark.partialRead:records gz sales orc avgt 5 5000000.000 # SparkBenchmark.partialRead snappy sales orc avgt 5 23981.054 ± 17874.229 us/op SparkBenchmark.partialRead:bytesPerRecord snappy sales orc avgt 5 0.201 # SparkBenchmark.partialRead:ops snappy sales orc avgt 5 40.000 # SparkBenchmark.partialRead:perRecord snappy sales orc avgt 5 0.024 ± 0.018 us/op SparkBenchmark.partialRead:records snappy sales orc avgt 5 5000000.000 # SparkBenchmark.partialRead zstd sales orc avgt 5 41433.277 ± 25110.021 us/op SparkBenchmark.partialRead:bytesPerRecord zstd sales orc avgt 5 0.192 # SparkBenchmark.partialRead:ops zstd sales orc avgt 5 40.000 # SparkBenchmark.partialRead:perRecord zstd sales orc avgt 5 0.041 ± 0.025 us/op SparkBenchmark.partialRead:records zstd sales orc avgt 5 5000000.000 # SparkBenchmark.pushDown gz sales orc avgt 5 23760.997 ± 833.034 us/op SparkBenchmark.pushDown:bytesPerRecord gz sales orc avgt 5 19.153 # SparkBenchmark.pushDown:ops gz sales orc avgt 5 40.000 # SparkBenchmark.pushDown:perRecord gz sales orc avgt 5 2.376 ± 0.083 us/op SparkBenchmark.pushDown:records gz sales orc avgt 5 50000.000 # SparkBenchmark.pushDown snappy sales orc avgt 5 14062.508 ± 1793.691 us/op SparkBenchmark.pushDown:bytesPerRecord snappy sales orc avgt 5 20.105 # SparkBenchmark.pushDown:ops snappy sales orc avgt 5 40.000 # SparkBenchmark.pushDown:perRecord snappy sales orc avgt 5 1.406 ± 0.179 us/op SparkBenchmark.pushDown:records snappy sales orc avgt 5 50000.000 # SparkBenchmark.pushDown zstd sales orc avgt 5 15597.651 ± 1307.246 us/op SparkBenchmark.pushDown:bytesPerRecord zstd sales orc avgt 5 19.213 # SparkBenchmark.pushDown:ops zstd sales orc avgt 5 40.000 # SparkBenchmark.pushDown:perRecord zstd sales orc avgt 5 1.560 ± 0.131 us/op SparkBenchmark.pushDown:records zstd sales orc avgt 5 50000.000 #



### How was this patch tested?

Pass the CIs.