apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

ORC-1699: Fix SparkBenchmark in Parquet format according to SPARK-40918 #1908

Closed. cxzl25 closed this pull request 2 months ago.

cxzl25 commented 2 months ago

What changes were proposed in this pull request?

This PR aims to fix SparkBenchmark in Parquet format according to SPARK-40918.

Why are the changes needed?

Similar to ORC-1578, SparkBenchmark hits the same failure when reading Parquet-format files:

java.lang.IllegalArgumentException: OPTION_RETURNING_BATCH should always be set for ParquetFileFormat. To workaround this issue, set spark.sql.parquet.enableVectorizedReader=false.
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:192)
    at scala.collection.immutable.Map$EmptyMap$.getOrElse(Map.scala:110)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:191)
    at org.apache.orc.bench.spark.SparkBenchmark.pushDown(SparkBenchmark.java:314)
    at org.apache.orc.bench.spark.jmh_generated.SparkBenchmark_pushDown_jmhTest.pushDown_avgt_jmhStub(SparkBenchmark_pushDown_jmhTest.java:219)
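The exception arises because, since SPARK-40918, any caller of `ParquetFileFormat.buildReaderWithPartitionValues` must declare in the per-scan options whether it expects columnar batches back. A minimal sketch of the idea, not the exact patch: the constant name and key string below are assumed from Spark's `FileFormat.OPTION_RETURNING_BATCH` and should be verified against the Spark version in use.

```java
import java.util.HashMap;
import java.util.Map;

public class ParquetReaderOptions {
    // Key assumed from Spark's FileFormat.OPTION_RETURNING_BATCH
    // (introduced by SPARK-40918); verify against your Spark version.
    static final String OPTION_RETURNING_BATCH = "returning_batch";

    // Build the per-scan options map a benchmark would hand to
    // ParquetFileFormat.buildReaderWithPartitionValues. Omitting the
    // returning-batch entry triggers the IllegalArgumentException above.
    static Map<String, String> scanOptions(boolean vectorized) {
        Map<String, String> options = new HashMap<>();
        // "true" only when the caller can consume ColumnarBatch rows.
        options.put(OPTION_RETURNING_BATCH, Boolean.toString(vectorized));
        return options;
    }

    public static void main(String[] args) {
        System.out.println(scanOptions(false).get(OPTION_RETURNING_BATCH));
    }
}
```

Alternatively, as the error message itself suggests, setting `spark.sql.parquet.enableVectorizedReader=false` works around the problem by avoiding the vectorized reader entirely, at the cost of benchmark fidelity.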

How was this patch tested?

Tested locally.

Was this patch authored or co-authored using generative AI tooling?

No

dongjoon-hyun commented 2 months ago

Thank you! Merged to main/2.0/1.9.