Spark write ORC error: Java heap space

chenwyi2 commented 1 year ago

Apache Iceberg version

1.2.1

Query engine

Spark

Please describe the bug 🐞

I am using Spark 3.1. When I use Spark to write to an Iceberg table in ORC format, the write fails with a Java heap space error. The stack trace is below:

java.lang.OutOfMemoryError: Java heap space
    at org.apache.iceberg.shaded.org.apache.orc.storage.ql.exec.vector.LongColumnVector.ensureSize(LongColumnVector.java:314)
    at org.apache.iceberg.shaded.org.apache.orc.storage.ql.exec.vector.StructColumnVector.ensureSize(StructColumnVector.java:136)
    at org.apache.iceberg.spark.data.SparkOrcValueWriters.growColumnVector(SparkOrcValueWriters.java:198)
    at org.apache.iceberg.spark.data.SparkOrcValueWriters.access$300(SparkOrcValueWriters.java:39)
    at org.apache.iceberg.spark.data.SparkOrcValueWriters$ListWriter.nonNullWrite(SparkOrcValueWriters.java:137)
    at org.apache.iceberg.spark.data.SparkOrcValueWriters$ListWriter.nonNullWrite(SparkOrcValueWriters.java:116)
    at org.apache.iceberg.orc.OrcValueWriter.write(OrcValueWriter.java:42)
    at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.write(GenericOrcWriters.java:483)
    at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.nonNullWrite(GenericOrcWriters.java:469)
    at org.apache.iceberg.orc.OrcValueWriter.write(OrcValueWriter.java:42)
    at org.apache.iceberg.spark.data.SparkOrcValueWriters$ListWriter.nonNullWrite(SparkOrcValueWriters.java:140)
    at org.apache.iceberg.spark.data.SparkOrcValueWriters$ListWriter.nonNullWrite(SparkOrcValueWriters.java:116)
    at org.apache.iceberg.orc.OrcValueWriter.write(OrcValueWriter.java:42)
    at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.write(GenericOrcWriters.java:483)
    at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.writeRow(GenericOrcWriters.java:476)
    at org.apache.iceberg.spark.data.SparkOrcWriter.write(SparkOrcWriter.java:60)
    at org.apache.iceberg.spark.data.SparkOrcWriter.write(SparkOrcWriter.java:46)
    at org.apache.iceberg.orc.OrcFileAppender.add(OrcFileAppender.java:83)
    at org.apache.iceberg.io.DataWriter.write(DataWriter.java:61)
    at org.apache.iceberg.io.ClusteredWriter.write(ClusteredWriter.java:103)
    at org.apache.iceberg.io.ClusteredDataWriter.write(ClusteredDataWriter.java:34)
    at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.write(SparkWrite.java:629)
    at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.write(SparkWrite.java:604)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:416)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$$Lambda$1166/1819967781.apply(Unknown Source)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1504)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec$$Lambda$716/86102097.apply(Unknown Source)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)

RussellSpitzer commented 1 year ago

Have you tried with more executor memory? I know the ORC writer currently doesn't roll over, so you could just be buffering a very large Spark task before writing.

If the issue is solved with more executor memory, you could either keep that setting or attempt to break the write tasks into smaller chunks.
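
As a rough illustration of why nested list columns can need so much heap per batch: the trace in the report fails inside LongColumnVector.ensureSize, reached through a ListWriter nested inside a struct inside another ListWriter. In ORC's column vectors, a list's child vector holds the elements of every row in the batch, so its size scales with rows × outer list length × inner list length. The sketch below is back-of-the-envelope only. The 1024-row batch, the list lengths, the growth factor, and the 9 bytes per element (an 8-byte long plus a 1-byte isNull flag) are assumptions for illustration, not values taken from this job.

```java
// Back-of-the-envelope estimate of the heap held by one innermost long
// column vector for a list<struct<list<long>>> column. All inputs are
// assumptions for illustration; they are not measured from the reported job.
class OrcVectorHeapEstimate {
    // Assumed per-element cost: 8 bytes for the long value plus 1 byte for
    // the isNull flag (typical JVM boolean[] layout, stated as an assumption).
    static final long BYTES_PER_LONG_ELEMENT = 8 + 1;

    // rowsPerBatch:  rows buffered in one vectorized batch (assumed)
    // outerListLen:  average length of the outer list per row (assumed)
    // innerListLen:  average length of the inner list per struct (assumed)
    // growthFactor:  over-allocation multiplier applied when the vector grows (assumed)
    static long estimateBytes(long rowsPerBatch, long outerListLen,
                              long innerListLen, long growthFactor) {
        // Total innermost elements across the whole batch.
        long elements = Math.multiplyExact(
                Math.multiplyExact(rowsPerBatch, outerListLen), innerListLen);
        // Capacity actually allocated once the growth factor is applied.
        long capacity = Math.multiplyExact(elements, growthFactor);
        return Math.multiplyExact(capacity, BYTES_PER_LONG_ELEMENT);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1024, 100, 100, 3);
        System.out.printf("approx. %.1f MiB for one nested long column%n",
                bytes / (1024.0 * 1024.0));
    }
}
```

Under these assumptions a single nested column already accounts for a few hundred MiB per open writer, and each additional nested column or concurrently open file adds to that. This is one reason why more executor memory, or smaller write tasks with shorter lists per batch, can make the error go away.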

chenwyi2 commented 1 year ago

With 10G of executor memory, the write succeeds. Compared with Parquet, ORC seems to use more executor memory. Will the ORC writer support rolling over in order to reduce memory consumption?

RussellSpitzer commented 1 year ago

Pull requests are welcome :) If you would like to add in an ORC Rolling File Writer I think that would be appreciated. I don't know of many ORC users so I think that's why we haven't seen it added yet.

chenwyi2 commented 1 year ago

It seems that https://github.com/apache/iceberg/pull/3784/ has already added a rolling writer for ORC, so maybe I should upgrade my Iceberg version. Thanks.

chenwyi2 commented 1 year ago

But when I use Flink to write ORC, it still fails with a Java heap space error:

cause: java.lang.OutOfMemoryError: Java heap space
    at org.apache.iceberg.shaded.org.apache.orc.storage.ql.exec.vector.LongColumnVector.ensureSize(LongColumnVector.java:314)
    at org.apache.iceberg.shaded.org.apache.orc.storage.ql.exec.vector.StructColumnVector.ensureSize(StructColumnVector.java:136)
    at org.apache.iceberg.flink.data.FlinkOrcWriters.growColumnVector(FlinkOrcWriters.java:314)
    at org.apache.iceberg.flink.data.FlinkOrcWriters.access$500(FlinkOrcWriters.java:47)
    at org.apache.iceberg.flink.data.FlinkOrcWriters$ListWriter.nonNullWrite(FlinkOrcWriters.java:234)
    at org.apache.iceberg.flink.data.FlinkOrcWriters$ListWriter.nonNullWrite(FlinkOrcWriters.java:217)
    at org.apache.iceberg.orc.OrcValueWriter.write(OrcValueWriter.java:41)
    at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.write(GenericOrcWriters.java:509)
    at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.nonNullWrite(GenericOrcWriters.java:495)
    at org.apache.iceberg.orc.OrcValueWriter.write(OrcValueWriter.java:41)
    at org.apache.iceberg.flink.data.FlinkOrcWriters$ListWriter.nonNullWrite(FlinkOrcWriters.java:238)
    at org.apache.iceberg.flink.data.FlinkOrcWriters$ListWriter.nonNullWrite(FlinkOrcWriters.java:217)
    at org.apache.iceberg.orc.OrcValueWriter.write(OrcValueWriter.java:41)
    at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.write(GenericOrcWriters.java:509)
    at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.writeRow(GenericOrcWriters.java:502)
    at org.apache.iceberg.flink.data.FlinkOrcWriter.write(FlinkOrcWriter.java:54)
    at org.apache.iceberg.flink.data.FlinkOrcWriter.write(FlinkOrcWriter.java:38)
    at org.apache.iceberg.orc.OrcFileAppender.add(OrcFileAppender.java:96)
    at org.apache.iceberg.io.DataWriter.write(DataWriter.java:71)
    at org.apache.iceberg.io.BaseTaskWriter$RollingFileWriter.write(BaseTaskWriter.java:362)
    at org.apache.iceberg.io.BaseTaskWriter$RollingFileWriter.write(BaseTaskWriter.java:345)
    at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.write(BaseTaskWriter.java:277)
    at org.apache.iceberg.io.PartitionedFanoutWriter.write(PartitionedFanoutWriter.java:68)
    at org.apache.iceberg.flink.sink.IcebergStreamWriter.processElement(IcebergStreamWriter.java:97)
    at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:82)
    at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:57)
    at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
    at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56)
    at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29)
    at StreamExecCalc$208.processElement_split12(Unknown Source)
    at StreamExecCalc$208.processElement(Unknown Source)
    at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:82)

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 month ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
