Open Yohahaha opened 8 months ago
@zhztheplayer Looks like it's caused by off-heap memory competition between the external sort and native code.
Would you like to share the explain result?
We should also do some debugging. There are several conditions that cause UnsafeExternalSorter not to spill: https://github.com/apache/spark/blob/3793c2f44099ccd10ed23e6a5d4c63734c788e6f/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L203
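To make those conditions concrete, here is a small Python model of the spill decision as I read the linked method. It is only a sketch: `SorterState`, its fields, and the byte counts are hypothetical stand-ins, and the real logic should be checked against the linked revision of `UnsafeExternalSorter.spill`.

```python
from dataclasses import dataclass


@dataclass
class SorterState:
    """Hypothetical stand-in for the sorter fields the spill path inspects."""
    in_memory_records: int      # records buffered in the in-memory sorter
    in_memory_bytes: int        # bytes held by those records
    has_reading_iterator: bool  # True once sorted output is being consumed


def spill(state: SorterState, triggered_by_self: bool) -> int:
    """Return the number of bytes a spill request frees in this model."""
    if not triggered_by_self:
        # Assumption: a request coming from another memory consumer can only
        # be served through the reading iterator; otherwise nothing is freed.
        if state.has_reading_iterator:
            return state.in_memory_bytes
        return 0
    if state.in_memory_records <= 0:
        # Nothing buffered yet, so there is nothing to write to disk.
        return 0
    return state.in_memory_bytes
```

In this model, a sorter that is still inserting records (no reading iterator) frees nothing when a different consumer, such as native code, asks it to spill.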
It's our customer's case. A dynamic partition write will insert a sort if needed. I checked the UI with Gluten disabled, and the sort does spill.
I see.
The TreeMemoryConsumer we are using should be able to spill vanilla Spark's operators by current design. See the code at https://github.com/oap-project/gluten/blob/6d7d01fe0a433b82d3f216c992d244eb4686949a/gluten-core/src/main/java/io/glutenproject/memory/memtarget/spark/TreeMemoryConsumer.java#L65-L73. There might be something wrong during the procedure.
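For anyone following along, the cooperative spilling being discussed can be sketched with a toy memory pool. This is a simplified model under my own assumptions, not Spark's TaskMemoryManager or Gluten's TreeMemoryConsumer; `Consumer`, `Pool`, and the sizes are hypothetical.

```python
class Consumer:
    """Hypothetical memory consumer that may or may not honor spill requests."""

    def __init__(self, name, used, spillable):
        self.name = name
        self.used = used
        self.spillable = spillable

    def spill(self, size, trigger):
        # A consumer that refuses to spill for others frees nothing.
        if not self.spillable:
            return 0
        freed = min(size, self.used)
        self.used -= freed
        return freed


class Pool:
    """Toy fixed-size execution memory pool shared by several consumers."""

    def __init__(self, capacity, consumers):
        self.capacity = capacity
        self.consumers = consumers

    def free(self):
        return self.capacity - sum(c.used for c in self.consumers)

    def acquire(self, size, requester):
        granted = min(size, self.free())
        if granted < size:
            # Ask the other consumers to spill, largest first.
            others = [c for c in self.consumers if c is not requester]
            for c in sorted(others, key=lambda c: c.used, reverse=True):
                c.spill(size - granted, requester)
                granted = min(size, self.free())
                if granted >= size:
                    break
        requester.used += granted
        return granted
```

With a full pool, a request from the native side is granted only if the sorter actually frees memory. If the sorter returns 0 from its spill call, the request ends up with nothing granted, which is the shape of the "Acquired: ..., granted: 0" error in this thread.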
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 10598.0 failed 1 times, most recent failure: Lost task 9.0 in stage 10598.0 (TID 770839) (node2 executor 7): java.lang.RuntimeException: Error during calling Java code from native code: io.glutenproject.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8388608, granted: 0. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application.
> Current config settings:
> spark.gluten.memory.offHeap.size.in.bytes=10.0 GiB
> spark.gluten.memory.task.offHeap.size.in.bytes=1462.9 MiB
> spark.gluten.memory.conservative.task.offHeap.size.in.bytes=731.4 MiB
> Memory consumer stats:
> Task.770839: Current used bytes: 2.2 GiB, peak bytes: N/A
> \- Gluten.Tree.28589: Current used bytes: 2.2 GiB, peak bytes: 2.9 GiB
> \- root.28589: Current used bytes: 2.2 GiB, peak bytes: 2.9 GiB
> +- WholeStageIterator.31330: Current used bytes: 2.2 GiB, peak bytes: 2.2 GiB
> | \- single: Current used bytes: 2.2 GiB, peak bytes: 2.2 GiB
> | +- task.Gluten_Stage_10598_TID_770839: Current used bytes: 2.2 GiB, peak bytes: 2.2 GiB
> | | +- node.1: Current used bytes: 2.2 GiB, peak bytes: 2.2 GiB
> | | | \- op.1.0.0.OrderBy: Current used bytes: 5.1 MiB, peak bytes: 2.0 GiB
> | | \- node.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
> | | \- op.0.0.0.ValueStream: Current used bytes: 0.0 B, peak bytes: 0.0 B
> | \- WholeStageIterator_default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
> +- ArrowContextInstance.20857: Current used bytes: 8.0 MiB, peak bytes: 8.0 MiB
> +- OverAcquire.DummyTarget.81792: Current used bytes: 0.0 B, peak bytes: 0.0 B
> +- ShuffleReader.7022: Current used bytes: 0.0 B, peak bytes: 8.0 MiB
> | \- single: Current used bytes: 0.0 B, peak bytes: 2.0 MiB
> | \- ShuffleReader_default_leaf: Current used bytes: 0.0 B, peak bytes: 1920.0 KiB
> +- ColumnarToRow.142: Current used bytes: 0.0 B, peak bytes: 0.0 B
> | \- single: Current used bytes: 0.0 B, peak bytes: 0.0 B
> | \- ColumnarToRow_default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
> +- OverAcquire.DummyTarget.78951: Current used bytes: 0.0 B, peak bytes: 2.4 MiB
> \- OverAcquire.DummyTarget.78950: Current used bytes: 0.0 B, peak bytes: 676.8 MiB
> at io.glutenproject.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:90)
> at io.glutenproject.memory.nmm.ManagedReservationListener.reserve(ManagedReservationListener.java:43)
> at io.glutenproject.vectorized.NativeColumnarToRowJniWrapper.nativeColumnarToRowConvert(Native Method)
> at io.glutenproject.execution.VeloxColumnarToRowExec$$anon$1.next(VeloxColumnarToRowExec.scala:138)
> at io.glutenproject.execution.VeloxColumnarToRowExec$$anon$1.next(VeloxColumnarToRowExec.scala:104)
@yixi-gu your case is that the native order by doesn't spill, which is not my case.
Backend
VL (Velox)
Bug description
I suppose that in the case below, UnsafeExternalSorter should spill.
Spark version
None