[VL] A case that should spill but hits OOM

Yohahaha commented 8 months ago

Backend

VL (Velox)

Bug description

In the case below, I would expect UnsafeExternalSorter to spill, but the task fails with an OOM instead.

Caused by: java.lang.RuntimeException: Error during calling Java code from native code: io.glutenproject.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8388608, granted: 8126464. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application. 
Current config settings: 
    spark.gluten.memory.offHeap.size.in.bytes=12.0 GiB
    spark.gluten.memory.task.offHeap.size.in.bytes=3.0 GiB
    spark.gluten.memory.conservative.task.offHeap.size.in.bytes=1536.0 MiB
Memory consumer stats: 
    Task.241798:                                                                   Current used bytes:   3.0 GiB, peak bytes:        N/A
    +- org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@4479ef33: Current used bytes:   2.1 GiB, peak bytes:        N/A
    \- Gluten.Tree.18:                                                             Current used bytes: 952.0 MiB, peak bytes: 1048.0 MiB
       \- root.18:                                                                 Current used bytes: 952.0 MiB, peak bytes: 1048.0 MiB
          +- ShuffleReader.18:                                                     Current used bytes: 560.0 MiB, peak bytes:  600.0 MiB
          |  \- single:                                                            Current used bytes: 256.0 MiB, peak bytes:  272.0 MiB
          |     \- ShuffleReader_default_leaf:                                     Current used bytes: 248.1 MiB, peak bytes:  270.3 MiB
          +- ColumnarToRow.18:                                                     Current used bytes: 384.0 MiB, peak bytes:  656.0 MiB
          |  \- single:                                                            Current used bytes: 384.0 MiB, peak bytes:  640.0 MiB
          |     \- ColumnarToRow_default_leaf:                                     Current used bytes: 384.0 MiB, peak bytes:  640.0 MiB
          +- ArrowContextInstance.18:                                              Current used bytes:   8.0 MiB, peak bytes:    8.0 MiB
          +- OverAcquire.DummyTarget.52:                                           Current used bytes:     0.0 B, peak bytes:      0.0 B
          +- WholeStageIterator.18:                                                Current used bytes:     0.0 B, peak bytes:      0.0 B
          |  \- single:                                                            Current used bytes:     0.0 B, peak bytes:      0.0 B
          |     +- WholeStageIterator_default_leaf:                                Current used bytes:     0.0 B, peak bytes:      0.0 B
          |     \- task.Gluten_Stage_13_TID_241798:                                Current used bytes:     0.0 B, peak bytes:      0.0 B
          |        +- node.1:                                                      Current used bytes:     0.0 B, peak bytes:      0.0 B
          |        |  \- op.1.0.0.FilterProject:                                   Current used bytes:     0.0 B, peak bytes:      0.0 B
          |        \- node.0:                                                      Current used bytes:     0.0 B, peak bytes:      0.0 B
          |           \- op.0.0.0.ValueStream:                                     Current used bytes:     0.0 B, peak bytes:      0.0 B
          +- OverAcquire.DummyTarget.56:                                           Current used bytes:     0.0 B, peak bytes:      0.0 B
          \- OverAcquire.DummyTarget.55:                                           Current used bytes:     0.0 B, peak bytes:      0.0 B
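
For reference, the figures in the error above can be sanity-checked with a few lines of Python. This is only a sketch: the values are copied from the log, the GiB/MiB figures are rounded as printed there, and the variable names are made up for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Values copied from the error message above.
acquired = 8_388_608           # bytes requested by the native side
granted = 8_126_464            # bytes actually granted
task_limit = 3.0 * GIB         # spark.gluten.memory.task.offHeap.size.in.bytes
sorter_used = 2.1 * GIB        # UnsafeExternalSorter (rounded in the log)
gluten_used = 952.0 * MIB      # Gluten.Tree.18

shortfall = acquired - granted

print(f"requested {acquired / MIB} MiB, granted {granted / MIB} MiB")
print(f"shortfall: {shortfall // 1024} KiB")
print(f"sorter share of the task limit: {sorter_used / task_limit:.0%}")
print(f"sorter + Gluten tree: {(sorter_used + gluten_used) / GIB:.2f} GiB")
```

The request is short by only a fraction of a MiB, while the vanilla sorter holds most of the task's off-heap budget, which is why a spill of the sorter would have been enough to satisfy it.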

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

FelixYBW commented 8 months ago

@zhztheplayer It looks like this is caused by the external sort and the native code competing for off-heap memory.

zhztheplayer commented 8 months ago

Would you like to share the explain result?

Some debugging is also needed. There are several conditions under which UnsafeExternalSorter does not spill: https://github.com/apache/spark/blob/3793c2f44099ccd10ed23e6a5d4c63734c788e6f/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L203
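
To make those conditions concrete, here is a simplified Python model of the kind of guards `UnsafeExternalSorter.spill(size, trigger)` applies. It is a sketch from memory, not a port: `SorterModel` and its fields are hypothetical names, and the two guards shown (request triggered by a different consumer, and no buffered records) are assumptions that should be verified against the linked revision.

```python
from dataclasses import dataclass


@dataclass
class SorterModel:
    """Hypothetical stand-in for UnsafeExternalSorter; not Spark's real class."""
    num_records: int
    used_bytes: int
    has_reading_iterator: bool = False

    def spill(self, size: int, trigger: object) -> int:
        """Return the number of bytes freed, following the assumed guards."""
        # Guard 1 (assumed): a request triggered by another consumer only
        # frees memory if the sorter is already being read via an iterator.
        if trigger is not self:
            if self.has_reading_iterator:
                return self._release()
            return 0
        # Guard 2 (assumed): nothing is spilled when no records are buffered.
        if self.num_records <= 0:
            return 0
        return self._release()

    def _release(self) -> int:
        # Simplification: pretend everything held in memory goes to disk.
        freed = self.used_bytes
        self.used_bytes = 0
        self.num_records = 0
        return freed


other = object()  # stands in for a different consumer, e.g. a native one
sorter = SorterModel(num_records=1000, used_bytes=2 * 1024 ** 3)
print(sorter.spill(8 * 1024 ** 2, trigger=other))   # request from another consumer
print(sorter.spill(8 * 1024 ** 2, trigger=sorter))  # request from the sorter itself
```

In this model, which consumer triggered the request decides whether any memory is released, so that is the first thing worth checking in a debugger.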

Yohahaha commented 8 months ago

> Would you like to share the explain result?
>
> Some debugging is also needed. There are several conditions under which UnsafeExternalSorter does not spill: https://github.com/apache/spark/blob/3793c2f44099ccd10ed23e6a5d4c63734c788e6f/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L203

It's a customer's case: a dynamic partition write inserts a sort when needed. I checked the Spark UI with Gluten disabled, and the sort does spill there.

zhztheplayer commented 8 months ago

I see.

By the current design, the TreeMemoryConsumer we use should be able to spill vanilla Spark operators. See the code at https://github.com/oap-project/gluten/blob/6d7d01fe0a433b82d3f216c992d244eb4686949a/gluten-core/src/main/java/io/glutenproject/memory/memtarget/spark/TreeMemoryConsumer.java#L65-L73. Something might be going wrong during that procedure.
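
The intended behaviour can be sketched as a toy model of cooperative spilling. This is a minimal illustration: `Consumer`, `TaskMemory`, and the largest-first spill order are hypothetical simplifications, not the actual logic of Spark's task memory manager or Gluten's TreeMemoryConsumer, and the step where the requester spills itself is left out.

```python
MIB = 1024 ** 2


class Consumer:
    """Hypothetical memory consumer; names are illustrative only."""

    def __init__(self, name, used, spillable=True):
        self.name, self.used, self.spillable = name, used, spillable

    def spill(self, size):
        """Free up to `size` bytes if able to spill; return the bytes freed."""
        if not self.spillable:
            return 0
        freed = min(size, self.used)
        self.used -= freed
        return freed


class TaskMemory:
    """Toy per-task pool with a fixed limit shared by all consumers."""

    def __init__(self, limit, consumers):
        self.limit, self.consumers = limit, list(consumers)

    def free(self):
        return self.limit - sum(c.used for c in self.consumers)

    def acquire(self, requester, size):
        """Grant up to `size` bytes, asking other consumers to spill first."""
        for other in sorted(self.consumers, key=lambda c: c.used, reverse=True):
            if self.free() >= size:
                break
            if other is not requester:
                other.spill(size - self.free())
        granted = max(0, min(size, self.free()))
        requester.used += granted
        return granted


def run(sorter_spillable):
    # Sizes loosely mirror the report: the pool is full before the request.
    sorter = Consumer("sorter", used=2150 * MIB, spillable=sorter_spillable)
    native = Consumer("native", used=922 * MIB)
    task = TaskMemory(limit=3072 * MIB, consumers=[sorter, native])
    return task.acquire(native, 8 * MIB)


print(run(sorter_spillable=True) // MIB)   # sorter cooperates
print(run(sorter_spillable=False) // MIB)  # sorter declines to spill
```

When the sorter cooperates, the native request is met in full; when it declines, nothing can be granted. That second outcome matches the symptom in the report, though the thread has not yet established that this is what actually happens.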

yixi-gu commented 8 months ago

> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 10598.0 failed 1 times, most recent failure: Lost task 9.0 in stage 10598.0 (TID 770839) (node2 executor 7): java.lang.RuntimeException: Error during calling Java code from native code: io.glutenproject.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8388608, granted: 0. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application.
> Current config settings:
>         spark.gluten.memory.offHeap.size.in.bytes=10.0 GiB
>         spark.gluten.memory.task.offHeap.size.in.bytes=1462.9 MiB
>         spark.gluten.memory.conservative.task.offHeap.size.in.bytes=731.4 MiB
> Memory consumer stats:
>         Task.770839:                                       Current used bytes: 2.2 GiB, peak bytes:        N/A
>         \- Gluten.Tree.28589:                              Current used bytes: 2.2 GiB, peak bytes:    2.9 GiB
>            \- root.28589:                                  Current used bytes: 2.2 GiB, peak bytes:    2.9 GiB
>               +- WholeStageIterator.31330:                 Current used bytes: 2.2 GiB, peak bytes:    2.2 GiB
>               |  \- single:                                Current used bytes: 2.2 GiB, peak bytes:    2.2 GiB
>               |     +- task.Gluten_Stage_10598_TID_770839: Current used bytes: 2.2 GiB, peak bytes:    2.2 GiB
>               |     |  +- node.1:                          Current used bytes: 2.2 GiB, peak bytes:    2.2 GiB
>               |     |  |  \- op.1.0.0.OrderBy:             Current used bytes: 5.1 MiB, peak bytes:    2.0 GiB
>               |     |  \- node.0:                          Current used bytes:   0.0 B, peak bytes:      0.0 B
>               |     |     \- op.0.0.0.ValueStream:         Current used bytes:   0.0 B, peak bytes:      0.0 B
>               |     \- WholeStageIterator_default_leaf:    Current used bytes:   0.0 B, peak bytes:      0.0 B
>               +- ArrowContextInstance.20857:               Current used bytes: 8.0 MiB, peak bytes:    8.0 MiB
>               +- OverAcquire.DummyTarget.81792:            Current used bytes:   0.0 B, peak bytes:      0.0 B
>               +- ShuffleReader.7022:                       Current used bytes:   0.0 B, peak bytes:    8.0 MiB
>               |  \- single:                                Current used bytes:   0.0 B, peak bytes:    2.0 MiB
>               |     \- ShuffleReader_default_leaf:         Current used bytes:   0.0 B, peak bytes: 1920.0 KiB
>               +- ColumnarToRow.142:                        Current used bytes:   0.0 B, peak bytes:      0.0 B
>               |  \- single:                                Current used bytes:   0.0 B, peak bytes:      0.0 B
>               |     \- ColumnarToRow_default_leaf:         Current used bytes:   0.0 B, peak bytes:      0.0 B
>               +- OverAcquire.DummyTarget.78951:            Current used bytes:   0.0 B, peak bytes:    2.4 MiB
>               \- OverAcquire.DummyTarget.78950:            Current used bytes:   0.0 B, peak bytes:  676.8 MiB
>         at io.glutenproject.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:90)
>         at io.glutenproject.memory.nmm.ManagedReservationListener.reserve(ManagedReservationListener.java:43)
>         at io.glutenproject.vectorized.NativeColumnarToRowJniWrapper.nativeColumnarToRowConvert(Native Method)
>         at io.glutenproject.execution.VeloxColumnarToRowExec$$anon$1.next(VeloxColumnarToRowExec.scala:138)
>         at io.glutenproject.execution.VeloxColumnarToRowExec$$anon$1.next(VeloxColumnarToRowExec.scala:104)

Yohahaha commented 8 months ago

@yixi-gu Your case is a native OrderBy that doesn't spill, which is a different problem from mine.
