facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

OrderBy OOM in getOutput stage. #10940

Closed. JkSelf closed this issue 2 days ago.

JkSelf commented 1 week ago

Bug description

When running the 2 TB TPC-DS benchmark, we found that Q72 failed with the following OOM exception in the getOutput stage.


W0906 20:59:03.017148 3105316 Operator.cpp:644] Can't reclaim from memory pool op.1.0.0.OrderBy which is under non-reclaimable section, memory usage: 5.61GB, reservation: 5.61GB

Reason: Operator::getOutput failed for [operator: OrderBy, plan node ID: 1]: Error during calling Java code from native code: org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 0.0 B. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled). 
Current config settings: 
    spark.gluten.memory.offHeap.size.in.bytes=45.0 GiB
    spark.gluten.memory.task.offHeap.size.in.bytes=5.6 GiB
    spark.gluten.memory.conservative.task.offHeap.size.in.bytes=2.8 GiB
    spark.memory.offHeap.enabled=true
    spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
Memory consumer stats: 
    Task.13546:                                                Current used bytes: 5.6 GiB, peak bytes:        N/A
    \- Gluten.Tree.129:                                        Current used bytes: 5.6 GiB, peak bytes:    5.6 GiB
       \- root.129:                                            Current used bytes: 5.6 GiB, peak bytes:    5.6 GiB
          +- WholeStageIterator.129:                           Current used bytes: 5.6 GiB, peak bytes:    5.6 GiB
          |  \- single:                                        Current used bytes: 5.6 GiB, peak bytes:    5.6 GiB
          |     +- root:                                       Current used bytes: 5.6 GiB, peak bytes:    5.6 GiB
          |     |  +- task.Gluten_Stage_98_TID_13546_VTID_129: Current used bytes: 5.6 GiB, peak bytes:    5.6 GiB
          |     |  |  +- node.1:                               Current used bytes: 5.6 GiB, peak bytes:    5.6 GiB
          |     |  |  |  \- op.1.0.0.OrderBy:                  Current used bytes: 5.6 GiB, peak bytes:    5.6 GiB
          |     |  |  +- node.3:                               Current used bytes: 768.0 B, peak bytes: 1024.0 KiB
          |     |  |  |  \- op.3.1.0.OrderBy:                  Current used bytes: 768.0 B, peak bytes:   64.8 KiB
          |     |  |  +- node.5:                               Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  |  \- op.5.0.0.FilterProject:            Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  +- node.6:                               Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  |  \- op.6.0.0.FilterProject:            Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  +- node.4:                               Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  |  \- op.4.0.0.MergeJoin:                Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  +- node.2:                               Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  |  \- op.2.1.0.ValueStream:              Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  +- node.0:                               Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  |  \- op.0.0.0.ValueStream:              Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |  \- node.N/A:                             Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  |     \- op.N/A.1.0.CallbackSink:           Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  \- default_leaf:                            Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     \- gluten::MemoryAllocator:                    Current used bytes:   0.0 B, peak bytes:      0.0 B
          +- ArrowContextInstance.224:                         Current used bytes: 8.0 MiB, peak bytes:    8.0 MiB
          +- VeloxBatchResizer.129:                            Current used bytes:   0.0 B, peak bytes:      0.0 B
          |  \- single:                                        Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     +- root:                                       Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  \- default_leaf:                            Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     \- gluten::MemoryAllocator:                    Current used bytes:   0.0 B, peak bytes:      0.0 B
          +- ArrowContextInstance.225:                         Current used bytes:   0.0 B, peak bytes:      0.0 B
          +- ShuffleReader.96:                                 Current used bytes:   0.0 B, peak bytes:   16.0 MiB
          |  \- single:                                        Current used bytes:   0.0 B, peak bytes:   16.0 MiB
          |     +- gluten::MemoryAllocator:                    Current used bytes:   0.0 B, peak bytes: 1694.5 KiB
          |     \- root:                                       Current used bytes:   0.0 B, peak bytes:    2.0 MiB
          |        \- default_leaf:                            Current used bytes:   0.0 B, peak bytes: 1328.0 KiB
          +- IndicatorVectorBase#init.128:                     Current used bytes:   0.0 B, peak bytes:      0.0 B
          |  \- single:                                        Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     +- gluten::MemoryAllocator:                    Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     \- root:                                       Current used bytes:   0.0 B, peak bytes:      0.0 B
          |        \- default_leaf:                            Current used bytes:   0.0 B, peak bytes:      0.0 B
          +- VeloxBatchResizer.129.OverAcquire.0:              Current used bytes:   0.0 B, peak bytes:      0.0 B
          +- IndicatorVectorBase#init.128.OverAcquire.0:       Current used bytes:   0.0 B, peak bytes:      0.0 B
          +- ShuffleReader.96.OverAcquire.0:                   Current used bytes:   0.0 B, peak bytes:    4.8 MiB
          +- ShuffleWriter.129.OverAcquire.0:                  Current used bytes:   0.0 B, peak bytes:      0.0 B
          +- ShuffleWriter.129:                                Current used bytes:   0.0 B, peak bytes:      0.0 B
          |  \- single:                                        Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     +- root:                                       Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     |  \- default_leaf:                            Current used bytes:   0.0 B, peak bytes:      0.0 B
          |     \- gluten::MemoryAllocator:                    Current used bytes:   0.0 B, peak bytes:      0.0 B
          \- WholeStageIterator.129.OverAcquire.0:             Current used bytes:   0.0 B, peak bytes: 1327.2 MiB

    at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:105)
    at org.apache.gluten.memory.listener.ManagedReservationListener.reserve(ManagedReservationListener.java:49)
    at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
    at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:61)
    at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
    at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
    at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
    at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
    at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
    at org.apache.gluten.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
    at org.apache.gluten.vectorized.GeneralInIterator.hasNext(GeneralInIterator.java:31)
    at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
    at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:61)
    at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
    at org.apache.gluten.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
    at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
    at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:122)
    at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:239)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

System information

Velox System Info v0.0.2
Commit: 24e13a142d18e0c78c827052f56a84f09c2ecaa8
CMake Version: 3.28.3
System: Linux-5.4.0-189-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr/local/lib/python3.8/dist-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt


Relevant logs

No response

pedroerp commented 1 week ago

@xiaoxmeng @tanjialiang @aditi-pandit is this the expected behavior, meaning the query simply ran out of memory? Alternatively, don't we already support spilling for OrderBy?

aditi-pandit commented 1 week ago

@pedroerp : OrderBy does have spilling support. My understanding is that Gluten doesn't use Velox memory arbitration or spilling and has its own custom implementation.

We have been able to run this query successfully on Prestissimo clusters. We didn't run 2 TB specifically, but we did run 1, 10, and 100 TB.

xiaoxmeng commented 1 week ago

@pedroerp @aditi-pandit Velox does support order-by spill, and Gluten also supports memory arbitration, which connects Velox memory management with the Spark memory manager (Gluten has its own arbitrator implementation, from @zhztheplayer, that follows the Velox memory arbitrator interface). We discussed this issue offline with @JkSelf and @zhztheplayer: it happens because we don't reserve memory before getting output from the sort buffer, so memory arbitration is triggered while getOutput is producing output from the sort buffer. We need to reserve memory before getOutput processing, as we already do for hash aggregation and hash probe. @JkSelf will help with this, and @zhztheplayer will help add the native call stack on failure to ease debugging of similar issues.
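
A minimal sketch of the reservation pattern described above, not the actual patch: the helper name ensureSortOutputFits, the estimatedOutputBatchBytes parameter, and the 1.2x headroom factor are hypothetical, and memory::MemoryPool::maybeReserve() plus memory::ReclaimableSectionGuard are assumed to be the relevant Velox primitives. The real change is in PR #10958.

    // Hedged sketch only: reserve memory for the next output batch before the
    // non-reclaimable getOutput path, mirroring the ensureOutputFits() pattern
    // mentioned above for hash aggregation and hash probe.
    #include <glog/logging.h>

    #include "velox/common/base/Portability.h"
    #include "velox/common/memory/MemoryArbitrator.h"
    #include "velox/common/memory/MemoryPool.h"

    namespace facebook::velox::exec {

    // Hypothetical helper, called before producing output from the sort buffer.
    void ensureSortOutputFits(
        memory::MemoryPool* pool,
        tsan_atomic<bool>& nonReclaimableSection,
        uint64_t estimatedOutputBatchBytes) {
      // Reserve with some headroom so a slightly larger batch still fits.
      const uint64_t bytesToReserve = estimatedOutputBatchBytes * 1.2;

      // Temporarily clear the non-reclaimable flag so that, if this reservation
      // triggers memory arbitration, the arbitrator is allowed to reclaim
      // (spill) from this operator instead of failing with
      // "Can't reclaim ... under non-reclaimable section".
      memory::ReclaimableSectionGuard guard(&nonReclaimableSection);
      if (!pool->maybeReserve(bytesToReserve)) {
        // Best effort: getOutput() proceeds and may still hit the hard limit,
        // but arbitration has had a chance to free memory first.
        LOG(WARNING) << "Failed to reserve " << bytesToReserve
                     << " bytes for sort output in pool " << pool->name();
      }
    }

    } // namespace facebook::velox::exec

In the actual fix the reservation is wired into the OrderBy/sort-buffer output path; see PR #10958 (linked below) for the real change.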

pedroerp commented 6 days ago

Great, thanks @xiaoxmeng for diagnosing the issue.

@JkSelf I assigned this Issue to you based on the comment above.

JkSelf commented 2 days ago

Fixed in https://github.com/facebookincubator/velox/pull/10958

pedroerp commented 10 hours ago

Thank you @JkSelf