
[VL] Memory pool issue when enabling the multi-threaded I/O pool #3364

Open zhli1142015 opened 1 year ago

zhli1142015 commented 1 year ago

Backend

VL (Velox)

Bug description

A crash happens on the second run. Two errors are observed in the logs:

23/10/10 07:48:50 WARN [Gluten] NativeMemoryManager: WholeStageIterator Reservation listener still reserved non-zero bytes, which may cause memory leak, size: 10.4 MiB
E20231010 07:48:50.792629 23453 Exceptions.h:69] Line: /var/git/Velox/velox/common/memory/Memory.cpp:181, Function:dropPool, Expression:  The dropped memory pool WholeStageIterator_root not found, Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: The dropped memory pool WholeStageIterator_root not found
Retriable: False
Function: dropPool
File: /var/git/Velox/velox/common/memory/Memory.cpp
Line: 181
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox6memory13MemoryManager8dropPoolEPNS1_10MemoryPoolE
# 4  _ZN8facebook5velox6memory14MemoryPoolImplD2Ev
# 5  _ZNSt23_Sp_counted_ptr_inplaceIN8facebook5velox4core8QueryCtxESaIS3_ELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv
# 6  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv
# 7  _ZN8facebook5velox4exec4TaskD2Ev
# 8  _ZNSt15_Sp_counted_ptrIPN8facebook5velox4exec4TaskELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv
# 9  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv
# 10 _ZN5folly6detail8function13DispatchSmall4execIZZN8facebook5velox4exec9TableScan12checkPreloadEvENKUlSt10shared_ptrINS5_9connector14ConnectorSplitEEE_clESB_EUlvE_EEmNS1_2OpEPNS1_4DataESG_
# 11 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE
# 12 _ZN5folly6detail8function14FunctionTraitsIFvvEE7callBigIZNS_20IOThreadPoolExecutor3addENS_8FunctionIS3_EENSt6chrono8durationIlSt5ratioILl1ELl1000EEEES8_EUlvE_EEvRNS1_4DataE
# 13 _ZN5folly23AtomicNotificationQueueINS_8FunctionIFvvEEEE5driveIRNS_9EventBase10FuncRunnerEEEbOT_
# 14 _ZThn40_N5folly32EventBaseAtomicNotificationQueueINS_8FunctionIFvvEEENS_9EventBase10FuncRunnerEE12handlerReadyEt
# 15 event_process_active_single_queue.isra.0
# 16 event_base_loop
# 17 _ZN5folly9EventBase8loopMainEib
# 18 _ZN5folly9EventBase8loopBodyEib
# 19 _ZN5folly9EventBase4loopEv
# 20 _ZN5folly9EventBase11loopForeverEv
# 21 _ZN5folly20IOThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE
# 22 _ZN5folly6detail8function14FunctionTraitsIFvvEE9callSmallISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS7_6ThreadEEEPS7_SA_EEEEvRNS1_4DataE
# 23 execute_native_thread_routine
# 24 start_thread
# 25 clone

java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: ListenableArbitrator should only be used within a single root pool
Retriable: False
Function: runInternal
File: /var/git/Velox/velox/exec/Driver.cpp
Line: 421
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE.cold
# 4  _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 5  _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 6  _ZN6gluten24WholeStageResultIterator4nextEv
# 7  Java_io_glutenproject_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 8  0x00007fc949d46ca8

  at io.glutenproject.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
  at io.glutenproject.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:61)
  at io.glutenproject.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:32)
  at io.glutenproject.backendsapi.velox.IteratorHandler$$anon$1.hasNext(IteratorHandler.scala:174)
  at io.glutenproject.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:40)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
  at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:102)
  at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:217)
  at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
  at org.apache.spark.scheduler.Task.run(Task.scala:136)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)

Spark version

Spark-3.3.x

Spark configurations

"spark.sql.broadcastTimeout": "-1", "spark.executor.instances": "8", "spark.driver.memory": "56g", "spark.executor.memory": "10g", "spark.executor.cores": "8", "spark.driver.cores": "8", "spark.plugins": "io.glutenproject.GlutenPlugin", "spark.gluten.sql.columnar.backend.lib": "velox", "spark.memory.offHeap.enabled": "true", "spark.memory.offHeap.size": "53g", "spark.executor.memoryOverhead": "1024", "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager", "spark.gluten.sql.columnar.backend.velox.SplitPreloadPerDriver": "4"

System information

No response

Relevant logs

No response

Yohahaha commented 1 year ago

We found the same issue. The MemoryArbitrator used in Gluten conflicts with split preload: the I/O executor holds a Task reference that may be released after the ExecutionCtx destructor runs, and that Task reference holds a shared reference to the MemoryPool from the QueryCtx.

Actually, it is not specific to q49; any query can potentially reproduce it.
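
To make that lifecycle concrete, here is a minimal standalone sketch (plain C++, not Gluten or Velox code; all names are illustrative) of how a preload job queued on an I/O pool can keep the task, and therefore the query memory pool it co-owns, alive until after the driver side believes everything has been torn down:

    // Illustrative only: the "I/O pool" is a plain std::thread and Task/MemoryPool are stubs.
    #include <chrono>
    #include <iostream>
    #include <memory>
    #include <thread>

    struct MemoryPool {
      ~MemoryPool() { std::cout << "memory pool destroyed on the I/O thread\n"; }
    };

    struct Task {
      std::shared_ptr<MemoryPool> pool = std::make_shared<MemoryPool>();
    };

    int main() {
      std::thread ioThread;
      {
        auto task = std::make_shared<Task>();
        // The preload job captures the task by shared_ptr, extending its lifetime.
        ioThread = std::thread([taskHolder = task] {
          std::this_thread::sleep_for(std::chrono::milliseconds(100));
          // The last Task reference (and the pool it owns) is released here, on the I/O thread.
        });
      }  // driver-side reference is gone; the caller assumes the pool can be dropped now
      std::cout << "driver side finished; the pool is still alive\n";
      ioThread.join();
      return 0;
    }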

zhli1142015 commented 1 year ago

> We found the same issue. The MemoryArbitrator used in Gluten conflicts with split preload: the I/O executor holds a Task reference that may be released after the ExecutionCtx destructor runs, and that Task reference holds a shared reference to the MemoryPool from the QueryCtx.
>
> Actually, it is not specific to q49; any query can potentially reproduce it.

Thanks @Yohahaha, could you point me to the Task ref you mentioned? Maybe we should reset that shared_ptr after data source switching.

Yohahaha commented 1 year ago

https://github.com/facebookincubator/velox/pull/4849/files#diff-b04232f155cc15faca93b69cf34dd80e29b534a4e54dd774bf8bdbedbfc66d6bR262-R265

You can try removing taskHolder.
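
Presumably the capture at the linked lines looks roughly like the following (a hypothetical reconstruction, not the exact Velox source; `taskHolder` is assumed to be a `std::shared_ptr` to the owning Task):

    // Hypothetical reconstruction -- the exact Velox code at the linked lines may differ.
    // Capturing taskHolder keeps the Task (and, through its QueryCtx, the memory pool)
    // alive until the preload lambda runs on the I/O executor.
    executor->add([taskHolder, split]() mutable {
      split->dataSource->prepare();
      split.reset();
    });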

zhli1142015 commented 1 year ago

Thanks, resetting taskHolder resolves the crash, but the memory leak and the second error are still there, so it looks like preload may cause a real memory leak.

Yohahaha commented 1 year ago

I mean deleting the taskHolder variable:

            executor->add([split]() mutable {
              split->dataSource->prepare();
              split.reset();
            });

Could you post the latest failure log?

ulysses-you commented 1 year ago

@Yohahaha do you know the minimal query to reproduce this?

zhli1142015 commented 1 year ago

> I mean deleting the taskHolder variable:
>
>             executor->add([split]() mutable {
>               split->dataSource->prepare();
>               split.reset();
>             });
>
> Could you post the latest failure log?

Yes, removing it solves all the issues, thanks. I think we should open an issue in Velox and fix it there. @Yohahaha, would you like to follow up on this?

Yohahaha commented 1 year ago

> @Yohahaha do you know the minimal query to reproduce this?

I can reproduce it with TPC-H q10 at sf10.

Yohahaha commented 1 year ago

> I mean deleting the taskHolder variable:
>
>             executor->add([split]() mutable {
>               split->dataSource->prepare();
>               split.reset();
>             });
>
> Could you post the latest failure log?
>
> Yes, removing it solves all the issues, thanks. I think we should open an issue in Velox and fix it there. @Yohahaha, would you like to follow up on this?

I'm not sure this is the root cause, and it cannot guarantee that future patches won't introduce such a reference again. The current usage of memory management and the split preload feature in Gluten breaks the Velox design: Velox assumes the MemoryManager is static, but Gluten needs it to be controllable.
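
In rough, self-contained pseudocode (hypothetical types and names, not the real Velox or Gluten memory APIs), the mismatch looks like this: Velox assumes one long-lived, process-wide memory manager, while Gluten binds a root pool to each Spark task, so anything that outlives the task releases into a manager that has already dropped the pool.

    // Illustrative only -- hypothetical types, not the actual Velox/Gluten memory APIs.
    #include <memory>
    #include <string>

    struct MemoryPool { std::string name; };

    struct MemoryManager {
      std::shared_ptr<MemoryPool> addRootPool(std::string name) {
        return std::make_shared<MemoryPool>(MemoryPool{std::move(name)});
      }
    };

    // Velox-style assumption: a single manager that outlives every query in the process.
    MemoryManager& processWideManager() {
      static MemoryManager instance;
      return instance;
    }

    // Gluten-style usage: a root pool whose lifetime is bound to one Spark task; if async
    // work (e.g. split preload) still references it after this scope, the pool is dropped
    // later than the manager expects.
    void runSparkTask() {
      auto rootPool = processWideManager().addRootPool("WholeStageIterator_root");
      // ... build and drive the pipeline ...
    }  // rootPool is expected to be released exactly here

    int main() { runSparkTask(); }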

I would like to open an issue in Velox, but I'm not optimistic that Velox will fix it. GitHub does not show the commit message for PR 4849, so I copied it from the git shell:

> Summary:
> The async prefetch split might outlive its associated task during the async prefetch.
> If that happens, we might see leak in table scan memory pool we found in Prestissimo
> test because of the memory allocation in the split. Considering the old task cleanup
> mechanism in presto cpp, the chance is very small and shall be some extreme time race.

zhli1142015 commented 1 year ago

Thanks for your help, that makes sense to me.

zhouyuan commented 1 year ago

@Yohahaha

I found that reverting some code from the patch below fixes this issue; the shared_ptr-based memory pool is able to work with the multi-threaded I/O pool. https://github.com/oap-project/gluten/pull/3120/files

-yuan

Yohahaha commented 1 year ago

@zhouyuan I don't think #3120 is related to this issue. As I mentioned above, the problem comes from the current thread model and object lifecycle management, and it can occur with split preload, which uses a per-Spark-executor thread pool.

One possible solution is to move the thread pool into ExecutionCtx so that it shares its lifecycle with the ExecutionCtx and the Spark task, but that would require changing a lot of Velox code.
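
A rough sketch of that direction (hypothetical class shape, assuming folly::IOThreadPoolExecutor as seen in the stack trace; not the actual Gluten ExecutionCtx) would have the context own and join the executor, so no queued preload job can outlive the context and its memory pools:

    // Hypothetical sketch only -- not the real Gluten ExecutionCtx API.
    #include <folly/executors/IOThreadPoolExecutor.h>
    #include <cstddef>
    #include <memory>

    class ExecutionCtx {
     public:
      explicit ExecutionCtx(std::size_t ioThreads)
          : ioExecutor_(std::make_unique<folly::IOThreadPoolExecutor>(ioThreads)) {}

      // Joining in the destructor guarantees that no preload lambda (and no Task
      // reference it captured) survives past this context.
      ~ExecutionCtx() { ioExecutor_->join(); }

      folly::IOThreadPoolExecutor* ioExecutor() { return ioExecutor_.get(); }

     private:
      std::unique_ptr<folly::IOThreadPoolExecutor> ioExecutor_;
    };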

If you confirm that #3120 triggers this issue, I agree to revert it.

Yohahaha commented 1 year ago

BTW, I suggest changing the issue title to be more accurate.

zhouyuan commented 1 year ago

> @zhouyuan I don't think #3120 is related to this issue. As I mentioned above, the problem comes from the current thread model and object lifecycle management, and it can occur with split preload, which uses a per-Spark-executor thread pool.
>
> One possible solution is to move the thread pool into ExecutionCtx so that it shares its lifecycle with the ExecutionCtx and the Spark task, but that would require changing a lot of Velox code.
>
> If you confirm that #3120 triggers this issue, I agree to revert it.

Yes, the older commit works in my environment. No need to revert; #3120 makes sense to me. I'll check whether we can do a quick fix.

thanks, -yuan

zhouyuan commented 1 year ago

> @zhouyuan I don't think #3120 is related to this issue. As I mentioned above, the problem comes from the current thread model and object lifecycle management, and it can occur with split preload, which uses a per-Spark-executor thread pool. One possible solution is to move the thread pool into ExecutionCtx so that it shares its lifecycle with the ExecutionCtx and the Spark task, but that would require changing a lot of Velox code. If you confirm that #3120 triggers this issue, I agree to revert it.
>
> Yes, the older commit works in my environment. No need to revert; #3120 makes sense to me. I'll check whether we can do a quick fix.
>
> thanks, -yuan

@Yohahaha I tried to reproduce this in a new, bigger setup, but the issue (and hence the fix) could not be reproduced there. So I think this may have been due to some flakiness; sorry for the false alert.

thanks, -yuan

zhztheplayer commented 1 year ago

@zhli1142015 @Yohahaha I am proposing https://github.com/oap-project/gluten/pull/3419, which can probably fix the issue. I'll find some time to try reproducing and testing, then get back to you.