apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Deadlock found during memory arbitration #7800

Open Yohahaha opened 2 weeks ago

Yohahaha commented 2 weeks ago

Backend

VL (Velox)

Bug description


Thread 367 (Thread 0x7f69ed5ff640 (LWP 821)):
#0  0x00007f6a1a8619cd in syscall () from /lib64/libc.so.6
#1  0x00007f695a9fa407 in folly::detail::(anonymous namespace)::nativeFutexWaitImpl (addr=0x7f69ed5f9808, expected=4294967293, absSystemTime=0x0, absSteadyTime=0x0, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex.cpp:126
#2  0x00007f695a9fa5fe in folly::detail::futexWaitImpl (futex=0x7f69ed5f9808, expected=4294967293, absSystemTime=0x0, absSteadyTime=0x0, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex.cpp:254
#3  0x00007f695a9a233d in folly::detail::futexWait<std::atomic<unsigned int> > (futex=0x7f69ed5f9808, expected=4294967293, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex-inl.h:96
#4  0x00007f695aad4f4c in folly::detail::MemoryIdler::futexWait<std::atomic<unsigned int>, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=4294967293, waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/MemoryIdler.h:128
#5  0x00007f695aad2458 in folly::fibers::Baton::waitThread (this=0x7f69ed5f9808) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton.cpp:70
#6  0x00007f695aad29a5 in folly::fibers::Baton::wait<folly::fibers::Baton::wait()::<lambda()> >(struct {...} &&) (this=0x7f69ed5f9808, mainContextFunc=...) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton-inl.h:54
#7  0x00007f695aad2249 in folly::fibers::Baton::wait (this=0x7f69ed5f9808) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton.cpp:46
#8  0x00007f69577e229b in void folly::futures::detail::waitImpl<folly::SemiFuture<folly::Unit>, folly::Unit>(folly::SemiFuture<folly::Unit>&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#9  0x00007f69596db1aa in facebook::velox::exec::Task::MemoryReclaimer::reclaimTask(std::shared_ptr<facebook::velox::exec::Task> const&, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#10 0x00007f69596db871 in facebook::velox::exec::Task::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#11 0x00007f695a91959a in facebook::velox::memory::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#12 0x00007f695780e279 in gluten::ListenableArbitrator::shrinkCapacity(std::vector<std::shared_ptr<facebook::velox::memory::MemoryPool>, std::allocator<std::shared_ptr<facebook::velox::memory::MemoryPool> > > const&, unsigned long, bool, bool) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#13 0x00007f69577ed017 in gluten::WholeStageResultIterator::spillFixedSize(long) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#14 0x00007f695f2a367e in Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeSpill () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libgluten.so
#15 0x00007f6a090185a7 in ?? ()
#16 0x00007f6933af7690 in ?? ()
#17 0x00007f69ed5f9eb0 in ?? ()
#18 0x00007f69644a9800 in ?? ()
#19 0x00007f695f29ceb6 in Java_org_apache_gluten_exec_RuntimeJniWrapper_shrinkMemory () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libgluten.so
#20 0x00007f69ed5f9e90 in ?? ()
#21 0x00007f6a09007b10 in ?? ()
#22 0x0000000000000000 in ?? ()

Thread 366 (Thread 0x7f69f2be0640 (LWP 818)):
#0  0x00007f6a1a8619cd in syscall () from /lib64/libc.so.6
#1  0x00007f695a9fa407 in folly::detail::(anonymous namespace)::nativeFutexWaitImpl (addr=0x7f69f2bda848, expected=4294967293, absSystemTime=0x0, absSteadyTime=0x0, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex.cpp:126
#2  0x00007f695a9fa5fe in folly::detail::futexWaitImpl (futex=0x7f69f2bda848, expected=4294967293, absSystemTime=0x0, absSteadyTime=0x0, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex.cpp:254
#3  0x00007f695a9a233d in folly::detail::futexWait<std::atomic<unsigned int> > (futex=0x7f69f2bda848, expected=4294967293, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex-inl.h:96
#4  0x00007f695aad4f4c in folly::detail::MemoryIdler::futexWait<std::atomic<unsigned int>, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=4294967293, waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/MemoryIdler.h:128
#5  0x00007f695aad2458 in folly::fibers::Baton::waitThread (this=0x7f69f2bda848) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton.cpp:70
#6  0x00007f695aad29a5 in folly::fibers::Baton::wait<folly::fibers::Baton::wait()::<lambda()> >(struct {...} &&) (this=0x7f69f2bda848, mainContextFunc=...) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton-inl.h:54
#7  0x00007f695aad2249 in folly::fibers::Baton::wait (this=0x7f69f2bda848) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton.cpp:46
#8  0x00007f69577e229b in void folly::futures::detail::waitImpl<folly::SemiFuture<folly::Unit>, folly::Unit>(folly::SemiFuture<folly::Unit>&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#9  0x00007f69596db1aa in facebook::velox::exec::Task::MemoryReclaimer::reclaimTask(std::shared_ptr<facebook::velox::exec::Task> const&, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#10 0x00007f69596db871 in facebook::velox::exec::Task::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#11 0x00007f695a91959a in facebook::velox::memory::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#12 0x00007f695780e279 in gluten::ListenableArbitrator::shrinkCapacity(std::vector<std::shared_ptr<facebook::velox::memory::MemoryPool>, std::allocator<std::shared_ptr<facebook::velox::memory::MemoryPool> > > const&, unsigned long, bool, bool) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#13 0x00007f69577ed017 in gluten::WholeStageResultIterator::spillFixedSize(long) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#14 0x00007f695f2a367e in Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeSpill () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libgluten.so
#15 0x00007f6a090185a7 in ?? ()
#16 0x00000007c0001068 in ?? ()
#17 0x0000000000000001 in ?? ()
#18 0x00007f69644a6800 in ?? ()
#19 0x00007f69644a6800 in ?? ()
#20 0x00007f69f2bdae90 in ?? ()
#21 0x00007f69f2bdae18 in ?? ()
#22 0x0000000000000000 in ?? ()

Thread 365 (Thread 0x7f69ee5fd640 (LWP 815)):
#0  0x00007f6a1a8619cd in syscall () from /lib64/libc.so.6
#1  0x00007f695a9fa407 in folly::detail::(anonymous namespace)::nativeFutexWaitImpl (addr=0x7f69ee5f7768, expected=4294967293, absSystemTime=0x0, absSteadyTime=0x0, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex.cpp:126
#2  0x00007f695a9fa5fe in folly::detail::futexWaitImpl (futex=0x7f69ee5f7768, expected=4294967293, absSystemTime=0x0, absSteadyTime=0x0, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex.cpp:254
#3  0x00007f695a9a233d in folly::detail::futexWait<std::atomic<unsigned int> > (futex=0x7f69ee5f7768, expected=4294967293, waitMask=4294967295) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/Futex-inl.h:96
#4  0x00007f695aad4f4c in folly::detail::MemoryIdler::futexWait<std::atomic<unsigned int>, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=4294967293, waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/detail/MemoryIdler.h:128
#5  0x00007f695aad2458 in folly::fibers::Baton::waitThread (this=0x7f69ee5f7768) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton.cpp:70
#6  0x00007f695aad29a5 in folly::fibers::Baton::wait<folly::fibers::Baton::wait()::<lambda()> >(struct {...} &&) (this=0x7f69ee5f7768, mainContextFunc=...) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton-inl.h:54
#7  0x00007f695aad2249 in folly::fibers::Baton::wait (this=0x7f69ee5f7768) at /home/admin/gluten/ep/build-velox/build/velox_ep/folly/folly/fibers/Baton.cpp:46
#8  0x00007f69577e229b in void folly::futures::detail::waitImpl<folly::SemiFuture<folly::Unit>, folly::Unit>(folly::SemiFuture<folly::Unit>&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#9  0x00007f69596db1aa in facebook::velox::exec::Task::MemoryReclaimer::reclaimTask(std::shared_ptr<facebook::velox::exec::Task> const&, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#10 0x00007f69596db871 in facebook::velox::exec::Task::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#11 0x00007f695a91959a in facebook::velox::memory::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#12 0x00007f695780e279 in gluten::ListenableArbitrator::shrinkCapacity(std::vector<std::shared_ptr<facebook::velox::memory::MemoryPool>, std::allocator<std::shared_ptr<facebook::velox::memory::MemoryPool> > > const&, unsigned long, bool, bool) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#13 0x00007f69577ed017 in gluten::WholeStageResultIterator::spillFixedSize(long) () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libvelox.so
#14 0x00007f695f2a367e in Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeSpill () from /var/data/spark-764eafc1-9cc9-46aa-bae3-833730d80c6e/gluten-e4e67167-347b-45b1-8391-7331a15e6f5b/jni/3c4c9166-80c3-4084-ae2e-c1a0ae7dec19/gluten-5857544276140517290/libgluten.so
#15 0x00007f6a090185a7 in ?? ()
#16 0x0000000000000000 in ?? ()

plan

glutenPlan -> vanilla project -> glutenPlan -> vanilla writer

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

zhztheplayer commented 2 weeks ago

Does your code include this change?

Yohahaha commented 2 weeks ago

> Does your code include this change?

Not yet. I will try to upgrade this month. Thank you!

Yohahaha commented 2 weeks ago

> Does your code include this change?

But I think adding a reclaim timeout is still needed.

zhztheplayer commented 2 weeks ago

> > Does your code include this change?
>
> But I think adding a reclaim timeout is still needed.

OK, I see PR https://github.com/apache/incubator-gluten/pull/7799. I will take a closer look soon. Thanks.
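
For illustration only: in the traces above, all three threads sit in `folly::futures::detail::waitImpl` under `Task::MemoryReclaimer::reclaimTask`, and the futex frames show `absSystemTime=0x0, absSteadyTime=0x0`, which suggests a wait with no deadline. The general idea behind a reclaim timeout can be sketched with a bounded wait. This is a minimal sketch that uses `std::future` as a stand-in for `folly::SemiFuture`; the names `ReclaimWait` and `waitWithTimeout` are hypothetical and are not Velox or Gluten APIs, and this is not necessarily what the linked PR does.

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Hypothetical result type for this sketch; not part of Velox or Gluten.
enum class ReclaimWait { kCompleted, kTimedOut };

// Waits until `paused` is ready, but gives up after `timeout` instead of
// blocking forever. Assumes `paused` comes from a std::promise (a deferred
// std::async future would be reported as timed out here).
inline ReclaimWait waitWithTimeout(
    std::future<void>& paused,
    std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);
  if (paused.wait_for(timeout) == std::future_status::ready) {
    paused.get(); // Rethrows an exception stored in the future, if any.
    return ReclaimWait::kCompleted;
  }
  // The caller decides what to do next: skip this reclaim attempt,
  // retry later, or fail the request.
  return ReclaimWait::kTimedOut;
}
```

A caller would check for `ReclaimWait::kTimedOut` and skip or abort the spill attempt rather than holding the thread indefinitely. Whether a timeout alone is sufficient depends on why the awaited task never pauses, which the traces do not show.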