apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.15k stars 417 forks source link

Velox memory manager throw exception #4106

Closed ulysses-you closed 9 months ago

ulysses-you commented 9 months ago

Backend

VL (Velox)

Bug description

It seems the issue is caused by https://github.com/facebookincubator/velox/pull/7168

java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: The memory manager is not set
Retriable: False
Expression: instanceRef != nullptr
Function: getInstance
File: ../../velox/common/memory/Memory.cpp
Line: 129
Stack trace:
# 0  facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)
# 1  void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxRuntimeError, char const*>(facebook::velox::detail::VeloxCheckFailArgs const&, char const*)
# 2  facebook::velox::memory::MemoryManager::getInstance()
# 3  facebook::velox::memory::spillMemoryPool()
# 4  facebook::velox::exec::HashBuild::setupSpiller(facebook::velox::exec::SpillPartition*)
# 5  facebook::velox::exec::HashBuild::HashBuild(int, facebook::velox::exec::DriverCtx*, std::shared_ptr<facebook::velox::core::HashJoinNode const>)
# 6  std::_Function_handler<std::unique_ptr<facebook::velox::exec::Operator, std::default_delete<facebook::velox::exec::Operator> > (int, facebook::velox::exec::DriverCtx*), facebook::velox::exec::detail::makeConsumerSupplier(std::shared_ptr<facebook::velox::core::PlanNode const> const&)::{lambda(int, facebook::velox::exec::DriverCtx*)#3}>::_M_invoke(std::_Any_data const&, int&&, facebook::velox::exec::DriverCtx*&&)
# 7  facebook::velox::exec::DriverFactory::createDriver(std::unique_ptr<facebook::velox::exec::DriverCtx, std::default_delete<facebook::velox::exec::DriverCtx> >, std::shared_ptr<facebook::velox::exec::ExchangeClient>, std::function<int (int)>)
# 8  facebook::velox::exec::Task::createDriversLocked(unsigned int)
# 9  facebook::velox::exec::Task::next(folly::SemiFuture<folly::Unit>*)
# 10 gluten::WholeStageResultIterator::next()
# 11 Java_io_glutenproject_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 12 0x00007fc37211f4a8
image

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

ulysses-you commented 9 months ago

cc @zhztheplayer @zhouyuan @PHILO-HE

zhouyuan commented 9 months ago

@ulysses-you thanks, I ran int this in some unit tests with the default memory manager used. The TPC-H/DS runs ok w/o this patch.
https://github.com/oap-project/gluten/blob/main/cpp/velox/memory/VeloxMemoryManager.h#L84

@zhztheplayer do you see the RCA why the memory manager is not initialized?

thanks, -yuan

zhztheplayer commented 9 months ago

Somewhere in Velox still requests for the default memory pool. In the case facebook::velox::memory::spillMemoryPool().

Another options is to see if we could pass a context pool for spilling use, but that can be another topic to me.

zhouyuan commented 9 months ago

@zhztheplayer the spill memory pool should also use "the same" memory manager instance. I was wondering why tpch/ds does not fail if the memory manager singleton instance is not registered.

zhztheplayer commented 9 months ago

@zhztheplayer the spill memory pool should also use "the same" memory manager instance. I was wondering why tpch/ds does not fail if the memory manager singleton instance is not registered.

It's because Velox still has some code calling a function named deprecatedDefaultMemoryManager which sets the memory manager if missing. Not sure why the function not called in this issue's case. @ulysses-you if you could help investigating? I see the memory manager is being set via the following call stack:

facebook::velox::memory::instance Memory.cpp:34
facebook::velox::memory::MemoryManager::deprecatedGetInstance Memory.cpp:107
facebook::velox::memory::deprecatedDefaultMemoryManager Memory.cpp:290
facebook::velox::core::QueryCtx::initPool QueryCtx.h:139
facebook::velox::core::QueryCtx::QueryCtx QueryCtx.cpp:36
Java_io_glutenproject_vectorized_PlanEvaluatorJniWrapper_nativeValidateWithFailureReason VeloxJniWrapper.cc:109
ulysses-you commented 9 months ago

@zhztheplayer it seems the nativeValidateWithFailureReason is only called at driver ?

zhztheplayer commented 9 months ago

@zhztheplayer it seems the nativeValidateWithFailureReason is only called at driver ?

Aha correct. I was testing with local mode. Probably we should add some for local-cluster mode later.

FelixYBW commented 9 months ago

Is it a new error intorduced by rebase?

ulysses-you commented 9 months ago

@FelixYBW it is intriduced since 12.18 rebase