dmlc / xgboost

Exception when cleaning up ThreadLocalStore learner when using Cuda #9751

Open eclrbohnhoff opened 1 year ago

eclrbohnhoff commented 1 year ago

Using Visual Studio 17.5.3 and the C API, built from the release_2.0.0 branch.

Creating a "worker thread" to fit a model using XGBoosterUpdateOneIter() and using the option XGBoosterSetParam(booster, "device", "cuda");. When worker thread exits, I get a dmlc::Error exception with this call stack:

    xgboost.dll!dmlc::LogMessageFatal::~LogMessageFatal() Line 428  C++
    xgboost.dll!dh::ThrowOnCudaError(enum cudaError,char const *,int)   C++
    xgboost.dll!xgboost::HostDeviceVectorImpl<float>::SetDevice(void)   C++
    xgboost.dll!xgboost::HostDeviceVectorImpl<float>::`scalar deleting destructor'(unsigned int)    C++
    xgboost.dll!xgboost::HostDeviceVector<float>::~HostDeviceVector<float>(void)    C++
    xgboost.dll!xgboost::XGBAPIThreadLocalEntry::~XGBAPIThreadLocalEntry()  C++
    xgboost.dll!std::_Tree_val<std::_Tree_simple_types<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>>>::_Erase_tree<std::allocator<std::_Tree_node<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>,void *>>>(std::allocator<std::_Tree_node<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>,void *>> & _Al, std::_Tree_node<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>,void *> * _Rootnode) Line 747 C++
    [Inline Frame] xgboost.dll!std::_Tree_val<std::_Tree_simple_types<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>>>::_Erase_head(std::allocator<std::_Tree_node<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>,void *>> &) Line 754    C++
    [Inline Frame] xgboost.dll!std::_Tree<std::_Tmap_traits<xgboost::Learner const *,xgboost::XGBAPIThreadLocalEntry,std::less<xgboost::Learner const *>,std::allocator<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>>,0>>::{dtor}() Line 1081  C++
    xgboost.dll!`dmlc::ThreadLocalStore<std::map<xgboost::Learner const *,xgboost::XGBAPIThreadLocalEntry,std::less<xgboost::Learner const *>,std::allocator<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>>>>::Get'::`2'::`dynamic atexit destructor for 'inst''()  C++
    xgboost.dll!__dyn_tls_dtor(void * __formal, const unsigned long dwReason, void * __formal) Line 119 C++

I do not get this exception when I use the CPU device.

trivialfis commented 1 year ago

Could you please share the error message? Also, how does the worker thread exit: is the OS reclaiming the thread, or are you joining it yourself?

eclrbohnhoff commented 1 year ago

    [09:06:21] C:\eclwork\Source\3rdParty\xgboost\xgboost\src\common\common.h:45: C:\eclwork\Source\3rdParty\xgboost\xgboost\src\common\host_device_vector.cu: 264: cudaErrorInitializationError: initialization error

It looks like calling XGBoosterPredictFromDMatrix() is what triggers this condition.

The worker thread is a normal std::thread; the main thread calls join() after the thread is signaled to return.
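For reference, a minimal sketch of that lifecycle (names are illustrative): the crash site in the stack above is a thread_local destructor, and those run as the worker exits, before join() returns on the main thread.

    #include <atomic>
    #include <thread>

    int main() {
      std::atomic<bool> quit{false};
      std::thread worker([&] {
        // ... all XGBoost C-API calls (listed below) happen on this thread ...
        while (!quit.load()) { /* do work until signaled to return */ }
      });  // thread_local destructors, including dmlc::ThreadLocalStore's, run here
      quit.store(true);  // main thread signals the worker to return
      worker.join();     // and then joins it, as described above
      return 0;
    }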

Functions are called in this order:

    XGDMatrixCreateFromCallback()  // training data Xy
    XGDMatrixCreateFromCallback()  // test data XyTest

The callbacks use:

    XGProxyDMatrixSetDataDense()
    XGDMatrixSetDenseInfo()  // setting "label"

The rest of the calls:

    XGBoosterCreate()  // cache of Xy and XyTest
    XGBoosterSetParam()  // multiple calls
    for (...) {
        XGBoosterUpdateOneIter()  // Xy
        XGBoosterEvalOneIter()
    }
    XGDMatrixFree(Xy)
    XGBoosterPredictFromDMatrix()  // XyTest
    XGDMatrixGetFloatInfo(XyTest, "label", ...)
    XGDMatrixFree(XyTest)

The thread then returns and the main thread takes ownership of the BoosterHandle.
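Putting the sequence together, a rough self-contained sketch of the flow (hedged: signatures follow the release_2.0.0 C API as I understand it; for brevity the DMatrix objects are built with XGDMatrixCreateFromMat/XGDMatrixSetFloatInfo rather than the callback path, the JSON values in the predict config are illustrative, and error handling is reduced to a macro):

    #include <xgboost/c_api.h>
    #include <cmath>    // NAN
    #include <cstdio>
    #include <thread>
    #include <vector>

    #define SAFE(call) do { if ((call) != 0) \
        std::fprintf(stderr, "%s\n", XGBGetLastError()); } while (0)

    int main() {
      BoosterHandle booster = nullptr;

      std::thread worker([&] {
        // Stand-in dense data; the real code builds Xy/XyTest with
        // XGDMatrixCreateFromCallback + XGProxyDMatrixSetDataDense +
        // XGDMatrixSetDenseInfo instead.
        std::vector<float> X(100 * 4, 0.5f), y(100, 1.0f);
        DMatrixHandle Xy = nullptr, XyTest = nullptr;
        SAFE(XGDMatrixCreateFromMat(X.data(), 100, 4, NAN, &Xy));
        SAFE(XGDMatrixCreateFromMat(X.data(), 100, 4, NAN, &XyTest));
        SAFE(XGDMatrixSetFloatInfo(Xy, "label", y.data(), 100));
        SAFE(XGDMatrixSetFloatInfo(XyTest, "label", y.data(), 100));

        DMatrixHandle cache[] = {Xy, XyTest};
        SAFE(XGBoosterCreate(cache, 2, &booster));
        SAFE(XGBoosterSetParam(booster, "device", "cuda"));  // CPU does not crash

        for (int iter = 0; iter < 10; ++iter) {
          char const *names[] = {"train", "test"};
          char const *eval = nullptr;
          SAFE(XGBoosterUpdateOneIter(booster, iter, Xy));
          SAFE(XGBoosterEvalOneIter(booster, iter, cache, names, 2, &eval));
        }
        SAFE(XGDMatrixFree(Xy));

        // The predict call that appears to trigger the failure: its result
        // buffer lives in per-thread storage torn down at thread exit.
        bst_ulong const *shape = nullptr;
        bst_ulong dim = 0;
        float const *result = nullptr;
        SAFE(XGBoosterPredictFromDMatrix(
            booster, XyTest,
            R"({"type": 0, "training": false, "iteration_begin": 0,
                "iteration_end": 0, "strict_shape": false})",
            &shape, &dim, &result));

        bst_ulong n = 0;
        float const *labels = nullptr;
        SAFE(XGDMatrixGetFloatInfo(XyTest, "label", &n, &labels));
        SAFE(XGDMatrixFree(XyTest));
      });  // <- the dmlc::ThreadLocalStore cleanup in the stack runs here

      worker.join();  // main thread takes ownership of the BoosterHandle
      SAFE(XGBoosterFree(booster));
      return 0;
    }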

trivialfis commented 1 year ago

Unfortunately, I don't quite understand the cause of this at the moment; I don't use Windows myself. Based on the error message, my guess is that, during the destruction of the thread's local memory, the CUDA runtime context is destroyed by the system before XGBoost can free its device memory.

If my guess is correct, we will have to introduce a new predict function that returns a memory handle and asks users to manage the returned prediction buffer themselves.
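To make that hypothesis concrete, here is an illustrative sketch (not XGBoost code) of the pattern visible in the stack trace: a thread_local object whose destructor touches the CUDA runtime at thread exit.

    // Illustrative only: mimics the crashing pattern in the stack trace,
    // where a thread_local destructor frees device memory on thread exit.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <thread>

    struct DeviceBuffer {
      float *data{nullptr};
      ~DeviceBuffer() {
        if (data != nullptr) {
          // If the CUDA runtime has already shut down when this runs (the
          // guess above for the Windows DLL thread-detach path), this
          // returns an error such as cudaErrorInitializationError instead
          // of freeing the memory.
          cudaError_t err = cudaFree(data);
          if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaFree: %s\n", cudaGetErrorString(err));
          }
        }
      }
    };

    thread_local DeviceBuffer tls_buffer;  // destroyed on each thread's exit

    int main() {
      std::thread worker([] {
        cudaMalloc(&tls_buffer.data, 1024 * sizeof(float));
        // ... training / prediction would happen here ...
      });  // ~DeviceBuffer runs as the worker exits, like __dyn_tls_dtor above
      worker.join();
      return 0;
    }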