eclrbohnhoff opened this issue 1 year ago (status: Open)
Could you please share the error message? Also, how does the worker thread exit: is the OS reclaiming the thread, or are you joining it yourself?
[09:06:21] C:\eclwork\Source\3rdParty\xgboost\xgboost\src\common\common.h:45: C:\eclwork\Source\3rdParty\xgboost\xgboost\src\common\host_device_vector.cu: 264: cudaErrorInitializationError: initialization error
It looks like the call to XGBoosterPredictFromDMatrix() is what triggers this condition.
The worker thread is a normal std::thread and join() is being called by the main thread after the thread is signaled to return.
Functions are called in this order:
XGDMatrixCreateFromCallback() //training data Xy
XGDMatrixCreateFromCallback() //test data XyTest
The callbacks use:
XGProxyDMatrixSetDataDense()
XGDMatrixSetDenseInfo() // setting "label"
The rest of the calls:
XGBoosterCreate() // cache of Xy and XyTest
XGBoosterSetParam() //multiple calls
for () {
XGBoosterUpdateOneIter() // Xy
XGBoosterEvalOneIter()
}
XGDMatrixFree(Xy)
XGBoosterPredictFromDMatrix() //XyTest
XGDMatrixGetFloatInfo(XyTest, "label"...)
XGDMatrixFree(XyTest)
Thread returns and main thread takes ownership of the BoosterHandle
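The threading pattern described above (a worker std::thread that trains, is signaled to return, is joined by the main thread, and then hands the model over) can be sketched roughly as follows. This is only an illustration of the pattern: `FakeBooster` and `train_on_worker` are hypothetical names standing in for the BoosterHandle and the training loop, and no XGBoost calls are made.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for a BoosterHandle; not an XGBoost type.
struct FakeBooster {
    int rounds_trained = 0;
};

// Trains on a worker std::thread until signaled, joins it, and returns the
// model so that the calling (main) thread owns it afterwards.
std::unique_ptr<FakeBooster> train_on_worker(int min_rounds) {
    std::atomic<bool> stop{false};
    std::atomic<int> progress{0};
    std::unique_ptr<FakeBooster> result;  // written by the worker, read only after join()

    std::thread worker([&] {
        auto booster = std::make_unique<FakeBooster>();
        while (!stop.load()) {
            ++booster->rounds_trained;  // stands in for XGBoosterUpdateOneIter()
            progress.store(booster->rounds_trained);
            std::this_thread::yield();
        }
        result = std::move(booster);  // hand the model over before returning
    });

    // Main thread: wait for some progress, then signal the worker to return.
    while (progress.load() < min_rounds) {
        std::this_thread::yield();
    }
    stop.store(true);
    worker.join();  // after join(), reading result is race-free
    assert(result && "worker must hand over the model");
    return result;
}
```

Calling `train_on_worker(3)` from the main thread returns a model with at least three rounds trained. Anything the worker kept in thread-local storage is gone by the time join() returns, while the returned handle lives on in the main thread.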
Unfortunately, I don't quite understand the cause of this at the moment, and I don't use Windows myself. Based on the error message, my guess is that by the time the worker thread's thread-local memory is destroyed, the system has already torn down the CUDA runtime context, so XGBoost can no longer free its device memory.
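To make the timing in that guess concrete, here is a small sketch of when a thread-local object is destroyed relative to the worker's exit and the main thread's join(). It assumes the prediction results live in thread-local storage, which is only the guess stated above and is not confirmed in this thread. `ScratchBuffer` and `predict_like_call` are hypothetical names, and the sketch uses plain host memory with no CUDA involved.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex g_log_mutex;
std::vector<std::string> g_log;

// Appends an event to the shared log in a thread-safe way.
void record(const std::string& event) {
    std::lock_guard<std::mutex> lock(g_log_mutex);
    g_log.push_back(event);
}

// Hypothetical stand-in for a per-thread prediction cache.
struct ScratchBuffer {
    ScratchBuffer() { record("buffer created"); }
    // In the guessed scenario, this is where device memory would be freed.
    ~ScratchBuffer() { record("buffer destroyed"); }
    int uses = 0;
};

void predict_like_call() {
    thread_local ScratchBuffer buffer;  // constructed on first use in each thread
    ++buffer.uses;
}

// Runs one worker thread and returns the ordered list of recorded events.
std::vector<std::string> run_worker_and_collect() {
    {
        std::lock_guard<std::mutex> lock(g_log_mutex);
        g_log.clear();
    }
    std::thread worker([] {
        predict_like_call();
        record("worker body finished");
    });
    assert(worker.joinable());  // not yet joined or detached
    worker.join();
    record("join returned");
    std::lock_guard<std::mutex> lock(g_log_mutex);
    return g_log;
}
```

The recorded order is "buffer created", "worker body finished", "buffer destroyed", "join returned". The destructor runs during thread exit, after the user code in the worker has finished, which is the window in which a runtime that has already been torn down for that thread would make the cleanup fail.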
If my guess is correct, then we will have to invent a new predict function that returns a memory handle and ask the users to manage the returned prediction buffer.
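A predict function that returns a caller-managed handle might look roughly like the sketch below. Every name in it (`PredictionHandle`, `HypoPredictToHandle`, `HypoPredictionData`, `HypoPredictionFree`, `predict_on_worker`) is hypothetical and does not exist in XGBoost. The "model" is a toy row sum, and host memory stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Everything here is hypothetical; none of these names exist in XGBoost.
typedef void* PredictionHandle;

struct PredictionBuffer {
    std::vector<float> values;
};

// Toy "predict": one output per row (the row sum), stored in a heap buffer
// that the caller owns. Returns 0 on success, -1 on bad arguments.
int HypoPredictToHandle(const float* features, std::size_t n_rows,
                        std::size_t n_cols, PredictionHandle* out) {
    if (features == nullptr || out == nullptr) return -1;
    PredictionBuffer* buffer = new PredictionBuffer;
    buffer->values.assign(n_rows, 0.0f);
    for (std::size_t r = 0; r < n_rows; ++r) {
        for (std::size_t c = 0; c < n_cols; ++c) {
            buffer->values[r] += features[r * n_cols + c];
        }
    }
    *out = buffer;
    return 0;
}

// Exposes the results without transferring ownership.
int HypoPredictionData(PredictionHandle handle, const float** data,
                       std::size_t* len) {
    if (handle == nullptr || data == nullptr || len == nullptr) return -1;
    PredictionBuffer* buffer = static_cast<PredictionBuffer*>(handle);
    *data = buffer->values.data();
    *len = buffer->values.size();
    return 0;
}

// The caller decides when, and on which thread, the buffer is released.
int HypoPredictionFree(PredictionHandle handle) {
    delete static_cast<PredictionBuffer*>(handle);
    return 0;
}

// Predicts on a worker thread, then reads and frees the result on the
// calling thread, so nothing is left for thread-exit cleanup.
std::vector<float> predict_on_worker(const std::vector<float>& features,
                                     std::size_t n_rows, std::size_t n_cols) {
    assert(features.size() == n_rows * n_cols);
    PredictionHandle handle = nullptr;
    std::thread worker([&] {
        HypoPredictToHandle(features.data(), n_rows, n_cols, &handle);
    });
    worker.join();

    const float* data = nullptr;
    std::size_t len = 0;
    std::vector<float> out;
    if (HypoPredictionData(handle, &data, &len) == 0) {
        out.assign(data, data + len);
    }
    HypoPredictionFree(handle);
    return out;
}
```

With a 2x2 input of {1, 2, 3, 4}, `predict_on_worker` returns {3, 7}. The design point is that the buffer's lifetime is tied to an explicit free call rather than to the worker thread's exit, so the user can release it while the runtime is still known to be alive.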
Using VS 17.5.3 and the C API with branch release_2.0.0
Creating a "worker thread" to fit a model using XGBoosterUpdateOneIter() with the option
XGBoosterSetParam(booster, "device", "cuda");
When the worker thread exits, I get a dmlc::Error exception with this call stack:
I do not get this exception when I use the cpu device version.