dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Access violation in tree::ColMaker::Builder::EnumerateSplit() on Windows #6084

Open srogatch opened 4 years ago

srogatch commented 4 years ago

I am getting the same error in XGBoost: I tried versions 1.0 and 1.2. I never got this error on a less parallel processor; the processor on which I started getting it is a Ryzen Threadripper 3990X (64 physical, 128 logical cores). The call stack is:

>   xgboost.dll!xgboost::tree::ColMaker::Builder::EnumerateSplit(const xgboost::Entry * begin, const xgboost::Entry * end, int d_step, unsigned int fid, const std::vector<xgboost::detail::GradientPairInternal<float>,std::allocator<xgboost::detail::GradientPairInternal<float>>> & gpair, std::vector<xgboost::tree::ColMaker::ThreadEntry,std::allocator<xgboost::tree::ColMaker::ThreadEntry>> & temp) Line 395  C++
    xgboost.dll!xgboost::tree::ColMaker::Builder::UpdateSolution$omp$1() Line 456   C++
    vcomp140.dll!__vcomp_fork_helper() Line 91  Unknown
    vcomp140.dll!_vcomp::fork_helper_wrapper(void(*)() funclet, int arg_count, char * argptr) Line 348  C++
    vcomp140.dll!_vcomp::ParallelRegion::HandlerThreadFunc(void * context, unsigned long index) Line 329    C++
    vcomp140.dll!_vcomp::PersistentThreadFunc(void * pvContext) Line 240    C++
    kernel32.dll!BaseThreadInitThunk() Unknown
    ntdll.dll!RtlUserThreadStart() Unknown

The lines that throw are:

        for (i = 0, p = it; i < kBuffer; ++i, p += d_step) {
          buf_position[i] = position_[p->index];
          buf_gpair[i] = gpair[p->index]; // The debugger arrow points here
        }

The error is:

Exception thrown at 0x00007FFD6E459706 (xgboost.dll) in BatchXgbTrainClose.exe: 0xC0000005: Access violation reading location 0x0000016652B2D930.

If the debugger shows it correctly, the value of p->index is 1074375450.
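An index of 1074375450 is far beyond any plausible row count, which suggests the `Entry` being read is itself corrupted (or stale), not merely that `gpair` is a bit too short. A hypothetical guard along these lines (this is not XGBoost source code; `Entry` and the function are stand-ins mirroring the crash site) would turn the access violation into a diagnosable failure in debug runs:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Stand-in for xgboost::Entry at the crash site; illustration only.
struct Entry {
  unsigned index;  // row index into the gradient vector
  float fvalue;
};

// Guarded version of the copy loop from EnumerateSplit(): validate each
// row index against the gradient vector size before dereferencing it.
bool CopyWithBoundsCheck(const Entry* it, int d_step, std::size_t kBuffer,
                         const std::vector<float>& gpair,
                         std::vector<float>* buf_gpair) {
  const Entry* p = it;
  for (std::size_t i = 0; i < kBuffer; ++i, p += d_step) {
    if (p->index >= gpair.size()) {
      std::fprintf(stderr, "corrupt row index %u (gpair size %zu)\n",
                   p->index, gpair.size());
      return false;  // fail loudly instead of reading out of bounds
    }
    (*buf_gpair)[i] = gpair[p->index];
  }
  return true;
}
```

Running a debug build with a check like this would at least show whether the bad index comes from the `Entry` array or from `position_`.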

The full call stack of the main (i.e. non-OpenMP) thread that led here is:

>   xgboost.dll!xgboost::tree::ColMaker::Builder::EnumerateSplit(const xgboost::Entry * begin, const xgboost::Entry * end, int d_step, unsigned int fid, const std::vector<xgboost::detail::GradientPairInternal<float>,std::allocator<xgboost::detail::GradientPairInternal<float>>> & gpair, std::vector<xgboost::tree::ColMaker::ThreadEntry,std::allocator<xgboost::tree::ColMaker::ThreadEntry>> & temp) Line 399  C++
    xgboost.dll!xgboost::tree::ColMaker::Builder::UpdateSolution$omp$1() Line 456   C++
    vcomp140.dll!__vcomp_fork_helper() Line 91  Unknown
    vcomp140.dll!_vcomp::fork_helper_wrapper(void(*)() funclet, int arg_count, char * argptr) Line 348  C++
    vcomp140.dll!_vcomp::ParallelRegion::HandlerThreadFunc(void * context, unsigned long index) Line 329    C++
    vcomp140.dll!InvokeThreadTeam(_THREAD_TEAM * ptm, void(*)(void *, unsigned long) pvContext, void *) Line 842    C++
    vcomp140.dll!_vcomp_fork(int if_test, int arg_count, void(*)() funclet, ...) Line 230   C++
    xgboost.dll!xgboost::tree::ColMaker::Builder::UpdateSolution(const xgboost::SparsePage & batch, const std::vector<unsigned int,std::allocator<unsigned int>> & feat_set, const std::vector<xgboost::detail::GradientPairInternal<float>,std::allocator<xgboost::detail::GradientPairInternal<float>>> & gpair, xgboost::DMatrix * p_fmat) Line 474  C++
    xgboost.dll!xgboost::tree::ColMaker::Builder::FindSplit(int depth, const std::vector<int,std::allocator<int>> & qexpand, const std::vector<xgboost::detail::GradientPairInternal<float>,std::allocator<xgboost::detail::GradientPairInternal<float>>> & gpair, xgboost::DMatrix * p_fmat, xgboost::RegTree * p_tree) Line 482   C++
    xgboost.dll!xgboost::tree::ColMaker::Builder::Update(const std::vector<xgboost::detail::GradientPairInternal<float>,std::allocator<xgboost::detail::GradientPairInternal<float>>> & gpair, xgboost::DMatrix * p_fmat, xgboost::RegTree * p_tree) Line 177   C++
    xgboost.dll!xgboost::tree::ColMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float>> * gpair, xgboost::DMatrix * dmat, const std::vector<xgboost::RegTree *,std::allocator<xgboost::RegTree *>> & trees) Line 116    C++
    xgboost.dll!xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float>> * gpair, xgboost::DMatrix * p_fmat, int bst_group, std::vector<std::unique_ptr<xgboost::RegTree,std::default_delete<xgboost::RegTree>>,std::allocator<std::unique_ptr<xgboost::RegTree,std::default_delete<xgboost::RegTree>>>> * ret) Line 301 C++
    xgboost.dll!xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix * p_fmat, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float>> * in_gpair, xgboost::PredictionCacheEntry * predt) Line 196    C++
    xgboost.dll!xgboost::LearnerImpl::UpdateOneIter(int iter, std::shared_ptr<xgboost::DMatrix> train) Line 977 C++
    xgboost.dll!XGBoosterUpdateOneIter(void * handle, int iter, void * dtrain) Line 441 C++

The code was compiled with several versions of MSVS 2019; the latest, 16.7.2, was used to build XGBoost 1.2.0. CUDA was enabled during compilation, but the CPU exact method is used in this experiment.

Sorry that I can't provide the program used for training, for several reasons: it's large, and it requires external data to train on. Here is what I can provide instead.

The parameters I used for XGBoost:

    safe_xgboost(XGBoosterSetParam(h_booster, "tree_method", "exact"));
    safe_xgboost(XGBoosterSetParam(h_booster, "colsample_bytree", "0.5"));
    safe_xgboost(XGBoosterSetParam(h_booster, "colsample_bylevel", "0.5"));
    safe_xgboost(XGBoosterSetParam(h_booster, "colsample_bynode", "0.5"));
    safe_xgboost(XGBoosterSetParam(h_booster, "gamma", "100000"));
    safe_xgboost(XGBoosterSetParam(h_booster, "reg_alpha", "10000"));
    safe_xgboost(XGBoosterSetParam(h_booster, "reg_lambda", "10000"));
    safe_xgboost(XGBoosterSetParam(h_booster, "learning_rate", "0.1"));
    safe_xgboost(XGBoosterSetParam(h_booster, "max_depth", "11"));

Please let me know how else I can help troubleshoot the issue, and thank you in advance for looking into it.

hcho3 commented 4 years ago

Thanks for the bug report.

Are you using the C API? Can you reproduce the error using the Python package?

Also, it would be great if we could reproduce this problem on Linux, since most of us contributors use Linux for daily development.

srogatch commented 4 years ago

Are you using the C API? Can you reproduce the error using the Python package?

Yes, I'm using the C API without any Python. For the reasons explained above, a reproducer is problematic to provide; however, I can provide a minidump if you give me a location to upload an ~8 GB archive.

Also, it would be great if we could reproduce this problem on Linux, since most of us contributors use Linux for daily development.

For now, I only have a minidump that you can open in MSVS2019 or (probably) WinDBG to try an initial investigation.

Please note that the problem happens rarely: once in an hour or two of XGBoost work on slightly different datasets (the labels change, the features stay the same). So minidump analysis seems more viable than trying to come up with a reproducer and running it for hours until it hopefully triggers the problem.

hcho3 commented 4 years ago

For now, I only have a minidump that you can open in MSVS2019 or (probably) WinDBG to try an initial investigation.

Windows programming is not my area of expertise, so I'm afraid I won't be useful here. I'll keep this issue open for now and see if anyone else can help.

(If the issue were on Linux, I'd be able to use Valgrind and a memory sanitizer to try to locate a possible memory issue.)

hcho3 commented 4 years ago

Also, one piece of advice: the exact algorithm has received relatively little attention recently. Most of the active development happens in the hist and gpu_hist algorithms. You may have better success with hist or gpu_hist.

srogatch commented 4 years ago

Also, one piece of advice: the exact algorithm has received relatively little attention recently. Most of the active development happens in the hist and gpu_hist algorithms. You may have better success with hist or gpu_hist.

Thank you for the advice. I'm using the exact algorithm because, on my dataset, the performance of XGBoost is only slightly better than a random guess for classification, or than the mean constant for regression. So every bit of accuracy matters, and I would rather wait longer for the exact algorithm to complete than use hist/gpu_hist approximations that bring the performance closer to a random guess/mean.

I'll try to come up with a Python reproducer, though this may take quite some time: I need to implement it in a sensible way, then wait for hours to see whether the problem reproduces, and if it doesn't, bring the reproducer closer to the real use case implemented in C++.

hcho3 commented 4 years ago

I would prefer to wait more for the exact algorithm to complete than to use hist/gpu_hist approximations

I understand. If you do decide to use the hist algorithm, you should set max_bin to a large number (like 1024 or bigger) to reduce the impact of the approximations. This number controls how many candidate thresholds are considered for each split.
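In the C API style of the original report, that switch would look something like this (`h_booster` and `safe_xgboost` are the names from the report above; 1024 is just an example value, not a tuned recommendation):

```cpp
    safe_xgboost(XGBoosterSetParam(h_booster, "tree_method", "hist"));
    // More bins = more candidate split thresholds per feature,
    // i.e. closer to the exact method at higher memory/compute cost.
    safe_xgboost(XGBoosterSetParam(h_booster, "max_bin", "1024"));
```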