dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

/root/repo/xgboost/src/linear/../common/device_helpers.cuh(872): an illegal memory access was encountered Aborted (core dumped) #3612

Closed · pseudotensor closed this issue 5 years ago

pseudotensor commented 6 years ago

Attachment: illegal.zip

This happens in GLM (the gblinear booster with the GPU coordinate updater); backtrace below.

@RAMitchell

#0  0x00007f4ef62b0428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f4ef62b202a in __GI_abort () at abort.c:89
#2  0x00007f4ea06d084d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f4ea06ce6b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f4ea06cd6a9 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f4ea06ce005 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f4ea043af83 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007f4ea043b487 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x00007f4e7f86cf71 in dh::ThrowOnCudaError (line=872, file=0x7f4e7f97aca0 "/root/repo/xgboost/src/linear/../common/device_helpers.cuh", code=cudaErrorIllegalAddress) at /root/repo/xgboost/src/linear/../common/device_helpers.cuh:44
#9  dh::SumReduction<thrust::permutation_iterator<thrust::device_ptr<xgboost::detail::GradientPairInternal<float> >, thrust::transform_iterator<__nv_dl_wrapper_t<__nv_dl_tag<xgboost::detail::GradientPairInternal<float> (xgboost::linear::DeviceShard::*)(int, int), &xgboost::linear::DeviceShard::GetBiasGradient, 1u>, int, int>, thrust::counting_iterator<unsigned long long, thrust::use_default, thrust::use_default, thrust::use_default>, unsigned long, thrust::use_default> > > (
    tmp_mem=..., in=..., nVals=<optimized out>) at /root/repo/xgboost/src/linear/../common/device_helpers.cuh:871
#10 0x00007f4e7f86805b in xgboost::linear::DeviceShard::GetBiasGradient (num_group=<optimized out>, group_idx=<optimized out>, this=<optimized out>) at /root/repo/xgboost/src/linear/updater_gpu_coordinate.cu:158
#11 xgboost::linear::GPUCoordinateUpdater::UpdateBias(xgboost::DMatrix*, xgboost::gbm::GBLinearModel*)::{lambda(std::unique_ptr<xgboost::linear::DeviceShard, std::default_delete<xgboost::linear::DeviceShard> >&)#1}::operator()(std::unique_ptr<xgboost::linear::DeviceShard, std::default_delete<xgboost::linear::DeviceShard> >&) const (shard=..., __closure=<optimized out>) at /root/repo/xgboost/src/linear/updater_gpu_coordinate.cu:294
#12 dh::ReduceShards<xgboost::detail::GradientPairInternal<float>, std::unique_ptr<xgboost::linear::DeviceShard, std::default_delete<xgboost::linear::DeviceShard> >, xgboost::linear::GPUCoordinateUpdater::UpdateBias(xgboost::DMatrix*, xgboost::gbm::GBLinearModel*)::{lambda(std::unique_ptr<xgboost::linear::DeviceShard, std::default_delete<xgboost::linear::DeviceShard> >&)#1}> () at /root/repo/xgboost/src/linear/../common/device_helpers.cuh:1117
#13 0x00007f4e7f873a48 in dh::ReduceShards<xgboost::detail::GradientPairInternal<float>, std::unique_ptr<xgboost::linear::DeviceShard>, xgboost::linear::GPUCoordinateUpdater::UpdateBias(xgboost::DMatrix*, xgboost::gbm::GBLinearModel*)::__lambda6> (f=..., shards=<optimized out>) at /root/repo/xgboost/src/linear/../common/device_helpers.cuh:1115
#14 xgboost::linear::GPUCoordinateUpdater::UpdateBias (model=0x1c6c3c0, p_fmat=0xd50ae0, this=0x1c6c860) at /root/repo/xgboost/src/linear/updater_gpu_coordinate.cu:294
#15 xgboost::linear::GPUCoordinateUpdater::Update (this=0x1c6c860, in_gpair=0x1c6ab90, p_fmat=0xd50ae0, model=0x1c6c3c0, sum_instance_weight=<optimized out>) at /root/repo/xgboost/src/linear/updater_gpu_coordinate.cu:266
#16 0x00007f4e7f6e9167 in xgboost::gbm::GBLinear::DoBoost (this=0x1c6c3b0, p_fmat=0xd50ae0, in_gpair=0x1c6ab90, obj=<optimized out>) at /root/repo/xgboost/src/gbm/gblinear.cc:100
#17 0x00007f4e7f70f350 in xgboost::LearnerImpl::UpdateOneIter (this=0x1c6aa30, iter=0, train=0xd50ae0) at /root/repo/xgboost/src/learner.cc:403
#18 0x00007f4e7f68ba45 in XGBoosterUpdateOneIter (handle=0x1c6a9e0, iter=0, dtrain=0x1be2670) at /root/repo/xgboost/src/c_api/c_api.cc:863
#19 0x00007f4ef4e9fe40 in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#20 0x00007f4ef4e9f8ab in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#21 0x00007f4ef50f8d2f in _call_function_pointer (argcount=3, resmem=0x7ffc57550b80, restype=<optimized out>, atypes=<optimized out>, avalues=0x7ffc57550b50, pProc=0x7f4e7f68ba10 <XGBoosterUpdateOneIter(BoosterHandle, int, DMatrixHandle)>, 
    flags=4353) at /home/tmp/python-build.20180212045501.4616/Python-3.6.4/Modules/_ctypes/callproc.c:809
---Type <return> to continue, or q <return> to quit---
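
The frames above show the full path from Python into the GPU linear updater: the ctypes call reaches XGBoosterUpdateOneIter (frame #18), which goes through GBLinear::DoBoost into GPUCoordinateUpdater::UpdateBias, and the illegal address is reported from dh::SumReduction inside GetBiasGradient and rethrown by dh::ThrowOnCudaError at device_helpers.cuh:872. A minimal sketch of the kind of script that exercises this path (hypothetical data and parameter values, not the contents of illegal.zip; it assumes the gblinear booster with the gpu_coord_descent updater, which is what the trace points at, and the objective name used by xgboost of that era):

# Hypothetical repro sketch: drives the same code path as the backtrace above,
# i.e. gblinear with the GPU coordinate descent updater on synthetic data.
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(100000, 50).astype(np.float32)
y = rng.rand(100000).astype(np.float32)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gblinear",           # linear (GLM) model
    "updater": "gpu_coord_descent",  # GPUCoordinateUpdater from the trace
    "objective": "reg:linear",       # objective name of that xgboost era (assumption)
    "alpha": 0.0,
    "lambda": 0.0,
}

# Each boosting round goes through XGBoosterUpdateOneIter ->
# GBLinear::DoBoost -> GPUCoordinateUpdater::Update/UpdateBias,
# matching the frames shown in the backtrace.
bst = xgb.train(params, dtrain, num_boost_round=10)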
pseudotensor commented 6 years ago

Overall, this happens roughly 5% of the time across various datasets.
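
Because the failure is an abort inside native code (core dump), it cannot be caught with a Python try/except in the same process; one way to measure an intermittent failure rate like this is to run the repro in a subprocess loop and count non-zero exit codes. A sketch, where repro.py is a placeholder name for whatever script triggers the crash:

# Hypothetical harness: reruns a repro script in a subprocess and counts
# aborts, since the illegal-access abort kills the whole process.
import subprocess
import sys

runs, failures = 50, 0
for i in range(runs):
    # repro.py is a placeholder for the actual repro script
    result = subprocess.run([sys.executable, "repro.py"])
    if result.returncode != 0:
        failures += 1
print(f"{failures}/{runs} runs crashed")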

RAMitchell commented 6 years ago

I cannot reproduce this on Windows or Ubuntu using a single GPU and the xgboost master branch. Are you using multiple GPUs? What else could be different?
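
One way to narrow that down is to pin the run to a single device and see whether the crash still occurs. A sketch under stated assumptions: CUDA_VISIBLE_DEVICES is standard CUDA, while n_gpus and gpu_id are assumed to be the single/multi-GPU parameters of xgboost from that era:

# Sketch: restrict the run to one GPU to test whether the crash depends on
# multi-GPU sharding (the dh::ReduceShards / DeviceShard frames in the trace).
# CUDA_VISIBLE_DEVICES must be set before any CUDA context is created.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first device

import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 20).astype(np.float32)
y = np.random.rand(10000).astype(np.float32)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gblinear",
    "updater": "gpu_coord_descent",
    "n_gpus": 1,  # assumption: single-GPU parameter from xgboost of that era
    "gpu_id": 0,
}
bst = xgb.train(params, dtrain, num_boost_round=10)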

trivialfis commented 5 years ago

@pseudotensor This should be addressed in 97984f4890e1b6357d8f98d121f2050220852998. Feel free to reopen if it doesn't work on your side.