dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[bug] Python - Cuda error (without using Cuda) #10171

Open stavoltafunzia opened 7 months ago

stavoltafunzia commented 7 months ago

I've recently upgraded to xgboost version 2.0.3 (Python), and since then I cannot use it anymore as it keeps crashing. The following simple code fails to run:

import xgboost as xgb
import numpy as np

train = xgb.DMatrix(np.array([1,2,3]).reshape((-1, 1)), label=np.array([2,3,4]))  # Oddly enough, the error does not show up if I don't specify the label

And the error message shows the following traceback:

Traceback (most recent call last):
  File "/home/nicola/mega_workspace/trader/debug_5.py", line 4, in <module>
    train = xgb.DMatrix(np.array([1,2,3]).reshape((-1, 1)), label=np.array([2,3,4]))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 730, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 869, in __init__
    self.set_info(
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 730, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 932, in set_info
    self.set_label(label)
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 1070, in set_label
    dispatch_meta_backend(self, label, "label", "float")
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/data.py", line 1218, in dispatch_meta_backend
    _meta_from_numpy(data, name, dtype, handle)
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/data.py", line 1159, in _meta_from_numpy
    _check_call(_LIB.XGDMatrixSetInfoFromInterface(handle, c_str(field), interface_str))
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 282, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [20:49:47] /home/conda/feedstock_root/build_artifacts/xgboost-split_1712072663242/work/src/data/array_interface.cu:44: Check failed: err == cudaGetLastError() (0 vs. 2) : 
Stack trace:
  [bt] (0) /home/nicola/Software/miniconda3/envs/text_xgb/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f12a9d2164e]
  [bt] (1) /home/nicola/Software/miniconda3/envs/text_xgb/lib/libxgboost.so(xgboost::ArrayInterfaceHandler::IsCudaPtr(void const*)+0xdb) [0x7f12aa3801fb]
  [bt] (2) /home/nicola/Software/miniconda3/envs/text_xgb/lib/libxgboost.so(xgboost::MetaInfo::SetInfo(xgboost::Context const&, xgboost::StringView, xgboost::StringView)+0x126) [0x7f12a9f0f426]
  [bt] (3) /home/nicola/Software/miniconda3/envs/text_xgb/lib/libxgboost.so(XGDMatrixSetInfoFromInterface+0xf7) [0x7f12a9d02927]
  [bt] (4) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x7f12c86a5052]
  [bt] (5) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x7f12c86a3925]
  [bt] (6) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x7f12c86a406e]
  [bt] (7) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92e5) [0x7f12c87bc2e5]
  [bt] (8) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x8837) [0x7f12c87bb837]

It surprises me that it throws an error related to Cuda, even though I'm only trying to use classic CPU xgboost. My configuration is as follows:

xgboost version: 2.0.3 (Python 3.11, clean anaconda environment with only xgboost installed)
OS: Debian 12, with Nvidia drivers 550.54.15  and Cuda 12.4
Hardware: RTX 4000 series card present

The code above used to run flawlessly in Python xgboost 1.7.x.


2024-04-09 update: it turns out that another process was using my GPU and consuming almost all of its VRAM. After closing that application, the example above works. Nevertheless, I don't know whether it should be considered a bug that any xgboost application (even a CPU-only one) crashes due to issues in the Cuda layer. I leave that decision to the developers (though I personally think it should not happen).
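For CPU-only workloads on a machine whose GPU is busy, a common workaround (standard CUDA behavior, not something stated in this thread) is to hide the GPU from the process entirely via the CUDA_VISIBLE_DEVICES environment variable, set before any CUDA-aware library initializes:

```python
import os

# Hide all GPUs from this process; this must happen before any
# CUDA-aware library (including libxgboost) initializes the CUDA runtime.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

try:
    # Guarded import so the sketch also runs where xgboost/numpy are absent.
    import numpy as np
    import xgboost as xgb

    train = xgb.DMatrix(np.array([1, 2, 3]).reshape((-1, 1)),
                        label=np.array([2, 3, 4]))
    print("DMatrix created with", train.num_row(), "rows")
except ImportError:
    print("xgboost/numpy not installed; the environment variable is set regardless")
```

With no visible devices, any pointer-origin query inside the library sees a CUDA-free environment, so a busy GPU cannot affect a CPU-only run.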

trivialfis commented 7 months ago

Haven't been able to reproduce with CUDA 12.3, trying 12.4 now.

trivialfis commented 7 months ago

Still haven't reproduced it.

trivialfis commented 7 months ago

That's odd; why would getting the last error return cudaErrorMemoryAllocation?

Except for Debian vs. Ubuntu, I have pretty much the same configuration:

OS: Debian 12, with Nvidia drivers 550.54.15  and Cuda 12.4
Hardware: RTX 4000 series card present
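For reference, the "(0 vs. 2)" in the log compares the expected cudaSuccess against the error actually returned by cudaGetLastError(); in the CUDA runtime's cudaError_t enum, code 2 is cudaErrorMemoryAllocation. A minimal lookup for the two codes involved (values hard-coded here from the CUDA runtime headers):

```python
# Subset of the CUDA runtime cudaError_t enum (driver_types.h);
# only the two codes relevant to this log are listed.
CUDA_ERRORS = {
    0: "cudaSuccess",
    2: "cudaErrorMemoryAllocation",
}

def describe(code: int) -> str:
    """Translate a numeric CUDA runtime error code into its enum name."""
    return CUDA_ERRORS.get(code, f"unknown cudaError_t ({code})")

# The check in array_interface.cu expected 0 but got 2:
print(describe(0))  # cudaSuccess
print(describe(2))  # cudaErrorMemoryAllocation
```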

stavoltafunzia commented 7 months ago

Apparently, another process was using my GPU and consuming almost all of its VRAM. After closing that application, the example above works. Nevertheless, I don't know whether it should be considered a bug that any xgboost application (even a CPU-based one) crashes due to issues in the Cuda layer.

trivialfis commented 7 months ago

That makes sense, I will open a PR to work around that. XGBoost needs to know whether the data comes from the GPU or the CPU, and we use the CUDA runtime to obtain this information. As a result, a pre-existing CUDA error surfaces while checking the input data.
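A rough Python-level sketch of that dispatch idea (a hypothetical helper, not XGBoost's actual code): GPU array libraries such as CuPy and Numba expose the `__cuda_array_interface__` protocol, while plain NumPy arrays do not. XGBoost's C++ core additionally asks the CUDA runtime whether a raw pointer is device memory, and that runtime call is where an outstanding CUDA error, such as the allocation failure above, can surface even for host data.

```python
import numpy as np

def is_on_gpu(data) -> bool:
    # Hypothetical helper: checks for the __cuda_array_interface__
    # protocol attribute that GPU array libraries expose. XGBoost's
    # C++ core goes further and queries the CUDA runtime about the
    # raw pointer, which is where a CUDA-level failure can leak in.
    return hasattr(data, "__cuda_array_interface__")

x = np.array([1, 2, 3])
print(is_on_gpu(x))  # False: a NumPy array lives in host memory
```

The attribute check alone never touches the CUDA runtime, which is why a pure-Python probe like this is immune to the error in this issue.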