Closed: kevindarby closed this issue 11 months ago
sorry for the long paste, going to try with nvcc 11.8 next.
no dice with 11
/usr/local/cuda-11/include/crt/host_config.h:132:2: error: #error -- unsupported GNU version! gcc versions later than 11 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
  132 | #error -- unsupported GNU version!
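For context, the check that fires here lives in CUDA's host_config.h: the host GCC major version is compared against the newest version the toolkit supports, and the #error can only be bypassed with the '-allow-unsupported-compiler' nvcc flag. A minimal Python sketch of that gate (the helper name and the hard-coded maximum are illustrative):

```python
# Sketch of the host-compiler gate in CUDA's host_config.h: CUDA 11.x
# supports GCC <= 11, so GCC 12 hits the #error unless nvcc is given
# '-allow-unsupported-compiler'. The helper name is hypothetical.
def host_compiler_ok(gcc_major: int, max_supported: int = 11,
                     allow_unsupported: bool = False) -> bool:
    return gcc_major <= max_supported or allow_unsupported

print(host_compiler_ok(11))                          # True: supported compiler
print(host_compiler_ok(12))                          # False: build stops at the #error
print(host_compiler_ok(12, allow_unsupported=True))  # True: check bypassed, at your own risk
```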
Hi,
Thanks for reporting; some questions:
Hi, sure, I have pytorch cu118
cuda toolkit 12.1, upgrading to 12.2 (where they say this is fixed: https://github.com/pybind/pybind11/issues/4606)
SKLearn works, but I have a big dataset so I wanted to try the cuda backend. It's just from a notebook I'm working on
I'll try the 12.2 pybind fix, and if that doesn't work I'll try to see if there are nightly builds of pytorch cu122
cuda 12.2 fixed it
/home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-12'
Using /home/algo/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Emitting ninja build file /home/algo/.cache/torch_extensions/py39_cu118/split_decision/build.ninja...
Building extension module split_decision...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF splitgain_cpu.o.d -DTORCH_EXTENSION_NAME=split_decision -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/TH -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/THC -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/pgbm-2.1.1-py3.9-linux-x86_64.egg/pgbm/torch/splitgain_cpu.cpp -o splitgain_cpu.o
[2/2] c++ splitgain_cpu.o -shared -L/home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o split_decision.so
Loading extension module split_decision...
Using /home/algo/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
No modifications detected for re-loaded extension module split_decision, skipping build step...
Loading extension module split_decision...
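Side note on the warning above: the log shows a cu118 PyTorch wheel JIT-compiling extensions with CUDA_HOME='/usr/local/cuda-12'. A quick sketch of the major-version comparison involved (the tag map and helper name are hypothetical; the values come from the log):

```python
# Compare the CUDA major version baked into a PyTorch wheel tag against
# the toolkit that CUDA_HOME points at. The tag map and helper name are
# illustrative, not a real pgbm or torch API.
WHEEL_TAGS = {"cu118": 11, "cu121": 12, "cu122": 12}

def majors_match(wheel_tag: str, cuda_home: str) -> bool:
    toolkit_major = int(cuda_home.rstrip("/").split("cuda-")[-1].split(".")[0])
    return WHEEL_TAGS[wheel_tag] == toolkit_major

print(majors_match("cu118", "/usr/local/cuda-12"))  # False: wheel and toolkit majors differ
print(majors_match("cu122", "/usr/local/cuda-12"))  # True
```

A mismatch like this is not always fatal, but it is a useful first thing to rule out when JIT-compiled extensions misbehave.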
Ah, great. I did not yet know about this CUDA 12.1 issue, but good to hear that it is fixed. Does it work for your dataset now?
It does, this is a really nice tool.
There is a slight issue when installing from source with GCC 12.
I think it's Cython being strict about noexcept,
so one has to change splitting.pyx and add noexcept to compare_cat_infos:
cdef int compare_cat_infos(const void *a, const void *b) noexcept nogil:
I can write this up separately if you'd like
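For readers unfamiliar with the fix above: compare_cat_infos is a qsort-style comparator, i.e. it is called from C and must return a negative, zero, or positive int without ever raising, which is what the explicit noexcept guarantees under Cython 3's stricter defaults. A Python sketch of the same three-way comparison contract (values are made up):

```python
from functools import cmp_to_key

# A qsort-style three-way comparator: returns a negative, zero, or
# positive int, and must never raise when invoked from C code. That is
# the contract the `noexcept` annotation makes explicit in Cython 3.
def three_way(a: float, b: float) -> int:
    return (a > b) - (a < b)

scores = [0.7, 0.1, 0.4]
print(sorted(scores, key=cmp_to_key(three_way)))  # [0.1, 0.4, 0.7]
```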
P.S. have you ever thought about using a similar method to fit a (vine) copula that captures the relationship between the marginals?
It's my hunch that for some data, the copula itself is more stable than means / vars
Regarding the Cython part: yes, that would be great! I ported sklearn's implementation and have a bunch of tests running automatically, so it could be that there are errors with certain versions of gcc. I guess a pull request is the best idea then.
Regarding the last part: good point, and a nice idea; no, I did not yet try this (I tried many other variants, e.g. to take away the loc-scale distribution assumption, none of which I could satisfactorily get to work :)). There are more methods of doing this; it's a bit of a tradeoff between simplicity (i.e., high training speed and low storage requirements) and performance (i.e., distributional and point accuracy). I very much value the speed of GBMs, especially in large-scale settings, so our solution leans towards being a bit more efficient rather than squeezing out every inch of performance. I'm generally of the opinion that it's better to have a 90% good answer quickly than a 100% good answer slowly, as quickly iterating over solutions is more valuable than trying out a single solution.
Cool, I submitted a PR (https://github.com/elephaint/pgbm/pull/24), thanks.
I will go down the copula rabbit hole for a while and let you know if I come up with anything. Thanks!
Describe the bug: nvcc error
To Reproduce:
from pgbm.sklearn import HistGradientBoostingRegressor, crps_ensemble
from pgbm.torch import PGBMRegressor  # If you want to use the Torch backend
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np
Using /home/algo/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/algo/.cache/torch_extensions/py39_cu118/split_decision/build.ninja...
Building extension module split_decision...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF splitgain_cuda.o.d -DTORCH_EXTENSION_NAME=split_decision -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/TH -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/pgbm/torch/splitgain_cuda.cpp -o splitgain_cuda.o
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=split_decision -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/TH -isystem /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -std=c++17 -c /home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/pgbm/torch/splitgain_kernel.cu -o splitgain_kernel.cuda.o
FAILED: splitgain_kernel.cuda.o
/home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/pybind11/cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&)’:
/home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/pybind11/cast.h:42:120: error: expected template-name before ‘<’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                           ^
/home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/pybind11/cast.h:42:120: error: expected identifier before ‘<’ token
/home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/pybind11/cast.h:42:123: error: expected primary-expression before ‘>’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                              ^
/home/algo/code/cqg/bts/spark/ml/.venv/lib64/python3.9/site-packages/torch/include/pybind11/cast.h:42:126: error: expected primary-expression before ‘)’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                ^
ninja: build stopped: subcommand failed.
Expected behavior: expect it to build.
Desktop: alma8 (AlmaLinux 8)
nvcc 12.1:
(.venv) algo@ch3li-fs03:/usr/local/cuda-12.1/bin$ ./nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
gcc 12:
algo@ch3li-fs03:~/code/cqg/bts/spark/ml$ gcc --version
gcc (GCC) 12.1.1 20220628 (Red Hat 12.1.1-3)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.