Exception when running on GPU

slavakx commented 1 year ago

Hi,

I get the following error when I run on GPU. When running on CPU everything works fine. Input dimensions are: train: (72435, 413), target: (72435,)

Estimator 0/1430, Train metric: 0.9326
/local_disk0/.ephemeral_nfs/envs/pythonEnv-10f4fbcf-9bbd-4b56-a409-4a10041ef79a/lib/python3.9/site-packages/pgbm/pgbm.py:298: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
To report the issue, try enable logging via setting the envvariable ` export PYTORCH_JIT_LOG_LEVEL=manager.cpp`
 (Triggered internally at  ../torch/csrc/jit/codegen/cuda/manager.cpp:237.)
  _create_tree(X_train_splits, gradient,
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-10f4fbcf-9bbd-4b56-a409-4a10041ef79a/lib/python3.9/site-packages/pgbm/pgbm.py", line 1135, in fallback_function
            # Compute total split_gain
            split_gain_tot = (Gl * Gl) / (Hl + reg_lambda) +\
                            (G - Gl)*(G - Gl) / (H - Hl + reg_lambda) -\
                             ~~~~~~ 
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <command-1911103480853639>:22
     19 train_final_pgbm_dense = train_final_pgbm.toarray()
     21 print("fitting ngboost")
---> 22 pgbm_model.fit(X = train_final_pgbm_dense, y = target_pgbm)
     24 pgbm_end_time = datetime.now()
     26 pgbm_runtime = (pgbm_end_time - pgbm_start_time).total_seconds()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10f4fbcf-9bbd-4b56-a409-4a10041ef79a/lib/python3.9/site-packages/pgbm/pgbm.py:1528, in PGBMRegressor.fit(self, X, y, eval_set, sample_weight, eval_sample_weight, early_stopping_rounds)
   1526 if self.init_model is None:
   1527     self.learner_ = PGBM()
-> 1528     self.learner_.train(train_set=(X, y), valid_set=eval_set, params=params, objective=self._objective, 
   1529                  metric=self._metric, sample_weight=sample_weight, 
   1530                  eval_sample_weight=eval_sample_weight)
   1531 else:
   1532     self.learner_.train(train_set=(X, y), valid_set=eval_set, objective=self._objective, 
   1533              metric=self._metric, sample_weight=sample_weight, 
   1534              eval_sample_weight=eval_sample_weight)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10f4fbcf-9bbd-4b56-a409-4a10041ef79a/lib/python3.9/site-packages/pgbm/pgbm.py:298, in PGBM.train(self, train_set, objective, metric, params, valid_set, sample_weight, eval_sample_weight)
    294 sample_features = torch.arange(self.n_features, device=self.torch_device) if self.feature_fraction == 1.0 else torch.randperm(self.n_features, device=self.torch_device)[:self.feature_samples]
    295 # Create tree
    296 self.nodes_idx, self.nodes_split_bin, self.nodes_split_feature, self.leaves_idx,\
    297 self.leaves_mu, self.leaves_var, self.feature_importance, yhat_train =\
--> 298     _create_tree(X_train_splits, gradient,
    299                 hessian, estimator, train_nodes, 
    300                 self.nodes_idx, self.nodes_split_bin, self.nodes_split_feature, 
    301                 self.leaves_idx, self.leaves_mu, self.leaves_var, 
    302                 self.feature_importance, yhat_train, self.learning_rate,
    303                 self.max_nodes, samples, sample_features, self.max_bin, 
    304                 self.min_data_in_leaf, self.reg_lambda, 
    305                 self.min_split_gain, self.any_monotone,
    306                 self.monotone_constraints, self.monotone_iterations)                       
    307 # Compute new gradient and hessian
    308 gradient, hessian = self.objective(yhat_train, y_train, sample_weight)

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-10f4fbcf-9bbd-4b56-a409-4a10041ef79a/lib/python3.9/site-packages/pgbm/pgbm.py", line 1135, in fallback_function
            # Compute total split_gain
            split_gain_tot = (Gl * Gl) / (Hl + reg_lambda) +\
                            (G - Gl)*(G - Gl) / (H - Hl + reg_lambda) -\
                             ~~~~~~ <--- HERE
                                (G * G) / (H + reg_lambda)
            # Only consider split gain when enough samples in leaves.
RuntimeError: The size of tensor a (2) must match the size of tensor b (256) at non-singleton dimension 1
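
The final RuntimeError is a broadcasting failure: two operands of the split-gain expression disagree at dimension 1 (2 vs 256), and neither size is 1. Below is a minimal pure-Python sketch of that rule. The helper name `broadcast_shape` and the example shapes are hypothetical, chosen only to mirror the sizes in the message; they are not the actual shapes of `G`, `Gl`, `H` or `Hl` inside pgbm, which the traceback does not show.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    out = []
    ndim = max(len(a), len(b))
    # Compare sizes from the trailing dimension backwards; a missing
    # leading dimension is treated as size 1.
    pairs = zip_longest(reversed(a), reversed(b), fillvalue=1)
    for offset, (x, y) in enumerate(pairs):
        if x != y and x != 1 and y != 1:
            dim = ndim - 1 - offset
            raise ValueError(
                f"size {x} must match size {y} at non-singleton dimension {dim}"
            )
        # A size-1 dimension stretches to the other operand's size.
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


# Hypothetical shapes: a size-1 column broadcasts, a size-2 column does not.
print(broadcast_shape((3, 1), (3, 256)))
try:
    broadcast_shape((3, 2), (3, 256))
except ValueError as err:
    print(err)
```

If this reading is right, one of the tensors reaching the fused kernel has a different second dimension on GPU than on CPU, which would fit the fallback warning printed before the error.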

elephaint commented 1 year ago

Thanks - I'll have a look at it this week. It appears that the CUDA kernel is built incorrectly on newer CUDA versions. I am trying to reproduce this bug.
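
For anyone trying to reproduce this, the UserWarning in the report names two environment variables that may give more detail. Here is a small sketch of setting them from Python; the variable names and values are copied from the warning text, while the helper `enable_nvfuser_debug` is a hypothetical name. Setting them before importing torch is an assumption made to be safe, not something the warning states.

```python
import os

# Names and values copied from the UserWarning in the report above.
NVFUSER_DEBUG_ENV = {
    "PYTORCH_NVFUSER_DISABLE": "fallback",   # disable the codegen fallback path
    "PYTORCH_JIT_LOG_LEVEL": "manager.cpp",  # enable logging for manager.cpp
}


def enable_nvfuser_debug(env=os.environ):
    """Apply the debug settings to `env`; return the keys that changed."""
    changed = [k for k, v in NVFUSER_DEBUG_ENV.items() if env.get(k) != v]
    env.update(NVFUSER_DEBUG_ENV)
    return changed


if __name__ == "__main__":
    print(enable_nvfuser_debug())
    # Import torch and pgbm only after this point (assumption: this
    # guarantees the variables are visible when the JIT reads them).
```

With the fallback disabled, the run should fail at the code generation step itself instead of in the fallback function, which may point closer to the cause.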

slavakx commented 1 year ago

Maybe this is also relevant:

Importing pgbm works only for some Python and CUDA versions. Under Linux, I could import pgbm without errors only with the following configuration. In all other cases I got "error building extension 'split_decision'".

python 3.9

conda install pytorch==1.12.0 torchvision==0.13.0 cudatoolkit=11.3 -c pytorch

Then I installed the cuda_12.0.0_525.60.13_linux.run driver from the NVIDIA website and configured PATH and LD_LIBRARY_PATH:

 export PATH=/usr/local/cuda-12.0/bin${PATH:+:${PATH}}
 export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-12.0/lib64

slavakx commented 1 year ago

> Thanks - I'll have a look at it this week. It appears that the CUDA kernel is built incorrectly on newer CUDA versions. I am trying to reproduce this bug.

The same error occurs in example02_housing_gpu.ipynb.

elephaint commented 1 year ago

Hi,

I've been trying to reproduce this, but unfortunately without success. For now, there's a faster version of PGBM available on CPU through a fork of scikit-learn's HistGradientBoostingRegressor (docs). Maybe that can already help you. In the meantime, I'm still trying to reproduce.

slavakx commented 1 year ago

> Hi,
>
> I've been trying to reproduce this, but unfortunately without success. For now, there's a faster version of PGBM available on CPU through a fork of scikit-learn's HistGradientBoostingRegressor (docs). Maybe that can already help you. In the meantime, I'm still trying to reproduce.

Hi, thanks for the information. I've managed to run it on Linux using CUDA 11.1 and 11.3. Trying to replicate the same approach on Windows failed, so currently my problem is running PGBM on Windows.

elephaint commented 1 year ago

I still can't reproduce this issue, annoyingly... Could you (i) lay out the steps you took that led to the error on Windows (the packages you installed, and the order in which they were installed), and (ii) which versions you installed that generated the error?

I have Windows here too but everything runs fine - I tried different versions of CUDA and all worked without issue. It must be some combination of Torch + CUDA that doesn't work, but I can't find it...
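
To make the version details easier to share, something like the following sketch could be run in the failing environment and its output pasted here. It uses only the standard library plus an optional `import torch`; the function names `package_version`, `nvcc_version_output` and `environment_report` are hypothetical helpers, and it assumes pgbm and torch were installed under the distribution names `pgbm` and `torch`.

```python
import platform
import shutil
import subprocess
import sys
from importlib import metadata


def package_version(name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def nvcc_version_output():
    """Raw `nvcc --version` output, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    try:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() or None


def environment_report():
    """Collect OS, Python, package and CUDA toolkit details in one dict."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "torch": package_version("torch"),
        "pgbm": package_version("pgbm"),
        "nvcc": nvcc_version_output(),
        "torch_cuda": None,
        "cuda_available": None,
    }
    try:
        import torch
    except (ImportError, OSError):
        return report  # torch missing or not loadable; leave the fields as None
    report["torch_cuda"] = torch.version.cuda  # CUDA version torch was built with
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Comparing `torch_cuda` with the release reported by `nvcc` shows whether the toolkit used to build the extension differs from the CUDA version torch was built against, which is the kind of Torch + CUDA combination in question.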

elephaint commented 5 months ago

Closing this issue for inactivity, hope it was resolved on your end....

elephaint / pgbm

Exception when running on GPU #18