catboost / catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
https://catboost.ai
Apache License 2.0

Kernel dying when I execute with GPU #1735

Closed. Almoal closed this issue 3 years ago.

Almoal commented 3 years ago

Problem: "Error: kernel connection broken" catboost version: 0.26 Operating System: Windows 10 CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, 2592 Mhz GPU: NVIDIA Quadro T1000 CUDA toolkit: 11.3.1

Hi! I'm running a classification model with CatBoost, but when I try to execute it with task_type = 'GPU' a message appears saying the kernel connection is broken. If I execute it on the CPU I don't have any problem.

At the beginning I saw in the task manager that the GPU memory was at 100%, and I tried limiting the usage, but the error persists.

The error appears after roughly 600 iterations. The details of my model are:

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations = 100000
    , learning_rate = 0.025
    , verbose = 2500
    , early_stopping_rounds = 1000
    , cat_features = cat_col
    , loss_function = "Logloss"
    , eval_metric = "F1"
    , task_type = "GPU"
    , gpu_ram_part = 0.75
    , depth = 8
    , border_count = 32
    , max_ctr_complexity = 2
    , random_seed = 42
)
# train_pool and val_pool are catboost.Pool objects prepared beforehand
model.fit(
    train_pool
    , eval_set = val_pool
    , plot = False
)
lastadreamer commented 3 years ago

I also have this problem, even though the data I use is very small, so I don't think it is a matter of GPU memory.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

data = load_breast_cancer()

# train_test_split returns (X_train, X_test, y_train, y_test)
xtrain, xtest, ytrain, ytest = train_test_split(data.data, data.target)
module = CatBoostClassifier(task_type='GPU')
module.fit(xtrain, ytrain, eval_set=(xtest, ytest), plot=False)

lastadreamer commented 3 years ago

You can thank me, I found out why. Try installing an older version; 0.26 is problematic.
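
For reference, a minimal sketch of checking the installed version and pinning the older release; the exact install command is an assumption and may need adapting to your environment (conda, virtualenv, Jupyter):

# Check which catboost build is active in the current environment.
import catboost
print(catboost.__version__)   # e.g. '0.26'

# To pin the last release reported as working in this thread (run in a shell):
#   pip install catboost==0.25.1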

Almoal commented 3 years ago

I was already working with 0.25.1, but the problem is that I have two GPUs and this error happens with 0.25.1 on one of them. I reported this issue now because on 0.26 it happens on both GPUs.
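
A minimal sketch, assuming the devices parameter that also appears later in this thread, of pinning training to one card at a time so the two GPUs can be checked independently (names and values are illustrative):

from catboost import CatBoostClassifier

# Restrict training to GPU 0; change devices='1' to test the second card.
model = CatBoostClassifier(
    task_type = 'GPU'
    , devices = '0'
    , iterations = 1000
    , random_seed = 42
)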

DanielLumb commented 3 years ago

Maybe a related issue: for me version 0.26 also stopped working on most tasks (training and the new prediction mode using task_type='GPU'). Downgrading to 0.25.1 solves all problems.

A few symptoms happen only on 0.26; maybe they will be helpful in diagnosing this:

Traceback (most recent call last):
  Debug Console, prompt 225, line 1
    # -*- coding: utf8 -*-
  File "d:\app\Python38\Lib\site-packages\catboost\core.py", line 4729, in predict
    return self._predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose, 'predict', task_type)
  File "d:\app\Python38\Lib\site-packages\catboost\core.py", line 2177, in _predict
    predictions = self._base_predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose, task_type)
  File "d:\app\Python38\Lib\site-packages\catboost\core.py", line 1477, in _base_predict
    return self._object._base_predict(pool, prediction_type, ntree_start, ntree_end, thread_count, verbose, task_type)
  File "d:\app\Python38\Lib\site-packages\catboost\_catboost.pyd", line 4482, in _catboost._CatBoost._base_predict
  File "d:\app\Python38\Lib\site-packages\catboost\_catboost.pyd", line 4489, in _catboost._CatBoost._base_predict
_catboost.CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/library/cpp/cuda/wrappers/cuda_vec.h:276: 10 ≠ 200

There's also a relation between the shape of the data being passed for prediction and the error message (10 ≠ 200), for example (for a model trained with 75 features):

data shape    error message
(10, 75)      10 ≠ 200
(100, 75)     100 ≠ 2000
(200, 75)     200 ≠ 4000
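
In each case the reported size is 20 times the number of rows passed to predict. A minimal sketch of the call pattern that raises the check, using the task_type argument to predict that appears in the traceback above (the model and data here are placeholders, not DanielLumb's code):

import numpy as np
from catboost import CatBoostRegressor

# Placeholder model with 75 features, mirroring the shapes in the table above.
X = np.random.rand(200, 75)
y = np.random.rand(200)
model = CatBoostRegressor(iterations=10, task_type='GPU', devices='0', verbose=False)
model.fit(X, y)

preds_cpu = model.predict(X)                    # CPU prediction: works
preds_gpu = model.predict(X, task_type='GPU')   # GPU prediction: raises the cuda_vec.h check on 0.26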


As already said: downgrading to 0.25.1 solves all issues, so the problem might have been introduced in 0.26.

catboost version: 0.26
Operating System: Windows 10
GPU: NVIDIA Quadro P4000
CUDA version: 11.1
NVIDIA driver version: 456.71
DanielLumb commented 3 years ago

I can also see substantial similarity between my first symptom listed above (process termination just after training) and what was described in #1732, although there it is claimed the issue was present before 0.26 as well, while I experienced it only after upgrading to 0.26.

kizill commented 3 years ago

I'll work this out today.

holinov commented 3 years ago

I have the same symptoms as @DanielLumb (Win10).

kizill commented 3 years ago

That was a really tough bug: it turned out to be a compiler bug in which the destructor of a temporary object created by a type-casting operator() was never called. Seven working days of debugging and voilà! The fix will be merged in https://github.com/catboost/catboost/pull/1763, and then we will publish release 0.26.1 in a matter of days. Thank you all for your patience 😺

DanielLumb commented 3 years ago

Whoa, such bugs are a true nightmare to find. I appreciate that very much; a big, big thank you @kizill for finding and fixing this!

mm0708 commented 3 years ago

I'm so glad there was a solution to this; I was driving myself batty yesterday trying to figure out why my R catboost installation failed during training only when using the GPU.

DanielLumb commented 3 years ago

Many critically needed fixes and a lot of your work went into 0.26.1. Any hint as to when the 0.26.1 release could happen?

kizill commented 3 years ago

Published 0.26.1 with fix.

renzeya commented 3 years ago

Published 0.26.1 with fix.

Still dying when using the Python package 0.26.1.

kizill commented 3 years ago

@renzeya Maybe this is something else; can you provide more details and a small reproducing code sample if possible?

renzeya commented 3 years ago

GPU0: RTX3080, GPU1: RTX1070, Windows 10

cat_model = cb.CatBoostRegressor(iterations=600, verbose=2, loss_function="Quantile:alpha=0.45",
    eval_metric="Quantile:alpha=0.45", task_type='GPU', devices='0', border_count=32, gpu_ram_part=0.49, has_time=True)

cat_model.fit(X_train, y_train, eval_set=(X_validation, y_validation), plot=False,
    use_best_model=True, early_stopping_rounds=max(int(iter_number / 3), 600))

Downgrading to 0.25.1 solves all problems.

Thanks!

DanielLumb commented 3 years ago

From my experience, 0.26.1 really fixed the GPU crash that was introduced in 0.26.0 (which broke practically all GPU functionality on Windows).

What still gives an error/crash is prediction using task_type='GPU', which was also introduced in 0.26.0, but that is tracked by #972 and maybe that issue should be reopened.
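
For reference, a minimal sketch of the workaround implied here: train on the GPU, but leave prediction on the CPU default until the GPU prediction path tracked in #972 is fixed. The dataset and parameter choices are illustrative and mirror the repro earlier in the thread:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# GPU training works again on 0.26.1.
model = CatBoostClassifier(task_type='GPU', iterations=200, verbose=False)
model.fit(X_train, y_train, eval_set=(X_test, y_test))

# Keep prediction on the CPU (the default); the task_type='GPU' prediction
# path is the part still tracked by #972.
preds = model.predict(X_test)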