I also have this problem, even though the data I use is very small, so I don't think it's about the GPU:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

# train_test_split returns (xtrain, xtest, ytrain, ytest), in that order
data = load_breast_cancer()
xtrain, xtest, ytrain, ytest = train_test_split(data.data, data.target)

model = CatBoostClassifier(task_type='GPU')
model.fit(xtrain, ytrain, eval_set=(xtest, ytest), plot=False)
```
You can thank me later: try downloading an older version. 0.26 is the problematic one.
I'm back to working with 0.25.1 as before, but the problem is that I have two GPUs, and this error happens with 0.25.1 on one of them. I reported this issue now because on 0.26 it happens on both GPUs.
Maybe a related issue: for me, version 0.26 also stopped working on most tasks (training, and the new prediction mode using task_type='GPU'). Downgrading to 0.25.1 solves all problems.
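For anyone else hitting this, the downgrade itself is just a pinned install (pip shown here; adjust accordingly if you use conda):

```
pip install catboost==0.25.1
```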
A few symptoms happen only on 0.26; maybe they will be helpful in diagnosing this:
```
>>> cb_model.predict(np.random.randn(10, 75), task_type='GPU')
Traceback (most recent call last):
  Debug Console, prompt 225, line 1
  File "d:\app\Python38\Lib\site-packages\catboost\core.py", line 4729, in predict
    return self._predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose, 'predict', task_type)
  File "d:\app\Python38\Lib\site-packages\catboost\core.py", line 2177, in _predict
    predictions = self._base_predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose, task_type)
  File "d:\app\Python38\Lib\site-packages\catboost\core.py", line 1477, in _base_predict
    return self._object._base_predict(pool, prediction_type, ntree_start, ntree_end, thread_count, verbose, task_type)
  File "d:\app\Python38\Lib\site-packages\catboost\_catboost.pyd", line 4482, in _catboost._CatBoost._base_predict
  File "d:\app\Python38\Lib\site-packages\catboost\_catboost.pyd", line 4489, in _catboost._CatBoost._base_predict
_catboost.CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/library/cpp/cuda/wrappers/cuda_vec.h:276: 10 ≠ 200
```
There's also a relation between the shape of the data passed for prediction and the numbers in the error message (10 ≠ 200). For example, for a model trained with 75 features:
| data shape | error message |
| --- | --- |
| (10, 75) | 10 ≠ 200 |
| (100, 75) | 100 ≠ 2000 |
| (200, 75) | 200 ≠ 4000 |
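The right-hand number is always 20 × the number of rows. A minimal loop to reproduce the pattern (a sketch; it assumes `cb_model` is the 75-feature model from above):

```python
import numpy as np

# On 0.26 each call raises CatBoostError "<n> ≠ <20*n>" from cuda_vec.h;
# on 0.25.1 the same calls succeed.
for n in (10, 100, 200):
    try:
        cb_model.predict(np.random.randn(n, 75), task_type='GPU')
    except Exception as e:
        print(n, '->', e)
```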
As already said, downgrading to 0.25.1 solves all of these issues, so the bug was likely introduced in 0.26.
catboost version: 0.26
Operating System: Windows 10
GPU: NVIDIA Quadro P4000
CUDA version: 11.1
NVIDIA driver version: 456.71
I can also see substantial similarity between my first symptom listed above (process termination right after training) and what was described in #1732, although there it is claimed the issue was present before 0.26 as well, while I experienced it only and exclusively after upgrading to 0.26.
I'll work this out today.
I have the same symptoms as @DanielLumb (Win10).
That was a really tough bug: a compiler bug where the destructor of a temporary object produced by a type-casting operator() was not called. Seven work days of debugging and voilà! Will merge the fix in https://github.com/catboost/catboost/pull/1763 and then we will publish release 0.26.1 in a matter of days. Thank you all for your patience 😺
Whoa, such bugs are a true nightmare to find.. Appreciate that very much and big big thank you @kizill for finding & fixing this!
I'm so glad there was a solution to this; I was driving myself batty yesterday trying to figure out why my R catboost installation failed during training only when using the GPU.
Many critically needed fixes and a lot of your work went into 0.26.1. Any hint as to when the 0.26.1 release could happen?
Published 0.26.1 with fix.
Still crashing using the Python 0.26.1 package.
@renzeya Maybe we have something else here. Can you provide more details and a small reproducing snippet if possible?
GPU0: RTX 3080, GPU1: GTX 1070, Windows 10
```python
import catboost as cb

iter_number = 600  # not defined in the original snippet; assumed to match iterations
cat_model = cb.CatBoostRegressor(
    iterations=iter_number, verbose=2,
    loss_function="Quantile:alpha=0.45", eval_metric="Quantile:alpha=0.45",
    task_type='GPU', devices='0',
    border_count=32, gpu_ram_part=0.49, has_time=True,
)
# X_train, y_train, X_validation, y_validation come from my own dataset
cat_model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    plot=False, use_best_model=True,
    early_stopping_rounds=max(int(iter_number / 3), 600),
)
```
Downgrading to 0.25.1 solves all problems.
Thanks!
From my experience, 0.26.1 really fixed the GPU crash that was introduced in 0.26 (which broke practically all GPU functionality on Windows).
What still errors/crashes is prediction using task_type='GPU', which was also introduced in 0.26, but that is tracked in #972, and maybe that issue should be reopened.
Problem: "Error: kernel connection broken"
catboost version: 0.26
Operating System: Windows 10
CPU: Intel(R) Core(TM) i7-10750H @ 2.60GHz, 2592 MHz
GPU: NVIDIA Quadro T1000
CUDA toolkit: 11.3.1
Hi! I'm running a classification model with CatBoost, but when I try to execute it with task_type='GPU' a message appears saying the kernel connection is broken. If I execute it on CPU, I don't have any problem.
At the beginning I saw in Task Manager that the GPU memory was at 100%, and I tried limiting the usage, but the error persists.
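Roughly, the limiting attempt looked like this (a sketch; gpu_ram_part caps the fraction of GPU memory CatBoost may use, and my other model parameters are omitted here):

```python
from catboost import CatBoostClassifier

# Sketch: cap CatBoost's GPU memory usage; the actual parameters
# of my classification model are not shown.
model = CatBoostClassifier(task_type='GPU', gpu_ram_part=0.5)
```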
The error appears after roughly 600 iterations. The details of my model are: