Xtra-Computing / thundergbm

ThunderGBM: Fast GBDTs and Random Forests on GPUs
Apache License 2.0
695 stars, 88 forks

FATAL [default] Check failed: [error == cudaSuccess] out of memory #37

Closed. Kagaratsch closed this issue 4 years ago

Kagaratsch commented 4 years ago

My system is:

(_env) D:\_env\project\series>nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 442.19       Driver Version: 442.19       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8     7W /  N/A |    219MiB /  8192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7852    C+G   ...xperience\NVIDIA GeForce Experience.exe N/A      |
+-----------------------------------------------------------------------------+
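For reports like this one, the memory figures can also be read programmatically rather than copied from the table. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; `parse_memory_line` and `query_gpu_memory` are hypothetical helper names, and the sample values are the 219MiB / 8192MiB figures from the nvidia-smi output above.

```python
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into (used, total, free)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total, total - used


def query_gpu_memory():
    """Ask nvidia-smi for memory usage; returns one (used, total, free) tuple per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]


# Values reported above: 219 MiB used out of 8192 MiB.
print(parse_memory_line("219, 8192"))  # (219, 8192, 7973)
```

Calling `query_gpu_memory()` just before `fit` would show how much memory is free at the moment the training starts.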

I am on Windows 10.

When trying to run the TGBMClassifier, I get the following error:

2020-02-22 20:13:15,901 INFO [default] #instances = 20289, #features = 40
2020-02-22 20:13:16,503 INFO [default] convert csr to csc using gpu...
2020-02-22 20:13:16,878 INFO [default] Converting csr to csc using time: 0.363863 s
2020-02-22 20:13:16,878 INFO [default] Fast getting cut points...
2020-02-22 20:13:16,878 FATAL [default] Check failed: [error == cudaSuccess] out of memory
2020-02-22 20:13:16,894 WARNING [default] Aborting application. Reason: Fatal log at [D:\_env\project\series\thundergbm\src\thundergbm\syncmem.cpp:107]

Any suggestions on how to fix this?

Kurt-Liuhf commented 4 years ago

Hi @Kagaratsch, would you mind providing your code and data? This will help us reproduce your result and help fix the issue. Thank you.

Kagaratsch commented 4 years ago

The code I use is:

import numpy as np
import pandas as pd
from thundergbm import *

# Load the data as float32; the raw string keeps the backslash in the Windows path literal.
x_train = pd.read_csv(r'.\data.csv', delimiter=',', engine='c', dtype=np.float32)
# Split the label column off from the features.
y_train = x_train.pop('label')

rf_model = TGBMClassifier(n_trees=1000, bagging=1, num_class=3,
                          column_sampling_rate=0.5, depth=100)
rf_model.fit(x_train, y_train)
rf_prediction = rf_model.predict(x_train)
print(rf_prediction)

A snippet of the data with 300 entries can be found here: https://pastebin.com/ASAEVL94

This particular example now gives a slightly different error, but I suspect that they are related:

2020-02-23 17:02:50,006 INFO [default] #instances = 300, #features = 40
2020-02-23 17:02:50,102 INFO [default] convert csr to csc using gpu...
2020-02-23 17:02:50,890 INFO [default] Converting csr to csc using time: 0.784459 s
2020-02-23 17:02:50,894 INFO [default] Fast getting cut points...
2020-02-23 17:02:50,901 FATAL [default] Check failed: [buf_array.size() > new_size] The size of the target Syncarray must greater than the new size.
2020-02-23 17:02:50,907 WARNING [default] Aborting application. Reason: Fatal log at [D:/_env/project/series/thundergbm/src/thundergbm/hist_cut.cu:87]

Kagaratsch commented 4 years ago

In fact, to reproduce the original error message, we can use the same data.csv as above and run:

import numpy as np
import pandas as pd
from thundergbm import *

x_train = pd.read_csv(r'.\data.csv', delimiter=',', engine='c', dtype=np.float32)
x_copy = x_train.copy()
# Stack 70 extra copies onto the 300-row sample (300 * 71 = 21300 rows).
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
for i in range(70):
    x_train = pd.concat([x_train, x_copy])
y_train = x_train.pop('label')

rf_model = TGBMClassifier(n_trees=1000, bagging=1, num_class=3,
                          column_sampling_rate=0.5, depth=100)
rf_model.fit(x_train, y_train)
rf_prediction = rf_model.predict(x_train)
print(rf_prediction)

This now gives the error:

2020-02-23 17:13:41,923 INFO [default] #instances = 21300, #features = 40
2020-02-23 17:13:41,979 INFO [default] convert csr to csc using gpu...
2020-02-23 17:13:42,346 INFO [default] Converting csr to csc using time: 0.364385 s
2020-02-23 17:13:42,348 INFO [default] Fast getting cut points...
2020-02-23 17:13:42,444 FATAL [default] Check failed: [error == cudaSuccess] out of memory
2020-02-23 17:13:42,447 WARNING [default] Aborting application. Reason: Fatal log at [D:\_env\project\series\thundergbm\src\thundergbm\syncmem.cpp:107]
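The logs in this thread differ mainly in the instance count and in which check fails, so it can help to pull those two facts out automatically when comparing runs. The sketch below assumes the log format is exactly what the lines quoted in this thread show (timestamp, level, `[default]`, message); `split_entries`, `dataset_shape`, and `fatal_messages` are hypothetical helper names, not part of ThunderGBM. It works whether the entries are on separate lines or run together.

```python
import re

# Matches the start of an entry such as:
# 2020-02-23 17:13:42,444 FATAL [default] ...
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (INFO|WARNING|FATAL) \[default\] "
)


def split_entries(text):
    """Split log text into (timestamp, level, message) tuples."""
    parts = ENTRY.split(text)
    # parts = [prefix, ts1, level1, msg1, ts2, level2, msg2, ...]
    return [
        (parts[i], parts[i + 1], parts[i + 2].strip())
        for i in range(1, len(parts) - 2, 3)
    ]


def dataset_shape(entries):
    """Return (instances, features) from the first matching entry, else None."""
    for _, _, message in entries:
        match = re.match(r"#instances = (\d+), #features = (\d+)", message)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def fatal_messages(entries):
    """Return the message text of every FATAL entry."""
    return [message for _, level, message in entries if level == "FATAL"]


# Three entries copied from the log above.
sample = (
    "2020-02-23 17:13:41,923 INFO [default] #instances = 21300, #features = 40 "
    "2020-02-23 17:13:42,348 INFO [default] Fast getting cut points... "
    "2020-02-23 17:13:42,444 FATAL [default] Check failed: "
    "[error == cudaSuccess] out of memory"
)
entries = split_entries(sample)
print(dataset_shape(entries))
print(fatal_messages(entries))
```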

Kurt-Liuhf commented 4 years ago

Thanks for your help. We have updated the ThunderGBM wheel file. You can download it from here and reinstall ThunderGBM with pip. Please let us know whether it works on your end.
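The reinstall step could be scripted roughly as follows. This is a sketch only: the wheel file name is a hypothetical placeholder (the thread does not give the actual name), and `reinstall_command` is an illustrative helper. It uses pip's `--force-reinstall` option so that the previously installed build is replaced.

```python
import subprocess
import sys


def reinstall_command(wheel_path):
    """Build a pip command that force-reinstalls a local wheel into this interpreter."""
    return [sys.executable, "-m", "pip", "install", "--force-reinstall", wheel_path]


# Hypothetical file name; substitute the path of the wheel you downloaded.
cmd = reinstall_command("thundergbm-placeholder.whl")
print(" ".join(cmd))

# Uncomment to actually run the reinstall:
# subprocess.run(cmd, check=True)
```

Running pip through `sys.executable -m pip` installs into the same environment as the interpreter that will later import thundergbm.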

zeyiwen commented 4 years ago

The problem seems to be fixed now. We would like to close this issue.