dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License
2.59k stars 483 forks source link

scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device)) RuntimeError: CUDA error: device-side assert triggered #240

Closed ThomasWolf0701 closed 3 years ago

ThomasWolf0701 commented 3 years ago

Describe the bug When running on GPU Tabnet crashes with scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device)) RuntimeError: CUDA error: device-side assert triggered

What is the current behavior? It works when the matrix I use contains only integers but fails with floats. I also made sure that NaN values are imputed and there are no Inf. Also the largest value fits into float32, Also set the batch size to a very low level.

If the current behavior is a bug, please provide the steps to reproduce. tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)

Expected behavior

Screenshots

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),batch_size = 10) Traceback (most recent call last):

File "", line 1, in tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),batch_size = 10)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 329, in fit fit_params_steps = self._check_fit_params(**fit_params)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 248, in _check_fit_params "=sample_weight)`.".format(pname))

ValueError: Pipeline.fit does not accept the batch_size parameter. You can pass parameters to specific steps of your pipeline using the stepnameparameter format, e.g. `Pipeline.fit(X, y, logisticregressionsample_weight=sample_weight)`.

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10) No early stopping will be performed, last training weights will be used. Traceback (most recent call last):

File "", line 1, in tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit self._train_epoch(train_dataloader)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch batch_logs = self._train_batch(X, y)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch output, M_loss = self.network(X)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward return self.tabnet(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward out = self.feat_transformersstep

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward x = self.shared(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))

RuntimeError: CUDA error: device-side assert triggered

Other relevant information: poetry version:
python version: Operating System: Additional tools:

Additional context

Optimox commented 3 years ago

Hello,

This line ValueError: Pipeline.fit does not accept the batch_size parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight) makes me think that you are using tabnet inside a sklearn pipeline.

Tabnet is not compatible with all sklearn pipeline, I guess that's the problem. Could you share the code you are running with TabNet?

ThomasWolf0701 commented 3 years ago

Here it is:

imputer = SimpleImputer(missing_values=np.nan,strategy='mean') scorer = make_scorer(mean_squared_error, greater_is_better= False)

inner_cv = TimeSeriesSplit(n_splits=5)#.split(featureMatrix) outer_cv = PredefinedHoldoutSplit(valid_indices=[range(0,100,1)]

set the training parameters for Random Forest

paramsTab = { 'mn_steps': randint(1,3), 'm__n_a': randint(8,64), 'mn_d': randint(8,64), 'mgamma': uniform(1, 1), 'mn_shared': randint(1, 5), 'm__n_independent': randint(1, 5), 'mmomentum': loguniform(0.01, 0.4), "mmask_type":["sparsemax", "entmax"] }

tab_model = TabNetRegressor(device_name = "cuda")

tab_model = Pipeline(steps=[('i', imputer),('m', tab_model)])

tab_search = RandomizedSearchCV(tab_model,scoring = scorer ,param_distributions=paramsTab, random_state=42, cv=inner_cv, verbose=5, n_jobs=1, return_train_score=True)

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10) tab_search.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

ThomasWolf0701 commented 3 years ago

Also tried without batch_size and now i get: tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values)) No early stopping will be performed, last training weights will be used. Traceback (most recent call last):

File "", line 1, in tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit self._train_epoch(train_dataloader)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch batch_logs = self._train_batch(X, y)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch output, M_loss = self.network(X)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward return self.tabnet(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward out = self.feat_transformersstep

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward x = self.shared(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))

RuntimeError: CUDA error: device-side assert triggered

ThomasWolf0701 commented 3 years ago

I also did a grid search using the cpu, for most iterations it works but some give an error:

[CV] mgamma=1.3745401188473625, mmask_type=sparsemax, mmomentum=0.019673133446482256, mn_independent=4, m__n_shared=1, m__n_steps=1, score=(train=nan, test=nan), total= 38.5sC:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: Traceback (most recent call last): File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_validation.py", line 531, in _fit_and_score estimator.fit(X_train, y_train, fit_params) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit self._final_estimator.fit(Xt, y, fit_params_last_step) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit self._train_epoch(train_dataloader) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch batch_logs = self._train_batch(X, y) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch output, M_loss = self.network(X) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, kwargs) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward return self.tabnet(x) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, *kwargs) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 144, in forward M = self.att_transformers[step](prior, att) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(input, kwargs) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 322, in forward x = self.selector(x) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 91, in forward return sparsemax(input, self.dim) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 43, in forward tau, supp_size = SparsemaxFunction._threshold_and_support(input, dim=dim) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 76, in _threshold_and_support tau = input_cumsum.gather(dim, support_size - 1) RuntimeError: index -1 is out of bounds for dimension 1 with size 2423

FitFailedWarning) [Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 2.9min remaining: 0.0s

Optimox commented 3 years ago

Well I think this is related to GridSearchCV and TabNet you have to wrap tabnet into a bigger class. This is not included in the library yet.

You can have a look at #238 and #80. Maybe @nclibz can help you.

ThomasWolf0701 commented 3 years ago

I also crashes when i do not use grid search but just fit the model: tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

Here is a runnable code example, and the example data as a pickle: bugfixData.zip

import numpy as np import pandas as pd from scipy import sparse from sklearn.impute import SimpleImputer from sklearn.model_selection import TimeSeriesSplit from mlxtend.evaluate import PredefinedHoldoutSplit from sklearn.model_selection import RandomizedSearchCV from sklearn.metrics import mean_squared_error from sklearn.metrics import make_scorer from scipy.stats import uniform from scipy.stats import loguniform from scipy.stats import randint from pytorch_tabnet.tab_model import TabNetRegressor from sklearn.pipeline import Pipeline

imputer = SimpleImputer(missing_values=np.nan,strategy='mean')

scorer = make_scorer(mean_squared_error, greater_is_better= False)

inner_cv = TimeSeriesSplit(n_splits=5)

outer_cv = PredefinedHoldoutSplit(valid_indices=[range(0,100,1)])

set the training parameters for Random Forest

paramsTab = { 'mn_steps': randint(1,3), 'm__n_a': randint(8,64), 'mn_d': randint(8,64), 'mgamma': uniform(1, 1), 'mn_shared': randint(1, 5), 'm__n_independent': randint(1, 5), 'mmomentum': loguniform(0.01, 0.4), "mmask_type":["sparsemax", "entmax"] }

tab_model = TabNetRegressor(device_name = "cuda")

tab_model = Pipeline(steps=[('i', imputer),('m', tab_model)])

tab_search = RandomizedSearchCV(tab_model,scoring = scorer ,param_distributions=paramsTab, random_state=42, cv=inner_cv, verbose=5, n_jobs=1, return_train_score=True)

run with basic settings:

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

run with feature search

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

ThomasWolf0701 commented 3 years ago

Here the error without grid search on the GPU: Device used : cuda

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values)) No early stopping will be performed, last training weights will be used. Traceback (most recent call last):

File "", line 1, in tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit self._train_epoch(train_dataloader)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch batch_logs = self._train_batch(X, y)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch output, M_loss = self.network(X)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward return self.tabnet(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward out = self.feat_transformersstep

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward x = self.shared(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))

RuntimeError: CUDA error: device-side assert triggered

ThomasWolf0701 commented 3 years ago

And here without gridsearch on the cpu:

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values)) No early stopping will be performed, last training weights will be used. epoch 0 | loss: 20287.99239| 0:00:05s epoch 1 | loss: 20290.36925| 0:00:09s Traceback (most recent call last):

File "", line 1, in tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit self._train_epoch(train_dataloader)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch batch_logs = self._train_batch(X, y)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch output, M_loss = self.network(X)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward return self.tabnet(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 144, in forward M = self.att_transformers[step](prior, att)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 322, in forward x = self.selector(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 91, in forward return sparsemax(input, self.dim)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 43, in forward tau, supp_size = SparsemaxFunction._threshold_and_support(input, dim=dim)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 76, in _threshold_and_support tau = input_cumsum.gather(dim, support_size - 1)

RuntimeError: index -1 is out of bounds for dimension 1 with size 2423

ThomasWolf0701 commented 3 years ago

When i run grid search on the cpu it actually works on some combinations and fails on others:

[CV] mgamma=1.3745401188473625, mmask_type=sparsemax, mmomentum=0.019673133446482256, mn_a=15, mn_d=28, m__n_independent=3, mn_shared=2, mn_steps=1, score=(train=-17302.082, test=-26498.246), total= 53.9s Device used : cpu [CV] mgamma=1.3745401188473625, mmask_type=sparsemax, mmomentum=0.019673133446482256, mn_a=15, m__n_d=28, mn_independent=3, mn_shared=2, mn_steps=1 No early stopping will be performed, last training weights will be used. [Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 1.4min remaining: 0.0s epoch 0 | loss: 21073.06641| 0:00:00s epoch 1 | loss: 21039.7832| 0:00:01s epoch 2 | loss: 21034.89062| 0:00:02s epoch 3 | loss: 21004.1582| 0:00:02s epoch 4 | loss: 20963.84766| 0:00:03s epoch 5 | loss: 20944.4707| 0:00:04s epoch 6 | loss: 20882.54297| 0:00:04s epoch 7 | loss: 20821.25977| 0:00:05s epoch 8 | loss: 20774.79102| 0:00:06s epoch 9 | loss: 20731.32812| 0:00:07s epoch 10 | loss: 20618.95117| 0:00:07s epoch 11 | loss: 20547.69727| 0:00:08s epoch 12 | loss: 20460.16602| 0:00:09s epoch 13 | loss: 20259.37695| 0:00:09s epoch 14 | loss: 20088.0 | 0:00:10s epoch 15 | loss: 19996.99023| 0:00:11s epoch 16 | loss: 19853.4375| 0:00:12s epoch 17 | loss: 19550.07812| 0:00:12s epoch 18 | loss: 19418.98438| 0:00:13s epoch 19 | loss: 19086.25781| 0:00:14s epoch 20 | loss: 18964.19531| 0:00:14s epoch 21 | loss: 18776.58594| 0:00:15s epoch 22 | loss: 18383.12109| 0:00:16s epoch 23 | loss: 18161.46094| 0:00:16s epoch 24 | loss: 17829.95117| 0:00:17s epoch 25 | loss: 17601.00781| 0:00:18s epoch 26 | loss: 17238.53516| 0:00:18s epoch 27 | loss: 17147.02734| 0:00:19s epoch 28 | loss: 16768.46484| 0:00:20s epoch 29 | loss: 16510.31641| 0:00:20s epoch 30 | loss: 16261.33984| 0:00:21s epoch 31 | loss: 16023.36523| 0:00:22s epoch 32 | loss: 15777.2998| 0:00:23s epoch 33 | loss: 15456.74023| 0:00:23s epoch 34 | loss: 15194.16406| 0:00:24s epoch 35 | loss: 14878.4541| 0:00:25s epoch 36 | loss: 14469.09863| 0:00:26s epoch 37 | loss: 14036.95996| 0:00:26s epoch 38 | loss: 13652.70703| 0:00:27s epoch 39 | loss: 13418.50098| 0:00:28s epoch 40 | loss: 13139.40527| 0:00:28s epoch 41 | loss: 12592.125| 0:00:29s epoch 42 | loss: 12505.69727| 0:00:30s epoch 43 | loss: 12072.10449| 0:00:30s epoch 44 | loss: 11610.03027| 0:00:31s epoch 45 | loss: 11290.94141| 0:00:32s epoch 46 | loss: 10918.68359| 0:00:32s epoch 47 | loss: 10666.70801| 0:00:33s epoch 48 | loss: 10278.82324| 0:00:34s epoch 49 | loss: 9967.96875| 0:00:35s epoch 50 | loss: 9642.63379| 0:00:35s epoch 51 | loss: 9285.58594| 0:00:36s epoch 52 | loss: 8993.60645| 0:00:37s epoch 53 | loss: 8890.46582| 0:00:37s epoch 54 | loss: 8505.64355| 0:00:38s epoch 55 | loss: 8310.10352| 0:00:39s epoch 56 | loss: 7661.33545| 0:00:40s epoch 57 | loss: 7571.51367| 0:00:40s epoch 58 | loss: 7224.49414| 0:00:41s epoch 59 | loss: 7024.98682| 0:00:42s epoch 60 | loss: 6703.88184| 0:00:42s epoch 61 | loss: 6481.61816| 0:00:43s epoch 62 | loss: 6211.16309| 0:00:44s epoch 63 | loss: 5893.41602| 0:00:44s epoch 64 | loss: 5768.60596| 0:00:45s epoch 65 | loss: 5505.07666| 0:00:46s epoch 66 | loss: 5340.00146| 0:00:46s epoch 67 | loss: 5207.09424| 0:00:47s epoch 68 | loss: 4787.2002| 0:00:48s epoch 69 | loss: 4758.07373| 0:00:48s epoch 70 | loss: 4562.87109| 0:00:49s epoch 71 | loss: 4998.98828| 0:00:50s [CV] mgamma=1.3745401188473625, m__mask_type=sparsemax, mmomentum=0.019673133446482256, mn_a=15, m__n_d=28, mn_independent=3, m__n_shared=2, m__n_steps=1, score=(train=nan, test=nan), total= 50.7sC:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: Traceback (most recent call last): File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_validation.py", line 531, in _fit_and_score estimator.fit(X_train, y_train, fit_params) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit self._final_estimator.fit(Xt, y, fit_params_last_step) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit self._train_epoch(train_dataloader) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch batch_logs = self._train_batch(X, y) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch output, M_loss = self.network(X) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, kwargs) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward return self.tabnet(x) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, *kwargs) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 144, in forward M = self.att_transformers[step](prior, att) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(input, kwargs) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 322, in forward x = self.selector(x) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 91, in forward return sparsemax(input, self.dim) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 43, in forward tau, supp_size = SparsemaxFunction._threshold_and_support(input, dim=dim) File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 76, in _threshold_and_support tau = input_cumsum.gather(dim, support_size - 1) RuntimeError: index -1 is out of bounds for dimension 1 with size 2423

Optimox commented 3 years ago

this error tau = input_cumsum.gather(dim, support_size - 1) RuntimeError: index -1 is out of bounds for dimension 1 with size 2423 always appeared when you have nan or inf values in your training data. Please check the numpy array you are giving after sparse to dense.

It can also be due to gradient fading or explosion. You can add a clip_value which may help or change the learning rate.

ThomasWolf0701 commented 3 years ago

"It can also be due to gradient fading or explosion"

Checked again and the values i gave the model to fit the featureMatrix against were off. So probably the model might not able to fit anything sensible. Fixed this and the error still occurs. Will have a look at the other suggestions.

And attached the updated data. bugFix2.zip

ThomasWolf0701 commented 3 years ago

I checked for infs, made sure that it´s float32 compatible and NaN are being imputed by the pipeline. Also i made sure that the values are correct. So this is not the probblem.

As some of the GridSearches work and some don´t it is either a problem of the settings, as the standard settings don´t work and some of the tried ones do. Or some cross validation folds are working an some do not.

Optimox commented 3 years ago

maybe entmax is working but sparsemax is creating this error, did you check that?

ThomasWolf0701 commented 3 years ago

maybe entmax is working but sparsemax is creating this error, did you check that?

I ran the gridsearch on the cpu using only entmax and for the cpu this fixed the problem, even when varying the other parameters. Will also check for the gpu (cuda).

ThomasWolf0701 commented 3 years ago

While it now works for the cpu with entmax, for GPU I still get the following error:

[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 21.3s finished Traceback (most recent call last):

File "", line 1, in tab_search.fit(np.array(featureMatrix.iloc[:,:].sparse.to_dense().values),np.array(values))

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f return f(**kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_search.py", line 765, in fit self.bestestimator.fit(X, y, **fit_params)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 159, in fit self._set_network()

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 474, in _set_network mask_type=self.mask_type,

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 272, in init self.to(self.device)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 612, in to return self._apply(convert)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply module._apply(fn)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply module._apply(fn)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 381, in _apply param_applied = fn(param)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 610, in convert return t.to(device, dtype if t.is_floating_point() else None, non_blocking)

RuntimeError: CUDA error: device-side assert triggered

ThomasWolf0701 commented 3 years ago

Also ran the cpu again and even with entmax the erro is not completely gone just seems to happen less often.

Optimox commented 3 years ago

@ThomasWolf0701 Could you share some examples of parameters which trigger the error and some that don't?

Optimox commented 3 years ago

Alright, I'm closing this as I could not reproduce it easily. Please feel free to reopen with more information when you have some to share.

Promisery commented 3 years ago

Are you using a RTX 30x0 GPU? I'm encountering the same problem. I changed the scale = *** line to init method of GLU, and got cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. Maybe it's caused by CUDNN. I'm currently using CUDA 11.2 and cuDNN 8.0.5

ThomasWolf0701 commented 3 years ago

Alright, I'm closing this as I could not reproduce it easily. Please feel free to reopen with more information when you have some to share.

Ok will try to run with different settings and can check which ones crash.

ThomasWolf0701 commented 3 years ago

Are you using a RTX 30x0 GPU? I'm encountering the same problem. I changed the scale = * line to init** method of GLU, and got cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. Maybe it's caused by CUDNN. I'm currently using CUDA 11.2 and cuDNN 8.0.5

Im am using a GTX 1070 and Cuda 10.2

Vishal-ib97 commented 3 years ago

I'm facing the same issue. Were you able to resolve this?

soonwoojung commented 3 years ago

I'm facing the same issue. // after reducing gradien explosion, it seems not occuring. i changed weight decay higher

bhavana3 commented 2 years ago

@Optimox I am facing the same issue to train a model on tabular data, from the call stack, its coming from the line tau_star = tau.gather(dim, support_size - 1) RuntimeError: Invalid index in gather at /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:657

I am trying to build a simple model with below hyper parameters. Data is tabular and most of them are float64.

model = TabNetClassifier(n_d=8, n_a=8, n_steps=1, gamma=1.3, lambda_sparse=0,optimizer_fn=torch.optim.Adam, optimizer_params=dict(lr=2e-2), mask_type='entmax', device_name='cpu', output_dim=1, scheduler_params=dict(milestones=[100,150], gamma=0.9), scheduler_fn=torch.optim.lr_scheduler.MultiStepLR)

'sparsemax'

        model.fit(X_train=X_train, y_train=y_train, eval_set=[(X_train, y_train), (X_val, y_val)], max_epochs=EPOCHS, patience=20, batch_size=1024, virtual_batch_size=128, eval_name=['train', 'valid'],eval_metric=['auc'], num_workers=0, weights=1, drop_last=False)

I am unsure what these params are, Could you please suggest/assist here

@soonwoojung

Optimox commented 2 years ago

@bhavana3 try setting a gradient clipping limit clip_value = 5 for example