dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

Errors thrown when using custom loss function (rmsle) #438

Closed noahlh closed 1 year ago

noahlh commented 1 year ago

Hello! I'm super excited to dig into using pytorch_tabnet, but I've been banging my head against a wall for the past 2 nights on this issue, so I'm putting out a call for assistance.

I've got everything set up properly and confirmed that my data has no missing values and no values outside the defined dimensions.

I can train properly using the default (MSELoss) loss function, but for my particular problem I need to use either mean squared log error or, ideally, root mean squared log error.

I've defined a custom loss function as follows:

import torch
import torch.nn as nn

def rmsle_loss(y_pred, y_true):
    return torch.sqrt(nn.functional.mse_loss(torch.log(y_pred + 1), torch.log(y_true + 1)))

And I'm applying it to the model by passing loss_fn=rmsle_loss to .fit().

However - when I do this, I'm getting these dreaded errors.

Using CPU: index -1 is out of bounds for dimension 1 with size 22

Using GPU: CUDA error: device-side assert triggered

Both of these are being thrown at line 94 in sparsemax.py:

tau = input_cumsum.gather(dim, support_size - 1)

Note this ONLY happens when I'm using the custom loss function. I can train the model just fine using the default loss function, but since that's not ideal for my domain, I really need the custom one. As I mentioned above, I've confirmed that there are no inf, NA, or out-of-bounds values in my training set.

Any thoughts? Help would be deeply appreciated!

Optimox commented 1 year ago

Can you share a minimal reproducible example?

With just random data as input, just to show when the error happens and what the sizes of everything are?
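
For example, something along these lines (purely illustrative, with random data standing in for your dataset; all sizes here are placeholders):

import numpy as np
import torch
import torch.nn as nn
from pytorch_tabnet.tab_model import TabNetRegressor

def rmsle_loss(y_pred, y_true):
    return torch.sqrt(nn.functional.mse_loss(torch.log(y_pred + 1), torch.log(y_true + 1)))

# random data standing in for the real dataset (1000 rows and 22 features are placeholders)
X = np.random.rand(1000, 22).astype(np.float32)
y = np.random.rand(1000, 1).astype(np.float32)

clf = TabNetRegressor()
clf.fit(X, y, loss_fn=rmsle_loss, max_epochs=5)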

noahlh commented 1 year ago

Thank you for the quick response, @Optimox! I just want to post a brief update that may or may not be helpful for others.

Shortly after I posted the question, I tried something based on a comment you wrote in the README about the rmsle metric, and I modified my custom loss function to clip values like so:

def rmsle_loss(y_pred, y_true):
    y_pred = torch.clamp(y_pred, min=0)  # clip negatives so log(y_pred + 1) stays defined
    return torch.sqrt(nn.functional.mse_loss(torch.log(y_pred + 1), torch.log(y_true + 1)))

That stopped the errors and got the model training (huzzah!). BUT - and I'm not sure if this is related to the clipping - I'm seeing a weird discrepancy between the reported loss and the train metrics. Here's sample output:

epoch 0  | loss: 1.86629 | train_rmsle: 2.766779899597168| valid_rmsle: 2.982069969177246|  0:00:24s
epoch 1  | loss: 0.96573 | train_rmsle: 1.4873700141906738| valid_rmsle: 1.3539700508117676|  0:00:49s
epoch 2  | loss: 0.76434 | train_rmsle: 0.5969700217247009| valid_rmsle: 0.8057399988174438|  0:01:13s
epoch 3  | loss: 0.6903  | train_rmsle: 0.42715999484062195| valid_rmsle: 0.5757799744606018|  0:01:38s
epoch 4  | loss: 0.65154 | train_rmsle: 0.3816699981689453| valid_rmsle: 0.514460027217865|  0:02:03s
epoch 5  | loss: 0.62432 | train_rmsle: 0.3643699884414673| valid_rmsle: 0.4750699996948242|  0:02:28s
epoch 6  | loss: 0.60392 | train_rmsle: 0.3346399962902069| valid_rmsle: 0.4924499988555908|  0:02:53s
epoch 7  | loss: 0.5908  | train_rmsle: 0.3143100142478943| valid_rmsle: 0.43748000264167786|  0:03:17s
epoch 8  | loss: 0.57508 | train_rmsle: 0.30393001437187195| valid_rmsle: 0.4202300012111664|  0:03:42s
epoch 9  | loss: 0.56603 | train_rmsle: 0.3111500144004822| valid_rmsle: 0.4397599995136261|  0:04:07s
epoch 10 | loss: 0.55719 | train_rmsle: 0.29249000549316406| valid_rmsle: 0.43369001150131226|  0:04:31s
epoch 11 | loss: 0.54947 | train_rmsle: 0.27636000514030457| valid_rmsle: 0.41561999917030334|  0:04:55s
epoch 12 | loss: 0.54189 | train_rmsle: 0.2677899897098541| valid_rmsle: 0.41488999128341675|  0:05:18s
epoch 13 | loss: 0.53559 | train_rmsle: 0.26249000430107117| valid_rmsle: 0.375900000333786|  0:05:42s
epoch 14 | loss: 0.53226 | train_rmsle: 0.2643899917602539| valid_rmsle: 0.4230700135231018|  0:06:06s
epoch 15 | loss: 0.52745 | train_rmsle: 0.2567000091075897| valid_rmsle: 0.39844998717308044|  0:06:30s

Any thoughts on why this might be happening? Is it just a fundamental difference in how the loss and the train_rmsle metric are calculated, or is my clipping causing an issue? Or something else?

Thanks again for the assistance and any thoughts.

noahlh commented 1 year ago

One quick update - I figured I should post the actual model creation code with the params I'm using, in case that's helpful:

clf = TabNetRegressor(
  cat_idxs=cat_idxs,
  cat_dims=cat_dims,
  cat_emb_dim=1,
  n_d=32,
  n_a=32,
  n_steps=3,
  device_name='cuda',
)

clf.fit(
  x_train.values,
  y_train.values,
  eval_set=[(x_train.values, y_train.values), (x_test.values, y_test.values)],
  eval_name=['train', 'valid'],
  eval_metric=['rmsle'],
  max_epochs=200,
  patience=20,
  batch_size=16384,
  virtual_batch_size=1024,
  loss_fn=rmsle_loss,
  num_workers=20
)
Optimox commented 1 year ago

@noahlh good to know that you managed to make things work.

About the discrepancy between the loss and the train metric, I see several reasons: the reported loss is an average over the batches of the epoch (computed while the weights are still being updated), and it also includes the sparsity regularization term. But more importantly:

I had a look at how RMSLE is defined in the code, and it calls mean_squared_log_error without specifying squared=False.

So in the end this is a bug in the repo, which actually computed MSLE instead of RMSLE. You can define your own working RMSLE like this and everything should be fine:

import numpy as np
from sklearn.metrics import mean_squared_log_error
from pytorch_tabnet.metrics import Metric

class my_RMSLE(Metric):
    """
    Root mean squared logarithmic error regression loss.
    Scikit-learn implementation:
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html
    Note: in order to avoid errors, negative predictions are clipped to 0.
    This means that you should clip negative predictions manually after calling predict.
    """
    def __init__(self):
        self._name = "working_rmsle"
        self._maximize = False

    def __call__(self, y_true, y_score):
        """
        Compute RMSLE of predictions.

        Parameters
        ----------
        y_true : np.ndarray
            Target matrix or vector
        y_score : np.ndarray
            Score matrix or vector

        Returns
        -------
        float
            RMSLE of predictions vs targets.
        """
        y_score = np.clip(y_score, a_min=0, a_max=None)
        return mean_squared_log_error(y_true, y_score, squared=False)

And then you can pass eval_metric=['working_rmsle'] to .fit().
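
For reference, wired into a fit call like yours it would look something like this (a sketch reusing the arguments from your earlier comment):

clf.fit(
  x_train.values,
  y_train.values,
  eval_set=[(x_train.values, y_train.values), (x_test.values, y_test.values)],
  eval_name=['train', 'valid'],
  eval_metric=['working_rmsle'],  # the _name of the custom Metric subclass defined above
  loss_fn=rmsle_loss,
  max_epochs=200,
  patience=20,
)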

Let me know if this solves your problem.

noahlh commented 1 year ago

Oh wow, you are legendary, @Optimox! Thanks for uncovering that, and I'm glad I was (indirectly) able to help fix a bug :)

I'm retraining now and there's still a slight discrepancy (see below), but it's now within range and likely due to the reasons you mentioned, so I think we're all good. Many, many thanks.

epoch 0  | loss: 2.37828 | train_rmsle: 1.7417000532150269| valid_rmsle: 1.7888699769973755|  0:00:22s
epoch 1  | loss: 1.3471  | train_rmsle: 2.0144999027252197| valid_rmsle: 2.071079969406128|  0:00:45s
epoch 2  | loss: 1.01037 | train_rmsle: 2.018090009689331| valid_rmsle: 2.063570022583008|  0:01:06s
epoch 3  | loss: 0.83754 | train_rmsle: 1.5472899675369263| valid_rmsle: 1.5472899675369263|  0:01:28s
epoch 4  | loss: 0.76075 | train_rmsle: 0.9113900065422058| valid_rmsle: 0.9303600192070007|  0:01:49s
epoch 5  | loss: 0.71234 | train_rmsle: 0.7181299924850464| valid_rmsle: 0.7953600287437439|  0:02:12s
epoch 6  | loss: 0.67979 | train_rmsle: 0.6658599972724915| valid_rmsle: 0.7813699841499329|  0:02:34s
epoch 7  | loss: 0.65395 | train_rmsle: 0.6251800060272217| valid_rmsle: 0.7234600186347961|  0:02:56s
epoch 8  | loss: 0.63447 | train_rmsle: 0.6097800135612488| valid_rmsle: 0.704200029373169|  0:03:18s
epoch 9  | loss: 0.62041 | train_rmsle: 0.5897899866104126| valid_rmsle: 0.7026200294494629|  0:03:39s
epoch 10 | loss: 0.60307 | train_rmsle: 0.5744100213050842| valid_rmsle: 0.6758300065994263|  0:04:01s
epoch 11 | loss: 0.59601 | train_rmsle: 0.5818799734115601| valid_rmsle: 0.6536700129508972|  0:04:23s
epoch 12 | loss: 0.58429 | train_rmsle: 0.560479998588562| valid_rmsle: 0.6636599898338318|  0:04:45s
epoch 13 | loss: 0.5752  | train_rmsle: 0.5513899922370911| valid_rmsle: 0.6779299974441528|  0:05:08s
epoch 14 | loss: 0.56832 | train_rmsle: 0.5371400117874146| valid_rmsle: 0.6313999891281128|  0:05:29s
epoch 15 | loss: 0.5622  | train_rmsle: 0.5362799763679504| valid_rmsle: 0.6614099740982056|  0:05:51s
Optimox commented 1 year ago

Thanks for pointing out the bug, glad that you can now properly train your tabnet model.

Just out of curiosity, have you been able to benchmark tabnet with other models? How does it compare for your problem?

Optimox commented 1 year ago

I've just read this thread again and I want to explain why torch.clamp solves your problem, and why you need to be careful when making predictions with the model.

The error you had seems to be due to the fact that 1 + y_pred < 0. This happens because the regressor must be able to predict both positive and negative values (some regression problems require it).

If you only want positive outputs, you can change the final activation function (which is the identity for the regressor). Using torch.clamp is the same as using a ReLU activation, so this solved your problem.

However, your model will then train knowing that predicting anything below 0 means 0. But clf.predict returns the output of the model without the final ReLU activation, so you might still end up with negative predictions. You'll need to be careful when doing inference and always apply the same clamp to your predictions.
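
Concretely, something like this at inference time (a sketch; clf and x_test as in your earlier snippets):

import numpy as np

preds = clf.predict(x_test.values)           # raw model outputs, may contain negatives
preds = np.clip(preds, a_min=0, a_max=None)  # re-apply the clamp the model was trained with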

noahlh commented 1 year ago

> I've just read this thread again and I want to explain why torch.clamp solves your problem, and why you need to be careful when making predictions with the model.
>
> The error you had seems to be due to the fact that 1 + y_pred < 0. This happens because the regressor must be able to predict both positive and negative values (some regression problems require it).
>
> If you only want positive outputs, you can change the final activation function (which is the identity for the regressor). Using torch.clamp is the same as using a ReLU activation, so this solved your problem.
>
> However, your model will then train knowing that predicting anything below 0 means 0. But clf.predict returns the output of the model without the final ReLU activation, so you might still end up with negative predictions. You'll need to be careful when doing inference and always apply the same clamp to your predictions.

Thank you for the heads-up! So this is a bit curious to me - maybe you have some thoughts...

(BTW: Disclaimer - I'm very much a "using ML to solve a tactical business problem" person, not a data scientist (yet) so please excuse any missing/incorrect terminology or lack of understanding of the theory here)

I'm not sure why the model would be making any predictions where 1 + y_pred < 0.

The problem space I'm working in is quite similar to the Blue Book for Bulldozers Kaggle competition -- I'm using historical pricing records of multi-attribute assets to predict current pricing. Pricing varies pretty dramatically, so I'm assuming RMSLE is the best loss function here because outliers are important.

None of the training data has negative pricing (the lowest is $1), nor should any of the outputs. So the model treating anything below 0 as 0 is indeed correct.

Any thoughts?

noahlh commented 1 year ago

> Thanks for pointing out the bug, glad that you can now properly train your tabnet model.
>
> Just out of curiosity, have you been able to benchmark tabnet with other models? How does it compare for your problem?

Thank you for asking. I'm extremely eager to experiment with TabNet - it seems like it should perform excellently for my problem space. I'm currently in "chasing the dragon" mode -- the first version of the model I'm using in production was trained using Google Vertex AI and I ended up with an RMSLE of 0.488. Google is annoyingly opaque about what model they used - which is to say they tell you nothing but the evaluation metrics - but it's working admirably in production.

The main thing I'm benchmarking against in local development is LightGBM -- I haven't put that model into production yet, but I'm seeing an RMSLE of around 0.42, so I'm hopeful that will be an improvement.

So far I have yet to get TabNet properly hyperparameter-tuned, but my initial training is giving me around 0.50 RMSLE. So it's within striking distance! I had previously played around with a few other implementations of TabNet (one via Ludwig and one via the fastai wrapper), but I wanted to be a bit closer to the metal so I could understand the pipeline better, which is why I'm now working directly with pytorch-tabnet. Thanks again for this - it's exciting to dig deep into it.

Optimox commented 1 year ago

Thanks for the detailed information.

> I'm not sure why the model would be making any predictions where 1 + y_pred < 0.

The model should not predict negative values if the training data is only positive, and that will probably be the case once training is finished. However, at the start the model weights are randomly initialized, so it's very likely that negative values will occur. Even after a few epochs, if the model needs to reach values as high as 10K, it's hard to implicitly guarantee that no input will yield a negative score. So as long as you don't explicitly prevent the model from making negative predictions, it can always happen.
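
To make that concrete, here is a quick sketch of what happens inside your original loss when a prediction dips below -1:

import torch
import torch.nn as nn

y_pred = torch.tensor([[-1.5]])  # plausible output from a randomly initialized model
y_true = torch.tensor([[10.0]])
loss = torch.sqrt(nn.functional.mse_loss(torch.log(y_pred + 1), torch.log(y_true + 1)))
print(loss)  # tensor(nan): log of a negative number, and the NaN then corrupts training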

Good luck with tuning your model!

SergeySakharovskiy commented 1 year ago

def rmsle_loss(y_pred, y_true):
    y_pred = torch.clamp(y_pred, min=0)
    return torch.sqrt(nn.functional.mse_loss(torch.log(y_pred + 1), torch.log(y_true + 1)))

Hi @noahlh,

Let me suggest torch.log1p, which is a more numerically stable way to compute torch.log(y_pred + 1): https://pytorch.org/docs/stable/generated/torch.log1p.html
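
With that change (and keeping the clamp from above), the loss would read, as a sketch:

def rmsle_loss(y_pred, y_true):
    y_pred = torch.clamp(y_pred, min=0)  # keep predictions non-negative before the log
    return torch.sqrt(nn.functional.mse_loss(torch.log1p(y_pred), torch.log1p(y_true)))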