dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

eval_metric=['rmsle'] calculates msle but not RMSLE. #470

Closed SergeySakharovskiy closed 1 year ago

SergeySakharovskiy commented 1 year ago

TabNetRegressor calculates MSLE rather than RMSLE when eval_metric is set to ['rmsle']

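from sklearn import model_selection
from pytorch_tabnet.tab_model import TabNetRegressor

# X_train, y_train (pandas objects) and config are defined earlier in the notebook.
cv = model_selection.KFold(n_splits=config['FOLDS'], shuffle=True, random_state=config['SEED'])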
for fold, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train), start=1):
    if fold == 2: break
    # Split the dataset according to the fold indexes.
    X_fit = X_train.iloc[fit_idx].values
    X_val = X_train.iloc[val_idx].values
    y_fit = y_train.iloc[fit_idx].values.reshape(-1, 1)
    y_val = y_train.iloc[val_idx].values.reshape(-1, 1)

    clf = TabNetRegressor()
    clf.fit(
        X_train=X_fit, y_train=y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_name=['train', 'valid'],
        eval_metric=['rmsle'],
        max_epochs=30,
        patience=50,
        batch_size=8192, virtual_batch_size=128,
        num_workers=24,
        drop_last=False,
    ) 

epoch 28 | loss: 844.81883| train_rmsle: 0.09658 | valid_rmsle: 0.09734 | 0:03:29s
epoch 29 | loss: 843.59653| train_rmsle: 0.09634 | valid_rmsle: 0.09707 | 0:03:36s

Expected behavior

It seems this line of code https://github.com/dreamquark-ai/tabnet/blob/fc59ea61139228440d2063ead9db42f656d84ff7/pytorch_tabnet/metrics.py#L403 should have squared=False.
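To make the difference concrete, here is a minimal NumPy sketch of the two quantities. The function names `msle` and `rmsle` and the sample arrays are illustrative only, not part of pytorch-tabnet or scikit-learn; the formula is the standard log1p-based definition that scikit-learn documents for `mean_squared_log_error`.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared logarithmic error: mean((log1p(y_true) - log1p(y_pred))**2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error: the square root of MSLE."""
    return float(np.sqrt(msle(y_true, y_pred)))

# Illustrative data only (non-negative targets and predictions).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

print("MSLE :", msle(y_true, y_pred))
print("RMSLE:", rmsle(y_true, y_pred))
```

For errors below 1, the square root is larger than the squared value, which is why a metric labelled "rmsle" that actually reports MSLE looks better than it should.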

It gives the correct score when the square root of clf.best_cost is taken:

import numpy as np
from sklearn import metrics
preds = clf.predict(X_train.values)

val_score = metrics.mean_squared_log_error(y_train, y_pred=preds, squared=False)

print(f"TabNet VALID SCORE RMSLE: {clf.best_cost}")
print(f"SKLEARN VALID SCORE RMSLE: {val_score}")
print(f'TabNet best score + numpy square root  RMSLE: {np.sqrt([clf.best_cost])}')

TabNet VALID SCORE RMSLE: 0.09658693470083708
SKLEARN VALID SCORE RMSLE: 0.309868387923609
TabNet best score + numpy square root RMSLE: [0.31078439]
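The arithmetic behind these three numbers can be checked directly. The sketch below only reuses the values printed above; the variable names are made up for illustration. The remaining small gap is plausibly because the scikit-learn score is computed on all of X_train while clf.best_cost comes from an evaluation set during training, but that is an assumption, not something the output proves.

```python
import math

# Values copied from the output above.
tabnet_reported = 0.09658693470083708   # clf.best_cost, labelled "rmsle"
sklearn_rmsle = 0.309868387923609       # sklearn with squared=False

# If the reported value is really MSLE, its square root should be close to RMSLE.
root = math.sqrt(tabnet_reported)
gap = abs(root - sklearn_rmsle)

print(f"sqrt(reported) = {root:.8f}")
print(f"gap to sklearn = {gap:.6f}")
```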

Optimox commented 1 year ago

Yes, this is a known bug. Until the official fix is released, please see issue #438 for a workaround.

SergeySakharovskiy commented 1 year ago

@Optimox thank you, that works. I will define the custom rmsle as you suggested:

import numpy as np
from pytorch_tabnet.metrics import Metric
from sklearn.metrics import mean_squared_log_error

class my_RMSLE(Metric):
    """
    Root mean squared logarithmic error regression loss.
    Scikit-implementation:
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html
    Note: In order to avoid error, negative predictions are clipped to 0.
    This means that you should clip negative predictions manually after calling predict.
    """
    def __init__(self):
        self._name = "working_rmsle"
        self._maximize = False

    def __call__(self, y_true, y_score):
        """
        Compute RMSLE of predictions.
        Parameters
        ----------
        y_true : np.ndarray
            Target matrix or vector
        y_score : np.ndarray
            Score matrix or vector
        Returns
        -------
        float
            RMSLE of predictions vs targets.
        """
        y_score = np.clip(y_score, a_min=0, a_max=None)
        return mean_squared_log_error(y_true, y_score, squared=False)
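The clipping step in `__call__` matters because the logarithmic error is undefined for predictions below -1, and scikit-learn's `mean_squared_log_error` rejects negative inputs. A small NumPy-only sketch of that behaviour follows; the arrays are invented for illustration and the RMSLE is computed by hand rather than through the class above.

```python
import numpy as np

# Illustrative data: the first prediction is below -1.
y_true = np.array([1.0, 2.0, 3.0])
y_score = np.array([-2.0, 2.0, 3.5])

# Without clipping, log1p(-2) is not a real number, so NumPy returns nan.
with np.errstate(invalid="ignore"):
    raw = np.log1p(y_score)

# Clipping negative predictions to 0, as the custom metric does.
clipped = np.clip(y_score, a_min=0, a_max=None)
rmsle_clipped = float(
    np.sqrt(np.mean((np.log1p(y_true) - np.log1p(clipped)) ** 2))
)

print("log1p without clipping:", raw)
print("clipped predictions   :", clipped)
print("RMSLE after clipping  :", rmsle_clipped)
```

Note that the metric only clips internally; the values returned by `predict` are unchanged, which is why the docstring recommends clipping them manually as well.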