ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Returning no positives and warning "invalid value encountered in double_scalars" #373

Closed: jamesdbaker closed this issue 4 years ago

jamesdbaker commented 4 years ago

Describe the bug

On some models, I get the following warning and then an evaluation with 0 true positives and 0 false positives. If I use a smaller version of the same model, I do get true and false positives, so I assume this is something to do with the larger models. It has happened on a couple of different models.

The warning I'm getting is:

/home/ubuntu/miniconda3/envs/transformers/lib/python3.8/site-packages/sklearn/metrics/_classification.py:846: RuntimeWarning: invalid value encountered in double_scalars
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)

The full output is:

INFO:filelock:Lock 140299509231568 acquired on /home/ubuntu/.cache/torch/transformers/a1cbd52b6a24c283740550c3bf4d5ed26697a73a6d1d332362721c447fe43351.b692380ce5401c8b657db55b9c15fbd5dd061db5431e0f3c97ef03bef2f7e8e6.lock
Downloading: 100%|███████████████████████████████████████████████████| 685/685 [00:00<00:00, 696kB/s]
INFO:filelock:Lock 140299509231568 released on /home/ubuntu/.cache/torch/transformers/a1cbd52b6a24c283740550c3bf4d5ed26697a73a6d1d332362721c447fe43351.b692380ce5401c8b657db55b9c15fbd5dd061db5431e0f3c97ef03bef2f7e8e6.lock
INFO:filelock:Lock 140299509230224 acquired on /home/ubuntu/.cache/torch/transformers/17d7302f798a70efce576b97ae3fa97fa4a8c9b9c5c78f5ee30b8408f6e2cb43.6d16c2a53c86e103e95956fac9f7e14c3c74dccf63ed4b635e3de273fbdaeb9f.lock
Downloading: 100%|████████████████████████████████████████████████| 236M/236M [00:05<00:00, 40.9MB/s]
INFO:filelock:Lock 140299509230224 released on /home/ubuntu/.cache/torch/transformers/17d7302f798a70efce576b97ae3fa97fa4a8c9b9c5c78f5ee30b8408f6e2cb43.6d16c2a53c86e103e95956fac9f7e14c3c74dccf63ed4b635e3de273fbdaeb9f.lock
INFO:filelock:Lock 140299507579152 acquired on /home/ubuntu/.cache/torch/transformers/02112eba687f794948810d2215028e9a0e77585b966ac59854a8d73e2d344d0b.c81d4deb77aec08ce575b7a39a989a79dd54f321bfb82c2b54dd35f52f8182cf.lock
Downloading: 100%|████████████████████████████████████████████████| 760k/760k [00:00<00:00, 1.98MB/s]
INFO:filelock:Lock 140299507579152 released on /home/ubuntu/.cache/torch/transformers/02112eba687f794948810d2215028e9a0e77585b966ac59854a8d73e2d344d0b.c81d4deb77aec08ce575b7a39a989a79dd54f321bfb82c2b54dd35f52f8182cf.lock
INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.
100%|█████████████████████████████████████████████████████████████| 536/536 [00:01<00:00, 442.32it/s]
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Running loss: 0.891888Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
/home/ubuntu/miniconda3/envs/transformers/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:111: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Seems like `optimizer.step()` has been overridden after learning rate scheduler "
Running loss: 0.903847Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Running loss: 0.785572/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of add_ is deprecated:        | 2/67 [00:01<00:53,  1.22it/s]
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha)
Running loss: 0.411454/home/ubuntu/miniconda3/envs/transformers/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:231: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:42<00:00,  1.57it/s]
/home/ubuntu/miniconda3/envs/transformers/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:42<00:00,  1.57it/s]
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:43<00:00,  1.56it/s]
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:43<00:00,  1.54it/s]
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:44<00:00,  1.51it/s]
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:44<00:00,  1.50it/s]
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:45<00:00,  1.49it/s]
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:45<00:00,  1.48it/s]
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:45<00:00,  1.48it/s]
Current iteration: 100%|█████████████████████████████████████████████| 67/67 [00:45<00:00,  1.48it/s]
Epoch: 100%|█████████████████████████████████████████████████████████| 10/10 [07:30<00:00, 45.03s/it]
INFO:simpletransformers.classification.classification_model: Training of albert model complete. Saved to outputs/.
INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.
100%|█████████████████████████████████████████████████████████████| 135/135 [00:00<00:00, 324.53it/s]
100%|████████████████████████████████████████████████████████████████| 17/17 [00:04<00:00,  4.03it/s]
/home/ubuntu/miniconda3/envs/transformers/lib/python3.8/site-packages/sklearn/metrics/_classification.py:846: RuntimeWarning: invalid value encountered in double_scalars
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 0, 'tn': 98, 'fp': 0, 'fn': 37, 'eval_loss': 0.585353812750648}
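
For context on where the warning comes from: sklearn computes MCC as cov_ytyp / sqrt(cov_ytyt * cov_ypyp), and if the model predicts only one class the denominator is zero. A minimal sketch (not the project's code) that mirrors the numbers in the evaluation above:

import numpy as np
from sklearn.metrics import matthews_corrcoef

# Mirrors the eval result above: 98 negatives (tn) and 37 positives (fn),
# with the model predicting the negative class for every example.
y_true = np.array([0] * 98 + [1] * 37)
y_pred = np.zeros_like(y_true)

# On the sklearn version in the log this emits the same
# "invalid value encountered in double_scalars" RuntimeWarning and returns 0.0,
# which is why mcc, tp, and fp are all zero in the evaluation output.
print(matthews_corrcoef(y_true, y_pred))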

To Reproduce

I'm using the following code (sorry, the training data isn't shareable):

from simpletransformers.classification import ClassificationModel
import sqlite3
import pandas as pd
import numpy as np
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Load data
con = sqlite3.connect("training.db")
df = pd.read_sql_query("SELECT content, class from data", con)
con.close()

# Split into train and test
df.columns = ["text", "labels"]
train_df, test_df = np.split(df.sample(frac=1), [int(.8*len(df))])

# Optional model configuration
model_args = {
    "num_train_epochs": 10,
}

# Create a ClassificationModel
model = ClassificationModel(
    "xlm", "xlm-roberta-large", use_cuda=True, args=model_args
)

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_df)
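
A quick way to check whether the predictions have collapsed to a single class (a hedged sketch; it assumes model_outputs holds one row of raw scores per test example, which is what eval_model returns for a standard classification model):

# Count predictions per class; if everything lands in class 0,
# that would explain tp=0 and fp=0 in the evaluation result.
preds = np.argmax(model_outputs, axis=1)
print(np.bincount(preds, minlength=2))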

Expected behavior

I would expect there to be non-zero true/false positives, given that the smaller models do return these. Therefore I'm assuming this is a bug.


ThilinaRajapakse commented 4 years ago

This is most likely due to the way Transformer models work. There's a tendency for the model to "break" when it's overtrained, i.e. it starts predicting a single class for everything. I'm not sure why this happens, but it seems more common with larger models (and higher epoch counts), so it's probably some form of overfitting. If you plot your losses, you will be able to see whether this is happening.

You can easily plot the training progress by specifying wandb_project in your model_args.
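
For example, a minimal args sketch (the project name below is just a placeholder for your own Weights & Biases project):

model_args = {
    "num_train_epochs": 10,
    "wandb_project": "simpletransformers-debug",  # placeholder name; training/eval losses are logged here
}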

#234 discusses this problem as well.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.