NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation that works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[BUG] Loss drops to 0 after a few thousand steps when using fp16=True #493

Open silpara opened 2 years ago

silpara commented 2 years ago

The model training loss suddenly drops to 0 after a little over 1000 steps. I've tried iterating over different datasets as well but got the same behaviour.

Details

I am following the notebook Transformers4Rec/examples/tutorial to train a next-item click prediction model on my own dataset of item sequences.

The params I've changed are learning_rate=0.01, fp16=True, per_device_train_batch_size=64, and d_model=16; the rest are as in the notebook. Following are the logs for the first day of data (a sketch of the corresponding training-argument setup is included after the logs).


```
{'loss': 14.3249, 'learning_rate': 0.009976389135451803, 'epoch': 0.01}
{'loss': 14.083, 'learning_rate': 0.009964554115628145, 'epoch': 0.01}
{'loss': 13.9319, 'learning_rate': 0.009953074146399196, 'epoch': 0.01}
{'loss': 13.8982, 'learning_rate': 0.009947452511982957, 'epoch': 0.02}
{'loss': 12.6002, 'learning_rate': 0.009938812947511687, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.009926977927688029, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.00991514290786437, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.009903307888040712, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.009891472868217054, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.009879637848393396, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.009867802828569739, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.00985596780874608, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009844132788922422, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009832297769098764, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009820462749275106, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.009808627729451447, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.00979679270962779, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.00978495768980413, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009773122669980473, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009761287650156814, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009749452630333156, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.009737617610509498, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.00972578259068584, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009713947570862181, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009702112551038523, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009690277531214864, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.009678442511391206, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.009666607491567548, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.00965477247174389, 'epoch': 0.11}
{'loss': 0.0, 'learning_rate': 0.009642937451920231, 'epoch': 0.11}
```
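
For reference, a minimal sketch of how these changes plug into the tutorial's argument setup, assuming the notebook's T4RecTrainingArguments-based configuration (output_dir is a placeholder; d_model=16 is a model-config change, not a training argument, so it is not shown here):

```python
from transformers4rec.config.trainer import T4RecTrainingArguments

# Sketch of the training arguments described above; other values follow the tutorial.
training_args = T4RecTrainingArguments(
    output_dir="./tmp",               # placeholder output path
    learning_rate=0.01,               # changed from the notebook default
    fp16=True,                        # mixed-precision setting under suspicion
    per_device_train_batch_size=64,   # changed from the notebook default
    max_sequence_length=20,           # as in the tutorial
    data_loader_engine="merlin",
    num_train_epochs=1,
    report_to=[],
    logging_steps=200,
)
```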

Additionally, I am using the Merlin container nvcr.io/nvidia/merlin/merlin-pytorch-training:22.05 for training.

Any suggestions on what might be the issue here?
silpara commented 2 years ago

I looked at the weights of the checkpoints, and they are all NaN for models saved after the loss drops to zero.
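
This is roughly how I checked (a sketch; the checkpoint path is a placeholder for whatever the Trainer wrote to output_dir, typically a pytorch_model.bin state dict):

```python
import torch

# Load the saved checkpoint's state dict on CPU (path is a placeholder)
state_dict = torch.load("output/checkpoint-2000/pytorch_model.bin", map_location="cpu")

# List every parameter tensor that contains NaNs
for name, tensor in state_dict.items():
    if torch.is_floating_point(tensor) and torch.isnan(tensor).any():
        print(name, "contains NaN values")
```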

rnyak commented 2 years ago

@silpara can you please remove fp16=True and test again?

silpara commented 2 years ago

@rnyak Sure, I will try it out. Do you suspect it's an exploding/vanishing gradient problem?

rnyak commented 2 years ago

@silpara I don't know yet; first we need to know whether it trains fine once you remove fp16=True. If yes, then we need to investigate what's wrong with half-precision training. If you still get a loss of 0 with or without fp16, then change your learning rate and make sure you don't have any leaky features in your datasets.
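
One way to catch the collapse automatically while you experiment (a sketch; it assumes the Transformers4Rec Trainer accepts standard Hugging Face TrainerCallbacks, since it subclasses transformers.Trainer):

```python
import math
from transformers import TrainerCallback

class DegenerateLossCallback(TrainerCallback):
    """Stop training as soon as the logged loss becomes 0 or NaN."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or math.isnan(loss)):
            print(f"Degenerate loss {loss} at step {state.global_step}; stopping training.")
            control.should_training_stop = True
        return control

# Register before calling trainer.train():
# trainer.add_callback(DegenerateLossCallback())
```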

silpara commented 2 years ago

Training with fp16=False does seem to work fine.

rnyak commented 2 years ago

@silpara I am turning this into a bug ticket so that we can follow up on it.

alan-ai-learner commented 1 year ago

Hi @rnyak, I'm training the t4rec model on custom data, but the loss stops decreasing after a few epochs and instead starts increasing. The loss started at 13.67, decreased to 6.43 after a few epochs of training, and then began to increase. I'm not sure what can be done to improve the loss further. Here are my params:

```python
params = {
    'batch_size': 1024,
    'lr': 0.0005,
    'lr_scheduler': 'cosine',
    'num_train_epochs': 1,
    'using_test': True,
    'using_type': False,
    'bl_shuffle': True,
    'masking': 'mlm',
    'd_model': 256,
    'n_head': 32,
    'n_layer': 3,
    'proj_num': 1,
    'act_mlp': 'None',
    'item_correction': False,
    'neg_factor': 4,
    'label_smoothing': 0.0,
    'temperature': 1.5734215681668653,
    'remove_false_neg': True,
    'item_correction_factor': 0.04152252077012748,
    'transformer_dropout': 0.05096800263401626,
    'mlm_probability': 0.35044384745899415,
    'top20': True,
    'loss_types': True,
    'loss_types_type': 'Simple',
    'multi_task_emb': 0,
    'mt_num_layers': 1,
    'use_tanh': False,
    'seq_len': 20,
    'split': 0
}
```
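
For context, only a few of these keys correspond to Transformers4Rec itself; the rest come from my own training script. A rough sketch of how the library-level ones would typically map onto the standard XLNet/MLM setup from the examples (assuming `schema` is the dataset's Merlin schema):

```python
import transformers4rec.torch as tr

max_sequence_length, d_model = 20, 256   # seq_len, d_model

# Input block with masked language modeling, as selected by masking='mlm'
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=max_sequence_length,
    masking="mlm",
    d_output=d_model,
)

# Transformer body sized by d_model, n_head, n_layer
transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=32, n_layer=3,
    total_seq_length=max_sequence_length,
)

model = transformer_config.to_torch_model(
    input_module, tr.NextItemPredictionTask(weight_tying=True)
)
```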

Any suggestion would be very helpful. Thanks in advance!