Open silpara opened 2 years ago
I looked at the weights of the checkpoint and they are all nan for models saved after loss drops to zero.
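For anyone hitting the same symptom, a quick way to confirm is to scan the checkpoint's weights for NaN/Inf values. A minimal stdlib sketch (the `state_dict` here is a stand-in built from plain floats; with a real PyTorch checkpoint you'd use `torch.load` and `torch.isnan`/`torch.isinf` instead):

```python
import math

def find_bad_tensors(state_dict):
    """Return names of parameters containing NaN or Inf values.

    state_dict maps parameter names to flat lists of floats; with a
    real checkpoint you would load it via torch.load(...) and flatten
    each tensor with .flatten().tolist().
    """
    bad = []
    for name, values in state_dict.items():
        if any(math.isnan(v) or math.isinf(v) for v in values):
            bad.append(name)
    return bad

# Toy checkpoint: one healthy layer, one that went to NaN after the loss hit 0.
checkpoint = {
    "embedding.weight": [0.12, -0.53, 0.07],
    "output_layer.weight": [float("nan"), float("nan")],
}
print(find_bad_tensors(checkpoint))  # ['output_layer.weight']
```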
@silpara can you please remove fp16=True and test again?
@rnyak Sure, will try it out. Do you suspect it's an exploding/vanishing gradient problem?
@silpara I don't know yet; first we need to know if it trains fine once you remove fp16=True. If yes, then we need to investigate what's wrong with training in half precision. If you still get a loss of 0 both with and without fp16, then change your learning rate, and make sure you don't have any leaky features in your dataset.
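A common failure mode in half precision is the narrow range of float16: the largest finite value is 65504, precision is only ~3 decimal digits, and magnitudes below ~6e-8 flush to zero, so a large logit or an unscaled loss can overflow to inf, after which ordinary arithmetic produces NaN that poisons every weight. A small stdlib-only illustration (independent of Transformers4Rec), round-tripping floats through IEEE binary16 via `struct`'s `'e'` format:

```python
import math
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE binary16 ('e' struct format)."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Small gradients silently vanish: anything below ~6e-8 flushes to 0.
print(to_fp16(1e-8))   # 0.0
# Rounding error is everywhere: 0.1 is not representable exactly.
print(to_fp16(0.1))    # 0.0999755859375

# Once any intermediate overflows to inf (e.g. a loss exceeding 65504),
# ordinary arithmetic yields NaN, which then spreads through the weights:
overflowed = float("inf")
print(math.isnan(overflowed - overflowed))  # True: inf - inf is NaN
```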
Training with fp16=False does seem to work fine.
@silpara I am turning this into a bug ticket so that we can follow up on it.
Hi @rnyak, I'm training the t4rec model on custom data, but the loss stops decreasing after a few epochs and instead starts increasing. It started at 13.67, decreased to 6.43 over a few epochs, and then began climbing again. I'm not sure what can be done to improve the loss further. Here are my params:
params = {
'batch_size': 1024,
'lr': 0.0005,
'lr_scheduler': 'cosine',
'num_train_epochs': 1,
'using_test': True,
'using_type': False,
'bl_shuffle': True,
'masking': 'mlm',
'd_model': 256,
'n_head': 32,
'n_layer': 3,
'proj_num': 1,
'act_mlp': 'None',
'item_correction': False,
'neg_factor': 4,
'label_smoothing': 0.0,
'temperature': 1.5734215681668653,
'remove_false_neg': True,
'item_correction_factor': 0.04152252077012748,
'transformer_dropout': 0.05096800263401626,
'mlm_probability': 0.35044384745899415,
'top20': True,
'loss_types': True,
'loss_types_type': 'Simple',
'multi_task_emb': 0,
'mt_num_layers': 1,
'use_tanh': False,
'seq_len': 20,
'split': 0
}
Any suggestion would be very helpful. Thanks in advance!
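Since the loss turns upward mid-training, it may be worth checking what the 'cosine' schedule is actually doing to the learning rate at the step where things change. A minimal stdlib sketch of the usual linear-warmup + cosine-decay shape (the function name, warmup fraction, and step counts here are illustrative, not taken from this pipeline):

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-4, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(cosine_lr(0, total))     # 0.0       (warmup start)
print(cosine_lr(100, total))   # 0.0005    (peak = base_lr, matches 'lr' above)
print(cosine_lr(550, total))   # ~0.00025  (halfway through the decay)
print(cosine_lr(1000, total))  # 0.0       (fully decayed)
```

If `total_steps` is miscounted (e.g. set for 1 epoch while training runs longer), the schedule can bottom out early and the optimizer effectively stops or oscillates, which shows up as a loss plateau or rebound.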
The model training loss suddenly drops to 0 after over 1000 steps. I've tried iterating over different datasets as well but got the same behaviour.
Details
I am following the Transformers4Rec/examples/tutorial notebook to train a next-item click prediction model on my own dataset of item sequences.
The params I've changed are: learning_rate=0.01, fp16=True, per_device_train_batch_size=64, d_model=16; the rest are as in the notebook. The following are the logs for the first day of data.
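Given the earlier finding in this thread, the first thing to try is turning half precision off in the training arguments. A hedged sketch based on the tutorial's setup (argument names follow `transformers4rec.config.trainer.T4RecTrainingArguments`, which wraps HF `TrainingArguments`; the `output_dir` path is a placeholder):

```python
from transformers4rec.config.trainer import T4RecTrainingArguments

# Same settings as above, but with fp16 disabled to rule out
# half-precision overflow as the cause of the loss collapsing to 0.
training_args = T4RecTrainingArguments(
    output_dir="./tmp",              # placeholder path
    learning_rate=0.01,              # consider lowering this too (e.g. 1e-3)
    per_device_train_batch_size=64,
    fp16=False,                      # was True; disable half precision first
)
```

If training then behaves normally, that points at fp16 rather than the data or the learning rate.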