facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

Why is the DLRM model only trained for one Epoch? #213

Closed · swordfate closed this issue 2 years ago

swordfate commented 2 years ago

The accuracy curves shown in README.md are for only one epoch on the Kaggle or Terabyte dataset. However, as we know, deep learning models for NLP are usually trained for multiple epochs, often with random data augmentation, to achieve better accuracy.

So, my question is: why is the DLRM model trained for only one epoch? Is it necessary to train for multiple epochs on the same data?

Looking forward to your reply, thank you very much :)

mnaumovfb commented 2 years ago

You are correct about the training of CV and NLP models with multiple epochs. However, the training of DLRMs with multiple epochs is an open problem. The code supports it, but you will see that while the training accuracy continues to drop, the testing accuracy will increase in the subsequent epochs.
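For reference, the benchmark script exposes this through its `--nepochs` flag (default 1). A hypothetical multi-epoch run on the Kaggle dataset might look like the following sketch; the dataset path is a placeholder, and the exact flag set may differ in your checkout of `dlrm_s_pytorch.py`:

```bash
# Sketch of a multi-epoch training run; paths and hyperparameters are placeholders.
python dlrm_s_pytorch.py \
    --data-generation=dataset \
    --data-set=kaggle \
    --raw-data-file=./input/train.txt \
    --loss-function=bce \
    --round-targets=True \
    --learning-rate=0.1 \
    --mini-batch-size=128 \
    --nepochs=5 \
    --print-freq=1024
```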

swordfate commented 2 years ago

> You are correct about the training of CV and NLP models with multiple epochs. However, the training of DLRMs with multiple epochs is an open problem. The code supports it, but you will see that while the training accuracy continues to drop, the testing accuracy will increase in the subsequent epochs.

Do you mean that the training loss continues to drop while the testing loss increases in the subsequent epochs? If so, can I take this to mean that training a recommendation model for multiple epochs overfits the training dataset?

mnaumovfb commented 2 years ago

Yes, you can think of it in terms of loss. It's possible that this is related to overfitting, but it's not clear exactly what happens. You can give it a try yourself: just train for multiple epochs and observe the effect, as in the sketch below.
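A minimal, self-contained PyTorch sketch of that experiment (synthetic data and a toy MLP stand in for a real DLRM and the Criteo dataset; only the multi-epoch loop and the per-epoch train/test loss comparison mirror the suggestion above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for a CTR dataset: 13 dense features, binary click labels.
X_train = torch.randn(8192, 13)
y_train = (X_train.sum(dim=1, keepdim=True) + torch.randn(8192, 1) > 0).float()
X_test = torch.randn(2048, 13)
y_test = (X_test.sum(dim=1, keepdim=True) + torch.randn(2048, 1) > 0).float()

# Toy MLP standing in for DLRM's bottom/top MLPs (no embedding tables here).
model = nn.Sequential(nn.Linear(13, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

nepochs = 5  # analogous to --nepochs in dlrm_s_pytorch.py
for epoch in range(nepochs):
    model.train()
    for i in range(0, len(X_train), 128):
        xb, yb = X_train[i:i + 128], y_train[i:i + 128]
        opt.zero_grad()
        loss = bce(model(xb), yb)
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        train_loss = bce(model(X_train), y_train).item()
        test_loss = bce(model(X_test), y_test).item()
    # If the test loss starts rising while the train loss keeps falling,
    # you are seeing the overfitting effect discussed above.
    print(f"epoch {epoch}: train_loss={train_loss:.4f} test_loss={test_loss:.4f}")
```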

tim5go commented 2 years ago

Hi @mnaumovfb

You mentioned:

> You are correct about the training of CV and NLP models with multiple epochs. However, the training of DLRMs with multiple epochs is an open problem. The code supports it, but you will see that while the training accuracy continues to drop, the testing accuracy will increase in the subsequent epochs.

It's weird that the training accuracy continues to drop while the testing accuracy increases in the subsequent epochs. Do you mean the opposite (i.e. overfitting)?