JohnGiorgi / DeCLUTR

The corresponding code from our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!
https://aclanthology.org/2021.acl-long.72/
Apache License 2.0

Validation data #190

Closed negar-mokhberian closed 3 years ago

negar-mokhberian commented 3 years ago

Is there a way to provide validation data for the training phase? How is validation data used in this method?

Thanks.

JohnGiorgi commented 3 years ago

Hmm, I believe you could pass validation data by adding a `validation_data_path` key to the config, but this is not something we have tried.
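
I have not tested this, but something along these lines should work with AllenNLP's Python API (the config and data paths below are just placeholders):

```python
from allennlp.commands.train import train_model
from allennlp.common.params import Params

# Load one of the DeCLUTR training configs (the file name here is only an example).
params = Params.from_file("training_config/declutr_small.jsonnet")

# Point AllenNLP at a held-out file of raw text; the trainer should then report
# a validation loss at the end of each epoch.
params["validation_data_path"] = "path/to/validation.txt"

train_model(params, serialization_dir="output/declutr_with_validation")
```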

I am not sure how useful this would be though. Assuming it works without any errors, it would spit out a validation loss at the end of each epoch. But the goal here isn't really to obtain the lowest validation loss. Our process is a pretraining step that is designed to induce good performance on downstream tasks, so we use SentEval to evaluate after training has finished.

negar-mokhberian commented 3 years ago

Thank you for your explanation. :)

JohnGiorgi commented 3 years ago

No problem, feel free to re-open or open a new issue if you have any questions :)

piegu commented 2 years ago

> I am not sure how useful this would be though. Assuming it works without any errors, it would spit out a validation loss at the end of each epoch. But the goal here isn't really to obtain the lowest validation loss.

Hi @JohnGiorgi, this is really the first time I have read that it is not useful/necessary to monitor the training of a DL model with a validation dataset, on the grounds that the key point is the fine-tuning of this model on a downstream task.

Without a validation dataset, you do not know how well your training is going. You do not know whether you need to train for 1 epoch or 20, which learning rate to use, and the same applies to all the other training hyperparameters.

The transfer learning method (two steps: pre-training a model and then fine-tuning it on a downstream task) is universally used in the ML/DL world. Each step is key, not only the fine-tuning one.

If I follow your idea, I would say: "I'm not going to train my DeCLUTR as the fine-tuning on a downstream task will do the job". ;-)

Strange, no?

JohnGiorgi commented 2 years ago

I think there is some confusion here.

> this is really the first time I have read that it is not useful/necessary to monitor the training of a DL model with a validation dataset

I am not saying that validation sets are somehow not useful. I am saying that I don't know how useful it would be to have a validation set to monitor the loss of the self-supervised objective(s) used during pre-training. We are following the canonical approach here of pre-training our model for some fixed number of steps (e.g. see BERT) using the self-supervised objective(s) on some unlabelled data. We don't really care how well it does on these objectives. We care about the quality of the representations learned during pre-training and how well they transfer to some downstream task(s) of interest.

> Without a validation dataset, you do not know how well your training is going. You do not know whether you need to train for 1 epoch or 20, which learning rate to use, and the same applies to all the other training hyperparameters.

To evaluate the effectiveness of pre-training, we use SentEval, a benchmark of tasks designed to measure the quality of learned sentence embeddings. We can take the average performance across the validation sets of the tasks in SentEval to determine how effective our pre-training was. We can also use this as a signal to tune the hyperparameters of the pre-training phase (like the number of epochs, the LR, etc.).
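
As a rough sketch of what I mean (the scores below are made up; in practice they would come from SentEval's dev-set results for each pre-training run):

```python
# Hypothetical SentEval dev-set scores for two pre-training runs
# (the run names and numbers are illustrative only).
senteval_dev_scores = {
    "lr_5e-5_1_epoch": {"MR": 84.1, "CR": 88.9, "SUBJ": 94.6, "MPQA": 88.0},
    "lr_1e-4_1_epoch": {"MR": 83.2, "CR": 88.1, "SUBJ": 94.2, "MPQA": 87.5},
}

# Average the dev-set performance per run; the run with the higher average is
# the better pre-training configuration, regardless of its pre-training loss.
for run, scores in senteval_dev_scores.items():
    avg = sum(scores.values()) / len(scores)
    print(f"{run}: {avg:.2f}")
```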

> If I follow your idea, I would say: "I'm not going to train my DeCLUTR as the fine-tuning on a downstream task will do the job". ;-)

I am not sure what this means.