JohnGiorgi / DeCLUTR

The corresponding code from our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!
https://aclanthology.org/2021.acl-long.72/
Apache License 2.0

How to use a validation dataset when training? #254

Closed piegu closed 2 years ago

piegu commented 2 years ago

Hi @JohnGiorgi,

In your notebook training.ipynb, you do not use a validation dataset. Why? Isn't this necessary when training an ML model, in order to check that the model is not overfitting?

I tried to use a validation dataset at the end of each training epoch with the following code, but the batch_loss (and therefore the validation loss) is always 0.00, as if there were no data in the validation file (which is not the case). Do you know why, and what I should correct in my code? Thanks.

overrides = (
    f"{{'train_data_path': '{train_data_path}', "
    f"'data_loader.batch_size': {train_batch_size}, "
    f"'trainer.num_gradient_accumulation_steps': {num_gradient_accumulation_steps}, "
    f"'data_loader.num_workers': {train_num_workers}, "

    # training examples / batch size. Not required, but gives us a more informative progress bar during training
    f"'data_loader.batches_per_epoch': {train_batches_per_epoch}, "   
    f"'trainer.optimizer.lr': {lr}, "
    f"'trainer.patience': {patience}, "
    f"'trainer.num_epochs': {num_epochs}, "

    f"'validation_data_path': '{validation_data_path}', "
    f"'validation_data_loader.batch_size': {validation_batch_size}, "
    f"'validation_data_loader.num_workers': {validation_num_workers}, "
    # validation examples / batch size. Not required, but gives us a more informative progress bar during evaluation
    f"'validation_data_loader.batches_per_epoch': {validation_batches_per_epoch}, }}" 
)

!allennlp train "declutr_small.jsonnet" \
    --serialization-dir "$output" \
    --overrides "$overrides" \
    --include-package "declutr" \
    -f
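
(As an aside, a less error-prone way to build the same overrides string is to construct a Python dict and serialize it with json.dumps, which the --overrides flag also accepts. This is only a sketch with hypothetical values; substitute your own paths and sizes.)

import json

# Hypothetical values for illustration only; substitute your own paths and sizes.
overrides = json.dumps({
    "train_data_path": "train.txt",
    "data_loader.batch_size": 4,
    "data_loader.num_workers": 1,
    "data_loader.batches_per_epoch": 2460,  # training examples / batch size
    "trainer.num_gradient_accumulation_steps": 4,
    "trainer.optimizer.lr": 5e-5,
    "trainer.patience": 2,
    "trainer.num_epochs": 3,
    "validation_data_path": "validation.txt",
    "validation_data_loader.batch_size": 4,
    "validation_data_loader.num_workers": 1,
    "validation_data_loader.batches_per_epoch": 2460,  # validation examples / batch size
})
# Pass this string to `allennlp train ... --overrides "$overrides"` as above.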
JohnGiorgi commented 2 years ago

Hi @piegu, see #190 for an explanation of why we don't use a validation set during pre-training. When we evaluate the model on SentEval (after pre-training), we do have validation sets for each of the tasks, which we used to guide model development and hyperparameter tuning.

I am really not sure why you are seeing a loss of 0. How many examples are in this file? Also, it might be helpful to see the values of all the variables you are using in the overrides here. Finally, note that, as per #190, we didn't use a validation set to measure the performance of the self-supervised pre-training objectives. Instead, we used average performance across the validation sets of SentEval, so this "feature" has not been tested and I doubt it works as expected.

piegu commented 2 years ago

How many examples are in this file? 9840

@JohnGiorgi: even if you think it is not useful here, can you test a validation dataset on your side with your notebook training.ipynb?

I think the contrastive training loss function is not being applied to the validation dataset at validation time (end of epoch).

Here is a modified code from your notebook training.ipynb:

validation_data_path = "validation.txt"

overrides = (
    f"{{'train_data_path': '{train_data_path}', "
    # lower the batch size to be able to train on Colab GPUs
    "'data_loader.batch_size': 2, "
    # training examples / batch size. Not required, but gives us a more informative progress bar during training
    "'data_loader.batches_per_epoch': 8912, "
    f"'validation_data_path': '{validation_data_path}',}}"
)

!allennlp train "declutr_small.jsonnet" \
    --serialization-dir "$output" \
    --overrides "$overrides" \
    --include-package "declutr" \
    -f
JohnGiorgi commented 2 years ago

Can I ask what your plans are for using the model and what you hope to achieve by measuring the contrastive loss on a validation set?

I'm not really sure how much work it would be to support using a validation set to measure the contrastive loss, but it's unclear to me how helpful that would be (#190), so I don't really plan to spend time on it.

piegu commented 2 years ago

Hi @JohnGiorgi

Can I ask what your plans are for using the model (...)?

I am participating in an academic project on document similarity. For this, we have built a pipeline that starts by retrieving the embeddings of all the sentences in a document (via a trained DeCLUTR, for example), then finds clusters of sentences from which a document vector can be created, in order to compute cosine similarity with the vectors of all other documents.
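
For context, here is a minimal sketch of the document-vector and cosine-similarity steps of that pipeline (plain NumPy, with random placeholder embeddings standing in for the encoder output; this is an illustration, not code from this project):

import numpy as np

# Placeholder sentence embeddings; in the real pipeline these come from the trained encoder.
doc_a_sentences = np.random.rand(12, 768)  # 12 sentences, 768-dim embeddings
doc_b_sentences = np.random.rand(7, 768)

def document_vector(sentence_embeddings):
    # Mean pooling here; the pipeline described above builds the vector from sentence clusters.
    return sentence_embeddings.mean(axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(document_vector(doc_a_sentences), document_vector(doc_b_sentences)))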

When I read your paper and its claim that a trained DeCLUTR can outperform other (well-known) embedding encoders, I started using your notebook training.ipynb to train it on our data.

(...) and what you hope to achieve by measuring the contrastive loss on a validation set?

Well, if you train a model on a dataset without verifying at each checkpoint how well the training is going, you do not know whether you need to train for more epochs, with another learning rate, etc. Even with a contrastive loss, you can check whether your validation loss keeps decreasing (in order to avoid overfitting).
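
To illustrate what such a check buys you (plain Python, not code from this repo), here is a minimal early-stopping rule over a list of per-epoch validation losses; AllenNLP's trainer.patience setting implements essentially this once a validation loss is actually reported:

def should_stop_early(validation_losses, patience=2):
    """Stop if the validation loss has not improved for `patience` consecutive epochs."""
    if len(validation_losses) <= patience:
        return False
    best_so_far = min(validation_losses[:-patience])
    return all(loss >= best_so_far for loss in validation_losses[-patience:])

# Example: the loss improves, then stagnates for two epochs -> stop.
print(should_stop_early([2.1, 1.7, 1.5, 1.52, 1.55], patience=2))  # True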

Unfortunately, your notebook training.ipynb does not allow that, as the contrastive loss is not used when evaluating on a validation dataset. As a matter of fact, here are some lines printed by the allennlp train script when a validation dataset is used:

(...)
batch_loss: 0.0000, loss: 0.0000 ||: 100%|##########| 989/989 [18:38<00:00,  1.13s/it]
2022-03-22 09:27:09,791 - INFO - allennlp.training.tensorboard_writer -                        Training |  Validation
2022-03-22 09:27:09,805 - INFO - allennlp.training.tensorboard_writer - loss               |     2.057  |     0.000
2022-03-22 09:27:12,087 - INFO - allennlp.training.checkpointer - Best validation performance so far. Copying weights to 'output/best.th'.
(...)

As you can see, validation loss = 0.00, which is wrong.

You can test it through the modified version of your notebook that I put in Colab: training_with_validation_dataset.ipynb

I don’t really plan to spend time on it

If you change your plan, I'll be happy to help.

JohnGiorgi commented 2 years ago

I am participating in an academic project aimed at document similarity.

Cool! Any reason you want to pre-train our model instead of just using the pre-trained models as is? We have pre-trained models for both general-domain text and scientific text. There are also many pre-trained sentence embedding methods that have been proposed before and after DeCLUTR (see https://www.sbert.net/) that you could use "off-the-shelf" (unless your text comes from a unique domain not covered by these models' pre-training) or that you could fine-tune on some labelled data (if you have it).

Well, if you train a model with a dataset without verifying for each checkpoint how well goes the training, you do not know if you need to train with more epochs, with another learning rate

I think there's still a major misunderstanding here. Without repeating myself, I would encourage you to look at the popular self-supervised literature (like BERT or SimCLR) and notice that they don't use validation sets to measure the performance of the self-supervised objectives either (as far as I can tell), so I don't think our choice to do the same is "strange". You would be much better off using some downstream task(s) that you care about to tune the hyperparameters of the pre-training stage. Or, maybe better yet, fine-tune the whole pre-trained encoder on some labelled data (if you have it). See https://www.sbert.net/docs/training/overview.html, which has instructions for fine-tuning sentence encoders on your own data. You can load DeCLUTR's weights into this library easily.
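
For example, here is a minimal sketch of loading a published checkpoint into sentence-transformers (assuming the johngiorgi/declutr-small checkpoint on the Hugging Face Hub and a recent sentence-transformers version):

from sentence_transformers import SentenceTransformer, models

# Wrap the pre-trained encoder and add mean pooling over token embeddings.
word_embedding_model = models.Transformer("johngiorgi/declutr-small")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = model.encode(["A first sentence.", "A second sentence."])
print(embeddings.shape)  # (2, hidden_size)

From there, the fine-tuning recipes in the sbert.net documentation linked above apply directly.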

If you change your plan, I'll be happy to help.

Feel free to make a PR or fork the repo, if this is really important to you!

piegu commented 2 years ago

Hi @JohnGiorgi

Any reason you want to pre-train our model instead of just using the pre-trained models as is?

Do you have pre-trained DeCLUTR models in Portuguese for the Brazilian legal domain (one model) and for the Brazilian health domain (a second model)?

There are also many pre-trained sentence embedding methods that have been proposed before and after DeCLUTR (see https://www.sbert.net/)

Of course, I can use embedding encoder models other than DeCLUTR, but since you showed in your paper that DeCLUTR is better than the others (and is "easy" to train with unlabeled data), I wanted to use it instead.

(...) or that you could fine-tune on some labelled data (if you have it)

The same concern applies regarding language and domain specificity (see above).

Without repeating myself I would encourage you to look at popular self-supervised literature (like BERT or SimCLR) and notice that they don't use validation sets to measure the performance of the self-supervised objectives either (as far as I can tell)

Can we keep the discussion open on both options, training "without a validation dataset" and "with a validation dataset"?

JohnGiorgi commented 2 years ago

Do you have pre-trained DeCLUTR models in Portuguese for the Brazilian legal domain (one model) and for the Brazilian health domain (a second model)?

Gotcha. Sounds like you do need to train from scratch (or possibly check out some of the work on language-agnostic sentence embeddings).

Jacob Devlin from Google explained about BERT training in 2018 (https://github.com/google-research/bert/issues/95#issuecomment-437599265) that "The best way...

Yup, I am simply echoing Devlin's advice here. Do you have a downstream task that measures what you care about and could be used for validation, testing, and hyperparam tuning?

but not in your paper that gives epochs and LR values without indications about how to test them).

Hmm, that's not true. Section 4.1, under Training, says: "Hyperparameters were tuned on the SentEval validation sets." Is there confusion about how we did that?

More generally, why not try starting with the default hyperparameters, which worked well for us across SentEval's 18 downstream tasks and 10 probing tasks? I know your domains are quite different, but I don't think there's reason to suspect these hyperparameters would perform very poorly.

JohnGiorgi commented 2 years ago

Closing this! @piegu please feel free to re-open if you still have questions.