alasdairtran / transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning
https://transform-and-tell.ml/

Question on Training Epoch #25

tjuwyh commented 3 years ago

Hi Alasdair, thank you for your great work. I have two questions and hope you can help me out.

  1. How many epochs of training are required to obtain the results shown in the paper? I re-trained the "5_transformer_roberta" variant from scratch and got a CIDEr score of 35.6 after 35 epochs on the GoodNews dataset, which is quite low compared to the result reported in the paper (CIDEr: 48.5). However, the paper says that "training is stopped after the model has seen 6.6 million examples. This is equivalent to 16 epochs on GoodNews and 9 epochs on NYTimes800k", which confuses me. I also noticed that in the provided checkpoint files (metrics_epoch_99.json) of "5_transformer_roberta", the value of "best_epoch" is 99. Does this mean the best-performing checkpoint was obtained after training for 99 epochs?

  2. Can the training process be stopped early? If so, could you give me some guidance on how that works?

Thanks!

alasdairtran commented 3 years ago

If you look at the relevant config file, you can see that I trained it for 100 epochs. But in AllenNLP, an epoch doesn't necessarily go through the whole dataset. In the same config file, the parameter iterator.instances_per_epoch is set to 65536. So each epoch only iterates through 65536 examples. I set it to this number so that I could get a checkpoint roughly every hour (AllenNLP only checkpoints at the end of an epoch).
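In other words, the two relevant settings look roughly like this (a sketch of the keys discussed above; the exact nesting in the repo's config.yaml may differ slightly):

```yaml
iterator:
  instances_per_epoch: 65536   # one "epoch" = 65536 examples, not a full pass
trainer:
  num_epochs: 100              # 100 epochs of 65536 examples each
```

Note that 100 × 65536 ≈ 6.6 million, which is exactly the "6.6 million examples" figure quoted from the paper.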

If you have trained the model for 35 "full epochs", then you have probably trained it for twice as long as my experiments. Since you're getting a low CIDEr, it could be due to the optimizer. Are you using the same optimizer and scheduler settings (see the settings under trainer in the config file)?

And yep, the best epoch for this experiment is epoch 99 (the final epoch).

None of the experiments I ran used early stopping. It took me roughly 4 days to train the model on a Titan V GPU. I actually think you can push the performance even further simply by training for longer: if you visualize the learning curve, it doesn't seem to have plateaued even after 100 epochs.

If you'd like to reproduce the numbers reported in the paper, the easiest way is to set up the conda environment (see the README file) and the mongo database (send me an email if you haven't got the mongo dump), and then run:

tell train expt/goodnews/5_transformer_roberta/config.yaml -f

which would hopefully give you roughly the same numbers after 4 days.

Hope that helps :-)

tjuwyh commented 3 years ago

Thanks, Tran! Actually, I followed the same optimizer and scheduler settings in the provided config file. But I didn't realize that one epoch in the default setting isn't a "full epoch". So I think just continuing training to 100 epochs may produce reasonably good results. Thanks again for your help.

tjuwyh commented 3 years ago

I have attempted to fine-tune the trained checkpoint (already trained for 100 epochs) to see whether it can achieve better performance. Specifically, I changed num_epochs in config.yaml to 120. However, a ConfigurationError was raised ("Training configuration does not match the configuration we're recovering from."). Do you have any idea how to handle this other than training from scratch?

alasdairtran commented 3 years ago

This is a bit of a hack, but you can restore the weights manually in the init function of the model. For example, have a look at how I restored the model weights in TransformerPointerModel here, and how I then specify the path to the checkpoint here. Then you can change num_epochs to 20, for example. You might also want to change t_total (the total number of training steps) so that the learning rate scheduler knows in advance how many steps you're training for.
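In pseudo-form, the hack looks something like this (a minimal sketch, not the repo's exact code; weights_path is a hypothetical constructor argument that you would point at the 100-epoch checkpoint via the config):

```python
import torch
from allennlp.data import Vocabulary
from allennlp.models import Model


class MyCaptioningModel(Model):  # hypothetical model class
    def __init__(self, vocab: Vocabulary, weights_path: str = None) -> None:
        super().__init__(vocab)
        # ... construct all the usual layers first ...
        if weights_path is not None:
            # Manually restore the previously trained weights, so that a
            # "fresh" run starts from the old checkpoint instead of tripping
            # AllenNLP's recovery configuration check.
            state_dict = torch.load(weights_path, map_location="cpu")
            self.load_state_dict(state_dict)
```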

tjuwyh commented 3 years ago

Thanks! I'm kind of interested in the TransformerPointerModel since it is not included in your paper. Does it mean a Transformer model with a pointer network? And how does it work?

alasdairtran commented 3 years ago

Yeah, it uses a pointer network. I started with the observation that our best model was struggling with generating rare entity names. So I tried to implement TransformerPointerModel, where at each step, the model gets to decide whether or not to copy a word from the context article. During training, we label all named entities in the caption as words that the model should copy. The hypothesis was that by giving the model the ability to copy words, it would be able to recognise named entities more accurately.
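If it helps, the standard generate-vs-copy mixture at a single decoding step looks something like this (a minimal sketch of the general pointer-network technique, not the actual TransformerPointerModel code; all names here are made up):

```python
import torch
import torch.nn.functional as F


def mix_generate_and_copy(vocab_logits, copy_attention, source_token_ids, gate_logit):
    """Combine generating from the vocabulary with copying from the article.

    vocab_logits:     [batch, vocab_size] decoder scores over the vocabulary
    copy_attention:   [batch, source_len] attention over the article tokens
    source_token_ids: [batch, source_len] vocabulary ids of the article tokens
    gate_logit:       [batch, 1]          the model's copy-vs-generate decision
    """
    p_copy = torch.sigmoid(gate_logit)               # probability of copying
    generate_dist = F.softmax(vocab_logits, dim=-1)
    # Scatter the attention mass onto the vocabulary ids of the source
    # tokens: copying a source position means emitting that position's word.
    copy_dist = torch.zeros_like(generate_dist)
    copy_dist.scatter_add_(1, source_token_ids, copy_attention)
    return (1.0 - p_copy) * generate_dist + p_copy * copy_dist
```

The named entities labelled in the caption during training provide the supervision signal for the copy side of this mixture.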

However, this wasn't able to beat the performance of the model without copying, so I didn't include the results in the paper. But I kept the code and config anyway. Maybe you'll have better luck :-)