AvinashBukkittu closed this pull request 4 years ago.
Merging #311 into master will not change coverage. The diff coverage is n/a.
@@           Coverage Diff           @@
##           master     #311   +/-   ##
=======================================
  Coverage   79.91%   79.91%
=======================================
  Files         133      133
  Lines       11135    11135
=======================================
  Hits         8899     8899
  Misses       2236     2236
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95f3d64...6d73a8c. Read the comment docs.
Have you rerun the fine-tuning experiments? It would be best to ensure that we're doing it right, and that the results improve compared to when we did it wrong. We might also want to update the README with the new outputs.
I did not completely re-run the experiments. I re-ran training up to the first `_eval_epoch` call and checked the mode and the current dataset in the iterator right after that call: the mode was still `eval` and the current dataset was the eval dataset, which is wrong (see the sketch below).
Sure, I can update the README with the new results.
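For context, here is a minimal, self-contained sketch of the failure mode described above. The class and method names (`DataIterator`, `switch_to_dataset`) are simplified stand-ins chosen for illustration, not the actual texar-pytorch implementation; the point is only that a shared iterator left on the eval split after `_eval_epoch` silently feeds eval data back into training unless it is switched back.

```python
# Toy illustration of the mode-switching bug (all names simplified; not the
# project's real code).

class DataIterator:
    """Stand-in for an iterator shared across train/eval datasets."""
    def __init__(self):
        self._current = "train"

    def switch_to_dataset(self, mode: str) -> None:
        self._current = mode

    @property
    def current_dataset_name(self) -> str:
        return self._current


def _eval_epoch(iterator: DataIterator) -> None:
    # Evaluation switches the shared iterator to the eval split ...
    iterator.switch_to_dataset("eval")
    # ... run evaluation batches here ...
    # Bug: without switching back, the caller keeps consuming eval data.


def _train_epoch(iterator: DataIterator, eval_every: int = 100,
                 num_iters: int = 300) -> None:
    iterator.switch_to_dataset("train")
    for step in range(1, num_iters + 1):
        # ... one training step on the iterator's current dataset ...
        if step % eval_every == 0:
            _eval_epoch(iterator)
            # Fix: restore the training split after mid-epoch evaluation.
            iterator.switch_to_dataset("train")
            assert iterator.current_dataset_name == "train"
```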
Hmm, interesting... The loss values now are much lower than before (by one order of magnitude), starting from iteration 250, and that seems a bit weird to me. Also, I checked the code at the commit where the README was modified to include the results: https://github.com/asyml/texar-pytorch/blob/e4b68188388dbaa07791528d066b410a6f838de7/examples/bert/config_data.py#L9
It seems that `eval_step` was set to -1, which means no evaluation is performed until the end of training. In this case, the previous results were not incorrect (although the implementation was still faulty).
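For reference, a hedged sketch of the config knob being discussed; the variable name follows the thread's `eval_step`, and the surrounding values are purely illustrative rather than copied from the linked config_data.py.

```python
# Illustrative training-data config in the spirit of examples/bert/config_data.py.
# Only the eval_step semantics matter here; the other value is made up.

display_steps = 50   # print training loss every 50 iterations (illustrative)

# -1: skip _eval_epoch during training and evaluate only after training ends,
# so the mode-switching bug never fires and the old README numbers stay valid.
# Any positive N would run evaluation every N iterations, exposing the bug.
eval_step = -1
```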
We start noticing a stark difference at the 300th iteration. Agreed; previously, after the first call to `_eval_epoch()`, the mode was still set to `eval`.
This PR fixes #310.