Open · prokotg opened this issue 3 years ago
Hello, I ran into the same problem as you during GAN training. I'm sorry, but I didn't understand your solution; could you send me a fixed fairseq/logging/progress_bar.py? My email is getwebshells@gmail.com. Thank you very much for your help. If possible, may I contact you by email?
It's best if we communicate here, someone else might benefit too :))
There's my PR linked to this issue where both fairseq/logging/progress_bar.py and fairseq_cli/train.py are properly changed.
To describe the problem a little bit more: the hydra sweeper runs multiple configs in series, but all within one process, which means global variables are shared. If you look at how objects are written to _tensorboard_writers, you will see that the keys correspond to the data splits valid and train. The problem is that when the sweeper runs the next configuration, these writers are not cleared, so they are re-used, since the same keys are used in every training procedure. The writers are re-used but the output location changes, and because the writers are already open they try to write to a file named after the old configuration but in the new place, hence the error. If you look at the tfevent file from the previous configuration, you will notice that the name it tries to write under the new configuration is the same, which in general should not happen.
OK, using your PR successfully solved my problem. Now I can train for a long time without being interrupted when new configurations are loaded. I am currently running a wav2vec-U experiment on the TIMIT dataset, using the same parameter settings as in the paper, but during training the loss shows an upward trend and is difficult to converge, resulting in a final WER as high as 80%. Have you encountered any problems in this regard, and could you give me some suggestions?
🐛 Bug
Tensorboard writers are not cleared between hydra configurations
To Reproduce
This problem was spotted while running training of Wav2Vec-U with default parameters:
When hydra switches to a different configuration, the _tensorboard_writers from fairseq/logging/progress_bar.py are not cleared, which results in a SummaryWriter trying to write to a file named as in the previous configuration but in a different directory. Writers from previous configurations are still used because the keys in _tensorboard_writers are valid and train, which are the same across configs. I will attach a pull request as my (simple) proposition to resolve this.
Expected behavior
Training continues with the new configuration without an exception being thrown.
Environment
How you installed fairseq (pip, source): source