facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Tensorboard writers are not cleared between hydra configurations #3817

Open prokotg opened 2 years ago

prokotg commented 2 years ago

🐛 Bug

Tensorboard writers are not cleared between hydra configurations

To Reproduce

This problem was spotted while running Wav2Vec-U training with default parameters:

TASK_DATA=/path/to/features/precompute_unfiltered_pca512_cls128_mean_pooled  
TEXT_DATA=/path/to/data/phones  # path to fairseq-preprocessed GAN data (phones dir)
KENLM_PATH=/path/to/data/phones/kenlm.phn.o4.bin  # KenLM 4-gram phoneme language model (LM data = GAN data here)

PYTHONPATH=$FAIRSEQ_ROOT PREFIX=$PREFIX fairseq-hydra-train \
    -m --config-dir config/gan \
    --config-name w2vu \
    task.data=${TASK_DATA} \
    task.text_data=${TEXT_DATA} \
    task.kenlm_path=${KENLM_PATH} \
    common.user_dir=${FAIRSEQ_ROOT}/examples/wav2vec/unsupervised \
    model.code_penalty=2,4 model.gradient_penalty=1.5,2.0 \
    model.smoothness_weight=0.5,0.75,1.0 'common.seed=range(0,5)'

When hydra switches to a different configuration, the module-level _tensorboard_writers dict in fairseq/logging/progress_bar.py is not cleared, so a SummaryWriter ends up trying to write to a file named after the previous configuration but located in a different directory. Writers from previous configurations keep being used because the keys of _tensorboard_writers (the split names, e.g. train and valid) are the same across configurations.
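
For illustration, here is a minimal sketch of the failure mode (simplified, not the actual fairseq source; the helper shape and paths are illustrative only):

from torch.utils.tensorboard import SummaryWriter
import os

_tensorboard_writers = {}  # module-level cache: shared by every job in the sweep

def _writer(log_dir, key):
    # Keyed only by the split name, so a writer created for job 1 is returned
    # again in job 2 even though log_dir has changed.
    if key not in _tensorboard_writers:
        _tensorboard_writers[key] = SummaryWriter(os.path.join(log_dir, key))
    return _tensorboard_writers[key]

# Job 1 of the sweep: a writer is created under job 1's output directory.
_writer("multirun/job1/tb", "valid").add_scalar("loss", 1.0, 0)

# Job 2 of the sweep: "valid" is already cached, so the stale writer is
# returned. It still points at job 1's event file, so once hydra has switched
# to job 2's working directory the background event-file thread cannot find
# that file and raises FileNotFoundError, as in the traceback below.
_writer("multirun/job2/tb", "valid").add_scalar("loss", 2.0, 0)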

Traceback (most recent call last):
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 519, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 150, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 154, in _write
    with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'tb/valid/events.out.tfevents.1629890258.somemachine.3252802.0'

^CTraceback (most recent call last):
  File "somepath/someuser/miniconda3/envs/w2vu/bin/fairseq-hydra-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-hydra-train')())
  File "somepath/someuser/fairseq/fairseq_cli/hydra_train.py", line 84, in cli_main
    hydra_main()
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/hydra/main.py", line 37, in decorated_main
    strict=strict,
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 355, in _run_hydra
    lambda: hydra.multirun(
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 358, in <lambda>
    overrides=args.overrides,
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 136, in multirun
    return sweeper.sweep(arguments=task_overrides)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 80, in launch
    job_subdir_key="hydra.sweep.subdir",
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/hydra/core/utils.py", line 129, in run_job
    ret.return_value = task_function(task_cfg)
  File "somepath/someuser/fairseq/fairseq_cli/hydra_train.py", line 28, in hydra_main
    _hydra_main(cfg)
  File "somepath/someuser/fairseq/fairseq_cli/hydra_train.py", line 53, in _hydra_main
    distributed_utils.call_main(cfg, pre_main, **kwargs)
  File "somepath/someuser/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "somepath/someuser/fairseq/fairseq_cli/train.py", line 173, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "somepath/someuser/fairseq/fairseq_cli/train.py", line 302, in train
    cfg, trainer, task, epoch_itr, valid_subsets, end_of_epoch
  File "somepath/someuser/fairseq/fairseq_cli/train.py", line 388, in validate_and_save
    valid_losses = validate(cfg, trainer, task, epoch_itr, valid_subsets)
  File "somepath/someuser/fairseq/fairseq_cli/train.py", line 466, in validate
    progress.print(stats, tag=subset, step=trainer.get_num_updates())
  File "somepath/someuser/fairseq/fairseq/logging/progress_bar.py", line 374, in print
    self._log_to_tensorboard(stats, tag, step)
  File "somepath/someuser/fairseq/fairseq/logging/progress_bar.py", line 392, in _log_to_tensorboard
    writer.add_scalar(key, stats[key], step)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 349, in add_scalar
    self._get_file_writer().add_summary(summary, global_step, walltime)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 96, in add_summary
    self.add_event(event, global_step, walltime)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 81, in add_event
    self.event_writer.add_event(event)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 113, in add_event
    self._async_writer.write(event.SerializeToString())
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 166, in write
    self._byte_queue.put(bytestring)
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/queue.py", line 139, in put
    self.not_full.wait()
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 176, in flush
    self._byte_queue.join()
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/queue.py", line 89, in join
    self.all_tasks_done.wait()
  File "somepath/someuser/miniconda3/envs/w2vu/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

I will attach a pull request with my (simple) proposed fix for this.

Expected behavior

Training continues with the next configuration without throwing an exception.

Environment

lsrami commented 2 years ago

Hello, I ran into the same problem as you during GAN training. I'm sorry, I didn't understand your solution; could you send me a repaired fairseq/logging/progress_bar.py? My email is getwebshells@gmail.com. Thank you very much for your help. If possible, can I contact you by email?

prokotg commented 2 years ago

It's best if we communicate here; someone else might benefit too :))

There's a PR of mine linked to this issue where both fairseq/logging/progress_bar.py and fairseq_cli/train.py are changed accordingly.

To describe the problem a little bit more: the hydra sweeper runs multiple configs in series, but all within one process, which means global variables are shared. If you look at how objects are written to _tensorboard_writers, you will see that the keys correspond to the data splits (valid, train). The problem is that when the sweeper runs the next configuration, these writers are not cleared, so they are re-used, since the same keys come up in every training run. The writers are re-used but the output location changes, and because they are already open they try to write to a file named after the old configuration but in the new location, hence the error. If you look at the tfevents file from the previous configuration, you will notice that the filename it tries to write under the new configuration is the same, which in general should not happen.
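
A minimal sketch of one way to clear the writers between sweep jobs (the linked PR may differ in detail; the helper name here is hypothetical). The _tensorboard_writers dict is the module-level cache mentioned above:

from fairseq.logging import progress_bar

def reset_tensorboard_writers():
    # Close every cached SummaryWriter and drop it, so the next hydra job
    # creates fresh writers under its own log directory.
    for writer in progress_bar._tensorboard_writers.values():
        writer.close()
    progress_bar._tensorboard_writers.clear()

# Called once per sweep job, e.g. after the training entry point returns and
# before hydra launches the next configuration.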

lsrami commented 2 years ago

Ok, using your PR successfully solved my problem. Now I can train for a long time without being interrupted when new configurations are loaded. I am currently running a wav2vec-U experiment on the TIMIT dataset, using the same parameter settings as in the paper, but during training the loss trends upward and struggles to converge, resulting in a final WER as high as 80%. Have you encountered any problems like this, and would you be willing to give me some suggestions?