Lightning-AI / pytorch-lightning

Add retry logic for I/O operations under Windows #9705

Closed · cowwoc closed this issue 2 years ago

cowwoc commented 2 years ago

🐛 Bug

When running under Windows, I get intermittent I/O errors such as:

Traceback (most recent call last):
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\predict_outdoor_temperature.py", line 820, in <module>
    main()
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\predict_outdoor_temperature.py", line 799, in main
    tune_hyperparameters(graph_queue)
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\predict_outdoor_temperature.py", line 705, in tune_hyperparameters
    study.optimize(lambda trial: optimize_train(trial, graph),
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\study.py", line 400, in optimize
    _optimize(
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\_optimize.py", line 264, in _run_trial
    raise func_err
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\predict_outdoor_temperature.py", line 705, in <lambda>
    study.optimize(lambda trial: optimize_train(trial, graph),
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\predict_outdoor_temperature.py", line 671, in optimize_train
    return train(dataset, learning_rate, max_epochs, seq2seq_type, seq2seq_layers, linear_layers,
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\predict_outdoor_temperature.py", line 445, in train
    trainer.fit(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 552, in fit
    self._run(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 917, in _run
    self._dispatch()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 995, in run_stage
    return self._run_train()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 112, in run
    self.on_advance_end()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 177, in on_advance_end
    self._run_validation()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 257, in _run_validation
    self.val_loop.run()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 118, in run
    output = self.on_run_end()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 141, in on_run_end
    self.on_evaluation_end()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 204, in on_evaluation_end
    self.trainer.call_hook("on_validation_end", *args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1219, in call_hook
    trainer_hook(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 229, in on_validation_end
    callback.on_validation_end(self, self.lightning_module)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 322, in on_validation_end
    self.save_checkpoint(trainer)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 386, in save_checkpoint
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 750, in _save_none_monitor_checkpoint
    self._del_model(trainer, self.best_model_path)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 528, in _del_model
    self._fs.rm(filepath)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\fsspec\implementations\local.py", line 149, in rm
    os.remove(p)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:/Users/Gili/Documents/myproject/aggregator/src/main/python/ai/lightning_logs/version_163/checkpoints/epoch=149-step=149.ckpt'

As far as I can tell, there is no workaround for this. I expect the library to retry for up to 30 seconds in this case, because files are routinely locked by antivirus scanners and other background processes. In an ideal world, you would add this retry mechanism for all I/O operations across the entire library...
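For illustration, here is a minimal sketch of the kind of retry being requested, written against the private `_del_model(trainer, filepath)` hook that appears in the traceback above. The class name and the 30-second/0.5-second retry parameters are illustrative assumptions, and private hooks can change between releases, so treat this as a stopgap rather than a supported API:

```python
import time

from pytorch_lightning.callbacks import ModelCheckpoint


class RetryingModelCheckpoint(ModelCheckpoint):
    """Retry checkpoint deletion when Windows reports a transient file lock.

    Overrides the private ``_del_model(trainer, filepath)`` hook shown in
    the traceback above; since the hook is private, this is a stopgap,
    not a supported API.
    """

    def _del_model(self, trainer, filepath):
        deadline = time.monotonic() + 30.0  # give up after ~30 seconds
        while True:
            try:
                return super()._del_model(trainer, filepath)
            except PermissionError:  # e.g. WinError 32: file in use
                if time.monotonic() >= deadline:
                    raise
                time.sleep(0.5)  # wait for the scanner to release the file
```

Passing an instance via `Trainer(callbacks=[RetryingModelCheckpoint()])` would then let checkpoint deletion tolerate short-lived locks instead of failing the whole Optuna trial.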

carmocca commented 2 years ago

It's probably better to raise this issue in https://github.com/intake/filesystem_spec
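Since the failing call in the traceback is fsspec's `LocalFileSystem.rm` (which ends in a bare `os.remove`), the retry could equally live at that layer, which is one reason filesystem_spec is a plausible home for the fix. Until then, a fragile monkeypatch sketch, with the same assumed 30-second budget as above:

```python
import time

from fsspec.implementations.local import LocalFileSystem

_original_rm = LocalFileSystem.rm


def _rm_with_retry(self, path, *args, **kwargs):
    """Retry LocalFileSystem.rm while Windows holds a transient lock."""
    deadline = time.monotonic() + 30.0
    while True:
        try:
            return _original_rm(self, path, *args, **kwargs)
        except PermissionError:  # e.g. WinError 32: file in use
            if time.monotonic() >= deadline:
                raise
            time.sleep(0.5)


LocalFileSystem.rm = _rm_with_retry  # apply before training starts
```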

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!