eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

The `cleanup` method of the `TrainingModelSaver` raises `FileNotFoundError` #9

l-k-11235 closed this issue 1 month ago

l-k-11235 commented 1 month ago

I got this error when the model saver tried to clean up the checkpoint directory:

```
Traceback (most recent call last):
  File "/usr/local/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('EOLE', 'console_scripts', 'eole')())
  File "wokdir/eole/eole/bin/main.py", line 39, in main
    bin_cls.run(args)
  File "/wokdir/eole/eole/bin/run/train.py", line 68, in run
    train(config)
  File "/wokdir//eole/eole/bin/run/train.py", line 55, in train
    train_process(config, device_id=0)
  File "/wokdir//eole/eole/train_single.py", line 248, in main
    trainer.train(
  File "/wokdir/eole/eole/trainer.py", line 363, in train
    self.model_saver.save(step, moving_average=self.moving_average)
  File "/wokdir//eole/eole/models/model_saver.py", line 319, in save
    self._save(step)
  File "/wokdir/eole/eole/models/model_saver.py", line 298, in _save
    self.cleanup()
  File "/wokdir//eole/eole/models/model_saver.py", line 135, in cleanup
    shutil.rmtree(step_dir_to_delete)
  File "/usr/lib/python3.10/shutil.py", line 715, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 713, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: 'step_500'
```
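
For illustration, here is a minimal sketch of how a cleanup step could tolerate a checkpoint directory that is already gone. `remove_step_dir` is a hypothetical helper written for this example, not eole's actual code, and the real `cleanup` method may be structured differently. The path in the error message (`'step_500'`) is relative, so the failure may also come from the process's working directory not matching the checkpoint directory; this sketch does not address that possibility.

```python
import shutil
import tempfile
from pathlib import Path


def remove_step_dir(step_dir):
    """Delete a checkpoint directory (hypothetical helper, not eole's API).

    Returns True if the directory was removed, False if it did not exist.
    """
    try:
        shutil.rmtree(step_dir)
    except FileNotFoundError:
        # The directory was already removed or never existed at this path.
        return False
    return True


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as base:
    step_dir = Path(base) / "step_500"
    step_dir.mkdir()
    first = remove_step_dir(step_dir)   # directory exists, gets removed
    second = remove_step_dir(step_dir)  # directory is gone, no exception
```

Catching only `FileNotFoundError` keeps other failures (for example, permission errors) visible, whereas `shutil.rmtree(path, ignore_errors=True)` would silence them all.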