ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

run_train.py crashes if swa fails to make progress #339

Closed bernstei closed 6 months ago

bernstei commented 6 months ago

If SWA makes no progress, for whatever reason, run_train.py crashes with this error:

/home/Software/python/system/torch/2.0.1/gpu/lib64/python3.9/site-packages/torch/jit/_check.py:172: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "
Traceback (most recent call last):
  File "/home/cluster2/bernstei/.local/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/cli/run_train.py", line 594, in main
    epoch = checkpoint_handler.load_latest(
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/tools/checkpoint.py", line 210, in load_latest
    result = self.io.load_latest(swa=swa, device=device)
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/tools/checkpoint.py", line 171, in load_latest
    path = self._get_latest_checkpoint_path(swa=swa)
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/tools/checkpoint.py", line 152, in _get_latest_checkpoint_path
    return latest_checkpoint_info.path
UnboundLocalError: local variable 'latest_checkpoint_info' referenced before assignment

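The run got to the end of training, but it seems to crash when handling the best regular and SWA checkpoints? (The traceback points at load_latest, so the failure is in loading the latest checkpoint.)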

I'm guessing that latest_checkpoint_info is not defined because swa never made progress and hence no checkpoints were written during the swa phase.
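To illustrate the guess above, here is a minimal sketch of how this kind of UnboundLocalError can arise and one way to guard against it. It is not the code in mace/tools/checkpoint.py: the CheckpointInfo record, the two function names, and the file name in the usage are hypothetical, and the real selection logic may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckpointInfo:
    # Hypothetical record for illustration only, not MACE's actual class.
    path: str
    epoch: int
    swa: bool


def latest_path_unsafe(infos: List[CheckpointInfo], swa: bool) -> str:
    # The local name is only bound when at least one checkpoint matches.
    selected = [info for info in infos if info.swa == swa]
    if selected:
        latest_checkpoint_info = max(selected, key=lambda info: info.epoch)
    # With no matching checkpoint, the name was never assigned, so this
    # line raises UnboundLocalError.
    return latest_checkpoint_info.path


def latest_path_safe(infos: List[CheckpointInfo], swa: bool) -> Optional[str]:
    # Return None explicitly when no checkpoint of the requested kind exists,
    # so the caller can fall back (for example to the non-SWA checkpoint).
    selected = [info for info in infos if info.swa == swa]
    if not selected:
        return None
    return max(selected, key=lambda info: info.epoch).path
```

With only a regular checkpoint on record, latest_path_safe(infos, swa=True) returns None, while latest_path_unsafe(infos, swa=True) raises the same exception type as in the traceback.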

pobo95 commented 6 months ago

What are your --max_num_epochs and --start_swa values? Maybe --start_swa is higher than --max_num_epochs.
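A quick way to check this before launching a run could look like the sketch below. The helper name is hypothetical, and it assumes epochs are numbered from 0 to max_num_epochs - 1 and that the SWA phase covers epochs at or after start_swa; MACE's own counting may differ.

```python
def swa_phase_is_reachable(max_num_epochs: int, start_swa: int) -> bool:
    # Assumption: epochs run from 0 to max_num_epochs - 1, and SWA begins
    # at epoch start_swa. If start_swa is not below max_num_epochs, the SWA
    # phase never runs and no SWA checkpoint is written.
    return 0 <= start_swa < max_num_epochs
```

For example, swa_phase_is_reachable(100, 150) is False, which would match the situation suspected here.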

ilyes319 commented 6 months ago

Fixed in the latest merge of the repulsion branch into develop.