Error while using foundation branch

jungsdao commented 5 months ago

There's another issue with the checkpoint file. I was running a script that worked fine on the main branch, but when I switched to the foundations branch I got the following error message. I think the error occurs when trying to read the checkpoint file from the previous run. The checkpoint file is located at checkpoints/umbrella_run-123_epoch-3892_swa.pt. Am I missing something again, or is this a bug in the foundation branch? Any comment would be appreciated. Thank you.

Traceback (most recent call last):
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/cli/run_train.py", line 525, in main
    opt_start_epoch = checkpoint_handler.load_latest(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 215, in load_latest
    self.builder.load_checkpoint(state=state, checkpoint=checkpoint, strict=strict)
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 41, in load_checkpoint
    state.optimizer.load_state_dict(checkpoint["optimizer"])
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/torch/optim/optimizer.py", line 385, in load_state_dict
    raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hjung/miniforge3/envs/mace_env/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/cli/run_train.py", line 531, in main
    opt_start_epoch = checkpoint_handler.load_latest(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 210, in load_latest
    result = self.io.load_latest(swa=swa, device=device)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 171, in load_latest
    path = self._get_latest_checkpoint_path(swa=swa)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 149, in _get_latest_checkpoint_path
    latest_checkpoint_info = max(
                             ^^^^
ValueError: max() arg is an empty sequence
Traceback (most recent call last):
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/0_CHO/batch_gap_fit.py", line 751, in <module>
    main(ecutwfc = ecutwfc, cv_range = args.cv_range, verbose=True)
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/0_CHO/batch_gap_fit.py", line 741, in main
    get_mace(training_configs, mace_name=mace_name, mace_fit_params=mace_params, input_files=[prev_checkpoint], prev_checkpoint_file=prev_checkpoint, run_dir=f"MACE/fit_{fit_idx}")
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/0_CHO/batch_gap_fit.py", line 408, in get_mace
    fit(in_configs, mace_name, mace_fit_params, mace_fit_cmd,
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/fit/mace.py", line 190, in fit
    raise exc
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/fit/mace.py", line 178, in fit
    subprocess.run(mace_fit_cmd, shell=True, check=True)
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '/home/hjung/miniforge3/envs/mace_env/bin/mace_run_train --E0s='{1: -14.9005442054276, 6: -162.973421385767, 8: -438.578998764142, 45: -3089.70420527816}' --energy_key='DFT_energy' --forces_key='DFT_forces' --correlation=3 --device='cuda' --ema --energy_weight=1 --forces_weight=10 --error_table='PerAtomMAE' --eval_interval=4 --max_L=1 --max_num_epochs=4200 --model='MACE' --name='umbrella' --num_channels=128 --num_interactions=2 --patience=30 --r_max=6.0 --restart_latest --save_cpu --scheduler_patience=15 --start_swa=4050 --swa --train_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/0_CHO/MACE/training_13.xyz' --batch_size=16 --valid_batch_size=16 --valid_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/0_CHO/MACE/validation_13.xyz'' returned non-zero exit status 1.
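
The second traceback above ends in max() being called on an empty sequence while searching for the latest checkpoint. One plausible reading, which is an assumption and not confirmed in this thread, is that after the first load fails a retry looks for a checkpoint of a different kind (for example one without the _swa suffix) and finds none. Below is a minimal sketch of that failure mode and of how max(..., default=None) avoids it. The regex and the helper latest_checkpoint are hypothetical, modeled only on the file name quoted in the report; they are not MACE's actual implementation.

```python
import re
from typing import Iterable, Optional

# Hypothetical pattern modeled on the file name in the report
# (umbrella_run-123_epoch-3892_swa.pt); MACE's own pattern may differ.
CKPT_RE = re.compile(r"^(?P<tag>.+)_epoch-(?P<epoch>\d+)(?P<swa>_swa)?\.pt$")


def latest_checkpoint(names: Iterable[str], tag: str, swa: bool) -> Optional[str]:
    """Return the highest-epoch checkpoint name for `tag`, or None if none match."""
    candidates = []
    for name in names:
        m = CKPT_RE.match(name)
        if m and m.group("tag") == tag and bool(m.group("swa")) == swa:
            candidates.append((int(m.group("epoch")), name))
    # max() on an empty list raises ValueError; default=None avoids that.
    best = max(candidates, default=None)
    return best[1] if best else None


files = ["umbrella_run-123_epoch-3892_swa.pt"]
print(latest_checkpoint(files, "umbrella_run-123", swa=True))
print(latest_checkpoint(files, "umbrella_run-123", swa=False))
```

With only an _swa file in the folder, the first call finds it and the second returns None instead of raising, which makes the "no matching checkpoint" case explicit to the caller.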

ilyes319 commented 5 months ago

Did you keep --foundation_model="medium" while restarting? It is essential for it to load the checkpoint properly.

jungsdao commented 5 months ago

The error persists even with --foundation_model="medium" among the keywords, as can be seen from the following error message.

Traceback (most recent call last):
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/cli/run_train.py", line 525, in main
    opt_start_epoch = checkpoint_handler.load_latest(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 215, in load_latest
    self.builder.load_checkpoint(state=state, checkpoint=checkpoint, strict=strict)
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 40, in load_checkpoint
    state.model.load_state_dict(checkpoint["model"], strict=strict)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ScaleShiftMACE:
    size mismatch for radial_embedding.bessel_fn.bessel_weights: copying a param with shape torch.Size([8]) from checkpoint, the shape in current model is torch.Size([10]).
    size mismatch for interactions.0.conv_tp_weights.layer0.weight: copying a param with shape torch.Size([8, 64]) from checkpoint, the shape in current model is torch.Size([10, 64]).
    size mismatch for interactions.0.skip_tp.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([65536]).
    size mismatch for interactions.0.skip_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for interactions.1.conv_tp_weights.layer0.weight: copying a param with shape torch.Size([8, 64]) from checkpoint, the shape in current model is torch.Size([10, 64]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hjung/miniforge3/envs/mace_env/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/cli/run_train.py", line 531, in main
    opt_start_epoch = checkpoint_handler.load_latest(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 210, in load_latest
    result = self.io.load_latest(swa=swa, device=device)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 171, in load_latest
    path = self._get_latest_checkpoint_path(swa=swa)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 149, in _get_latest_checkpoint_path
    latest_checkpoint_info = max(
                             ^^^^
ValueError: max() arg is an empty sequence
Traceback (most recent call last):
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/COH_dissociation.py", line 749, in <module>
    main(ecutwfc = ecutwfc, cv_range = args.cv_range, verbose=True)
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/COH_dissociation.py", line 739, in main
    get_mace(training_configs, mace_name=mace_name, mace_fit_params=mace_params, input_files=[prev_checkpoint], prev_checkpoint_file=prev_checkpoint, run_dir=f"MACE/fit_{fit_idx}")
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/COH_dissociation.py", line 408, in get_mace
    fit(in_configs, mace_name, mace_fit_params, mace_fit_cmd,
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/fit/mace.py", line 190, in fit
    raise exc
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/fit/mace.py", line 178, in fit
    subprocess.run(mace_fit_cmd, shell=True, check=True)
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '/home/hjung/miniforge3/envs/mace_env/bin/mace_run_train --E0s='{1: -14.9005442054276, 6: -162.973421385767, 8: -438.578998764142, 45: -3089.70420527816}' --energy_key='DFT_energy' --forces_key='DFT_forces' --foundation_model='medium' --correlation=3 --device='cuda' --ema --energy_weight=1 --forces_weight=10 --error_table='PerAtomMAE' --eval_interval=4 --max_L=1 --max_num_epochs=4200 --model='MACE' --name='umbrella' --num_channels=128 --num_interactions=2 --patience=30 --r_max=6.0 --restart_latest --save_cpu --scheduler_patience=15 --start_swa=4050 --swa --train_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/MACE/training_13.xyz' --batch_size=16 --valid_batch_size=16 --valid_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/MACE/validation_13.xyz'' returned non-zero exit status 1.
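
The size-mismatch lines above say that the saved parameters and the freshly built model disagree in shape (for instance 8 versus 10 Bessel weights, which suggests the two models were built with different radial basis sizes). A minimal sketch of how such a comparison could be done offline is given here. It works on plain dictionaries of shapes so that it runs without PyTorch; with PyTorch, such a dictionary could be built as {k: tuple(v.shape) for k, v in state_dict.items()}. The helper name shape_mismatches is hypothetical, and only two of the reported entries are copied in.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def shape_mismatches(ckpt: Dict[str, Shape], model: Dict[str, Shape]) -> List[str]:
    """List parameters whose shapes differ, including keys present on one side only."""
    report = []
    for name in sorted(ckpt.keys() | model.keys()):
        a, b = ckpt.get(name), model.get(name)
        if a != b:
            report.append(f"{name}: checkpoint={a} model={b}")
    return report


# Shapes copied from the error message above.
checkpoint_shapes = {
    "radial_embedding.bessel_fn.bessel_weights": (8,),
    "interactions.0.skip_tp.weight": (262144,),
}
model_shapes = {
    "radial_embedding.bessel_fn.bessel_weights": (10,),
    "interactions.0.skip_tp.weight": (65536,),
}
for line in shape_mismatches(checkpoint_shapes, model_shapes):
    print(line)
```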

ilyes319 commented 5 months ago

Can you paste your input script?

jungsdao commented 5 months ago

This is my MACE fitting command. Is this what you asked for?

mace_run_train --E0s='{1: -14.9005442054276, 6: -162.973421385767, 8: -438.578998764142, 45: -3089.70420527816}' --energy_key='DFT_energy' --forces_key='DFT_forces' --foundation_model='medium' --correlation=3 --device='cuda' --ema --energy_weight=1 --forces_weight=10 --error_table='PerAtomMAE' --eval_interval=4 --max_L=1 --max_num_epochs=4200 --model='MACE' --name='umbrella' --num_channels=128 --num_interactions=2 --patience=30 --r_max=6.0 --restart_latest --save_cpu --scheduler_patience=15 --start_swa=4050 --swa --train_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/MACE/training_13.xyz' --batch_size=16 --valid_batch_size=16 --valid_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/MACE/validation_13.xyz'

ilyes319 commented 5 months ago

Yes, thanks. Could you also share the log file for this run?

jungsdao commented 5 months ago

This is the log file: umbrella_run-123.log

I'm not sure it's related, but please note that I'm trying this on a checkpoint file from a run that did not start from a foundation model.

ilyes319 commented 5 months ago

OK, I misunderstood then. Are you trying to continue, on the foundation branch, a checkpoint that started training on the main branch? The problem here is that your input command does not match the one you used to start this checkpoint for some reason. Please try to keep a single run on the same branch as much as possible.
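
One way to check whether a restart command matches the command that created a checkpoint is to diff the flags of the two command lines. A minimal sketch follows, using abbreviated versions of the two commands quoted earlier in this thread. The helpers parse_flags and diff_flags are hypothetical, and the parser only understands the --key=value and bare --switch forms used here, not "--key value".

```python
import shlex
from typing import Dict, Optional, Tuple

ABSENT = "<absent>"  # marker for a flag that one command does not have


def parse_flags(command: str) -> Dict[str, Optional[str]]:
    """Map each --flag to its value (None for bare switches such as --swa)."""
    flags: Dict[str, Optional[str]] = {}
    for token in shlex.split(command)[1:]:  # skip the executable name
        if token.startswith("--"):
            key, sep, value = token[2:].partition("=")
            flags[key] = value if sep else None
    return flags


def diff_flags(old: str, new: str) -> Dict[str, Tuple[object, object]]:
    """Return {flag: (old_value, new_value)} for every flag that differs."""
    a, b = parse_flags(old), parse_flags(new)
    return {
        key: (a.get(key, ABSENT), b.get(key, ABSENT))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, ABSENT) != b.get(key, ABSENT)
    }


# Abbreviated forms of the two commands from this thread.
old_cmd = "mace_run_train --name='umbrella' --r_max=6.0 --restart_latest --swa"
new_cmd = (
    "mace_run_train --name='umbrella' --r_max=6.0 --restart_latest --swa "
    "--foundation_model='medium'"
)
print(diff_flags(old_cmd, new_cmd))
```

An empty result means the two commands pass the same flags; it does not guarantee that both branches interpret those flags, or their defaults, in the same way.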

jungsdao commented 5 months ago

OK, it seems that the MACE fit command on the main branch is not yet the same as the one on the foundation branch. I'll keep the two separate and not mix them up. Thank you for the clarification!

ACEsuit / mace

Error while using foundation branch #312