Closed jungsdao closed 5 months ago
Did you keep --foundation_model="medium" while restarting? It is essential for loading the checkpoint properly.
The error persists even with --foundation_model="medium" among the keywords, as the following error message shows.
Traceback (most recent call last):
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/cli/run_train.py", line 525, in main
opt_start_epoch = checkpoint_handler.load_latest(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 215, in load_latest
self.builder.load_checkpoint(state=state, checkpoint=checkpoint, strict=strict)
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 40, in load_checkpoint
state.model.load_state_dict(checkpoint["model"], strict=strict) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ScaleShiftMACE:
size mismatch for radial_embedding.bessel_fn.bessel_weights: copying a param with shape torch.Size([8]) from checkpoint, the shape in current model is torch.Size([10]).
size mismatch for interactions.0.conv_tp_weights.layer0.weight: copying a param with shape torch.Size([8, 64]) from checkpoint, the shape in current model is torch.Size([10, 64]).
size mismatch for interactions.0.skip_tp.weight: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([65536]).
size mismatch for interactions.0.skip_tp.output_mask: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for interactions.1.conv_tp_weights.layer0.weight: copying a param with shape torch.Size([8, 64]) from checkpoint, the shape in current model is torch.Size([10, 64]).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hjung/miniforge3/envs/mace_env/bin/mace_run_train", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/cli/run_train.py", line 531, in main
opt_start_epoch = checkpoint_handler.load_latest(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 210, in load_latest
result = self.io.load_latest(swa=swa, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 171, in load_latest
path = self._get_latest_checkpoint_path(swa=swa)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/mace/tools/checkpoint.py", line 149, in _get_latest_checkpoint_path
latest_checkpoint_info = max(
^^^^
ValueError: max() arg is an empty sequence
Traceback (most recent call last):
File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/COH_dissociation.py", line 749, in <module>
main(ecutwfc = ecutwfc, cv_range = args.cv_range, verbose=True)
File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/COH_dissociation.py", line 739, in main
get_mace(training_configs, mace_name=mace_name, mace_fit_params=mace_params, input_files=[prev_checkpoint], prev_checkpoint_file=prev_checkpoint, run_dir=f"MACE/fit_{fit_idx}")
File "/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/COH_dissociation.py", line 408, in get_mace
fit(in_configs, mace_name, mace_fit_params, mace_fit_cmd,
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/fit/mace.py", line 190, in fit
raise exc
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/fit/mace.py", line 178, in fit
subprocess.run(mace_fit_cmd, shell=True, check=True)
File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '/home/hjung/miniforge3/envs/mace_env/bin/mace_run_train --E0s='{1: -14.9005442054276, 6: -162.973421385767, 8: -438.578998764142, 45: -3089.70420527816}' --energy_key='DFT_energy' --forces_key='DFT_forces' --foundation_model='medium' --correlation=3 --device='cuda' --ema --energy_weight=1 --forces_weight=10 --error_table='PerAtomMAE' --eval_interval=4 --max_L=1 --max_num_epochs=4200 --model='MACE' --name='umbrella' --num_channels=128 --num_interactions=2 --patience=30 --r_max=6.0 --restart_latest --save_cpu --scheduler_patience=15 --start_swa=4050 --swa --train_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/MACE/training_13.xyz' --batch_size=16 --valid_batch_size=16 --valid_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/MACE/validation_13.xyz'' returned non-zero exit status 1.
Can you paste your input script?
This is my MACE fitting command. Is this what you asked for?
mace_run_train --E0s='{1: -14.9005442054276, 6: -162.973421385767, 8: -438.578998764142, 45: -3089.70420527816}' --energy_key='DFT_energy' --forces_key='DFT_forces' --foundation_model='medium' --correlation=3 --device='cuda' --ema --energy_weight=1 --forces_weight=10 --error_table='PerAtomMAE' --eval_interval=4 --max_L=1 --max_num_epochs=4200 --model='MACE' --name='umbrella' --num_channels=128 --num_interactions=2 --patience=30 --r_max=6.0 --restart_latest --save_cpu --scheduler_patience=15 --start_swa=4050 --swa --train_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/MACE/training_13.xyz' --batch_size=16 --valid_batch_size=16 --valid_file='/work/home/hjung/Calculation/4_Free_energy_calculation/1_Rh/1_COH/MACE/validation_13.xyz'
Yes, thanks. Could you also share the log file for this run?
This is the log file: umbrella_run-123.log
I'm not sure whether it's related, but please note that I'm trying this on a checkpoint file that was not started from a foundation model.
OK, I misunderstood then. Are you trying to continue, on the foundation branch, a checkpoint that started training on the main branch? The problem here is that, for some reason, your input command does not match the one you used to start this checkpoint. Please try to do a single run on the same branch as much as possible.
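To see which parameters disagree between a saved checkpoint and the model built from the current command, the shapes can be compared by name. The following is only a minimal sketch: `shape_mismatches` is a hypothetical helper, not part of MACE, and the shapes are copied by hand from the error message in this thread rather than read from the actual checkpoint file.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for every parameter
    that exists in both mappings but with a different shape."""
    shared = checkpoint_shapes.keys() & model_shapes.keys()
    return {
        name: (tuple(checkpoint_shapes[name]), tuple(model_shapes[name]))
        for name in shared
        if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])
    }


# Shapes copied from the RuntimeError reported in this thread.
# Assumption: with PyTorch available, real shapes could instead be collected
# with something like
#   {k: tuple(v.shape) for k, v in torch.load(path, map_location="cpu")["model"].items()}
# since the traceback shows the weights stored under checkpoint["model"].
checkpoint_shapes = {
    "radial_embedding.bessel_fn.bessel_weights": (8,),
    "interactions.0.skip_tp.weight": (262144,),
}
model_shapes = {
    "radial_embedding.bessel_fn.bessel_weights": (10,),
    "interactions.0.skip_tp.weight": (65536,),
}

mismatches = shape_mismatches(checkpoint_shapes, model_shapes)
for name, (ckpt_shape, cur_shape) in sorted(mismatches.items()):
    print(f"{name}: checkpoint {ckpt_shape} vs current model {cur_shape}")
```

Any parameter listed here indicates that the architecture implied by the restart command differs from the one that produced the checkpoint, which is the situation described above.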
OK, it seems that the MACE fit command on the main branch is not yet the same as the one on the foundation branch. I'll keep the two separate and not mix them up. Thank you for the clarification!
There's another issue with the checkpoint file. I was running a script that worked fine with the main branch code, but when I switch to the foundations branch, I get the following error message. I think this error happened while trying to read the checkpoint file of a previous run. The checkpoint file is located in the folder as
checkpoints/umbrella_run-123_epoch-3892_swa.pt
Am I missing something again, or is it some sort of bug in the foundation branch? Any comment would be appreciated. Thank you.