ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

checkpoint file not read in foundation branch. #311

Closed jungsdao closed 8 months ago

jungsdao commented 8 months ago

I wanted to fine-tune a model starting from the foundation model in the foundation branch. I provided some training structures to the foundation model and obtained a model and a checkpoint file. With these in hand, I wanted to add more structures and, this time, start from the previous checkpoint file rather than from the foundation model's checkpoint again. However, the run does not seem to recognize the previous checkpoint file, which I have already copied into place. I think the checkpoint file used to be read automatically if it was in the checkpoint directory, but in the foundation branch it is not being read properly. Is this a bug in the foundation branch? Below is my fitting command, from which I have removed --foundation_model="medium" so as not to start from the foundation model checkpoint. Many thanks in advance!

mace_run_train \
  --name="umbrella" \
  --energy_key="DFT_energy" \
  --forces_key="DFT_forces" \
  --train_file="training_12.xyz" \
  --valid_fraction=0.1 \
  --E0s="{1:-14.9005442054276, 6:-162.973421385767, 8:-438.578998764142, 45:-3089.70420527816}" \
  --r_max=6.0 \
  --energy_weight=1.0 \
  --forces_weight=1.0 \
  --lr=0.01 \
  --scaling="rms_forces_scaling" \
  --batch_size=16 \
  --max_num_epochs=400 \
  --start_swa=300 \
  --swa \
  --ema \
  --ema_decay=0.99 \
  --amsgrad \
  --error_table='PerAtomMAE' \
  --default_dtype="float64" \
  --device="cuda" \
  --save_cpu \
  --seed=3

ilyes319 commented 8 months ago

What is the name of your first run? You need to keep the same name. Also, you should keep --foundation_model="medium" to get the right hyperparameters for continuing your training. It will override the foundation model with the checkpoint if it finds one.

jungsdao commented 8 months ago

The name of the first run was "umbrella", the same as in the command line. I have kept --foundation_model="medium", but it is still not recognizing the previous checkpoint file. If it had recognized it, training should have started from around epoch 196, since the previous checkpoint filename is umbrella_run-3_epoch-196_swa.pt, but it starts from epoch 0. Also, one line of the log file says "2024-01-31 13:30:10.354 INFO: Using foundation model medium as initial checkpoint.", so it is probably starting not from umbrella_run-3_epoch-196_swa.pt but from the foundation model's checkpoint.
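The expected resume epoch mentioned above can be read straight from the checkpoint filename. A minimal sketch, assuming the `<name>_run-<seed>_epoch-<N>` naming seen in this thread; `checkpoint_epoch` is a hypothetical helper, not part of MACE, and the `checkpoints/` path is an assumed location:

```shell
# Hypothetical helper (not part of MACE): print the epoch number encoded in a
# checkpoint filename such as umbrella_run-3_epoch-196_swa.pt.
# Prints nothing if the filename does not contain "_epoch-<digits>".
checkpoint_epoch() {
  basename "$1" | sed -n 's/.*_epoch-\([0-9][0-9]*\).*/\1/p'
}

# Assumed path; adjust to wherever the checkpoint was copied.
checkpoint_epoch "checkpoints/umbrella_run-3_epoch-196_swa.pt"
```

If the epoch printed here does not match the epoch at which the new log starts, the checkpoint was most likely not picked up.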

ilyes319 commented 8 months ago

Can you share the log files for the three runs?

jungsdao commented 8 months ago

This is the log file for the run starting from the foundation model: umbrella_run-3.log

And the following is the log file for the run that I wanted to start from the previous run's checkpoint file (it was interrupted during training): umbrella_run-3.log

ilyes319 commented 8 months ago

Can you make sure you included --restart_latest in your input?

jungsdao commented 8 months ago

Oh, actually it was my mistake: I was missing --restart_latest. Sorry for the trouble! After adding that keyword, it starts from epoch 196, as in this log file: umbrella_run-3.log

But I see there is an "Epoch None" at the very beginning when I work in the foundation branch, which was not the case in the main branch. Does it have any meaning? Anyway, I appreciate you pointing out my mistake!

ilyes319 commented 8 months ago

Nice! Epoch None corresponds to the state before the first new epoch. We added that to better track the fine-tuning.
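For reference, the resolution of this thread can be summarized as one command. This is a sketch assembled from the messages above, not a verified recipe for any particular MACE version. It keeps the same --name as the first run, keeps --foundation_model="medium", and adds --restart_latest. Every other flag is copied unchanged from the original post.

```shell
# Resume fine-tuning from the latest checkpoint of the run named "umbrella".
# Changes relative to the original post: --foundation_model and --restart_latest.
mace_run_train \
  --name="umbrella" \
  --foundation_model="medium" \
  --restart_latest \
  --energy_key="DFT_energy" \
  --forces_key="DFT_forces" \
  --train_file="training_12.xyz" \
  --valid_fraction=0.1 \
  --E0s="{1:-14.9005442054276, 6:-162.973421385767, 8:-438.578998764142, 45:-3089.70420527816}" \
  --r_max=6.0 \
  --energy_weight=1.0 \
  --forces_weight=1.0 \
  --lr=0.01 \
  --scaling="rms_forces_scaling" \
  --batch_size=16 \
  --max_num_epochs=400 \
  --start_swa=300 \
  --swa \
  --ema \
  --ema_decay=0.99 \
  --amsgrad \
  --error_table='PerAtomMAE' \
  --default_dtype="float64" \
  --device="cuda" \
  --save_cpu \
  --seed=3
```

With the checkpoint from this thread in place, the log should then start from epoch 196 rather than epoch 0.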