Hello, could you share your full log file? That would greatly help. Thank you!
Hi, here it is, many thanks!
Hello, I see that this request has been grouped with similar issues related to multi-head finetuning. I just want to add that I had tried the same training with multiheads_finetuning=False but still got the same error. Not sure if that helps at all.
Many thanks!
I think the problem is related to swa and checkpoint handling.
Here's what I assume is happening:
You specify swa_start, but never enable swa itself (which should ideally not be possible). So the checkpoint handler gets a start value and saves SWA checkpoints from epoch 300 onwards. SWA is never actually used (or mentioned in the log), but the checkpoints after epoch 300 are still logged under the SWA name.
Because old checkpoints are deleted unless specified otherwise (--keep_checkpoints), the log shows that at epoch 300 the old non-SWA checkpoint is deleted and a new "fake-SWA" checkpoint is created. Therefore, when MACE searches for a non-SWA checkpoint at the end of training, it can't find one and throws an error.
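To make the failure mode concrete, here is a minimal, self-contained Python sketch (not MACE's actual code; the function names and filename pattern are invented for illustration) of how a checkpoint handler that tags checkpoints purely by the start epoch ends up leaving only SWA-tagged files on disk:

```python
# Toy illustration of the described behaviour -- not MACE's real checkpoint code.
import os
import tempfile


def save_checkpoint(directory, epoch, swa_start, keep_old=False):
    # The tag is chosen from the start epoch alone, regardless of whether SWA runs.
    tag = "swa" if swa_start is not None and epoch >= swa_start else "plain"
    if not keep_old:  # old checkpoints are deleted unless --keep_checkpoints is set
        for name in os.listdir(directory):
            os.remove(os.path.join(directory, name))
    path = os.path.join(directory, f"checkpoint_{tag}_epoch-{epoch}.pt")
    open(path, "w").close()  # stand-in for writing the model state
    return path


def load_final_model(directory):
    # At the end of training a non-SWA checkpoint is looked up, but only
    # "swa"-tagged files are left on disk after epoch 300.
    plain = [n for n in os.listdir(directory) if "_plain_" in n]
    if not plain:
        raise RuntimeError("no non-SWA checkpoint found")
    return plain[0]


with tempfile.TemporaryDirectory() as tmp:
    for epoch in (100, 200, 300, 400):
        save_checkpoint(tmp, epoch, swa_start=300)
    try:
        load_final_model(tmp)
    except RuntimeError as err:
        print(err)  # analogous to the error reported at the end of training
```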
Can you please try running training with --swa as an additional keyword?
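For example (illustrative only: the model name and epoch values are placeholders, and the exact spelling of the start-epoch flag may differ between MACE versions, so please check mace_run_train --help):

```bash
# Keep your existing options and simply add --swa so SWA is actually enabled
# alongside the start epoch.
mace_run_train \
    --name="my_model" \
    --max_num_epochs=500 \
    --start_swa=300 \
    --swa
```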
Hm, can you propose a better way of handling the various option states?
As a user, it is reasonable to expect that setting --swa_start to a value results in swa actually being used during training from the set epoch onwards.
The mildest solution would be that if args.swa_start is set to a value, there is a check whether swa itself is activated, and a warning is printed if it isn't.
The more radical approach would be scrapping args.swa completely and running swa if a starting epoch is set, and not running it if it is None. To me that looks like an efficient solution, but there might be some consequences that I am not aware of.
The in-between solution would be automatically setting args.swa to True if args.start_swa was set, and logging it. A rough sketch of that is below.
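A minimal sketch of what that in-between option could look like, assuming the argparse names used in this thread (args.swa, args.start_swa); this is not MACE's actual code, just an illustration:

```python
# Hypothetical sketch of the "in-between" fix, not MACE's actual implementation:
# if a SWA start epoch is given without --swa, enable SWA and warn the user.
import logging


def reconcile_swa_options(args):
    """Keep the SWA flags consistent before training starts."""
    if getattr(args, "start_swa", None) is not None and not args.swa:
        logging.warning(
            "start_swa=%d was given but --swa was not set; enabling SWA "
            "automatically so checkpoints and the final model stay consistent.",
            args.start_swa,
        )
        args.swa = True
    return args
```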
Hello, just want to confirm that adding --swa does indeed solve my error.
Many thanks!
Hi, hope you are doing well.
Quite often, at the end of my training, I get this error:
and as a result, I do not get a model from the training.
I have not found out what causes this, as I usually use the same input script and only vary the model size. I noticed a similar issue, but that one was caused by the training stopping before the swa starting epoch, which is not the case for me. I had also tried increasing max_num_epochs and restarting the training, but once I reached that epoch, the same error occurred.
My inputs to the training are:
(I have checked that my MACE version is 0.3.7 this time!)
Many thanks in advance!