Closed bernstei closed 3 weeks ago
The batches used to evaluate the validation error, at least on the multi-head-finetune branch, are not deterministic.
Here is an example output of `num_atoms` from the first batch passed to `weighted_mean_squared_error_energy` in consecutive epochs (24 and 25) of a run:
```
BOB weighted_mean_squared_error_energy num_atoms tensor([ 16, 60, 54, 7, 48, 19, 56, 16, 88, 23, 36, 4, 28, 19, 14, 12, 40, 8, 12, 100, 6, 56, 40, 52, 46, 36, 5, 144, 4, 16, 19, 100], device='cuda:0')
```

and

```
BOB weighted_mean_squared_error_energy num_atoms tensor([ 36, 19, 40, 16, 56, 54, 8, 36, 46, 100, 24, 4, 1, 4, 36, 22, 128, 8, 7, 28, 12, 14, 100, 19, 56, 37, 144, 8, 16, 8, 50, 40], device='cuda:0')
```
The suspected cause (@RokasEl on Slack) is https://github.com/ACEsuit/mace/blob/346999c8eeefd806d882cb4c978ff52a7fff625d/mace/cli/run_train.py#L437
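To illustrate the failure mode (a sketch only, not the actual mace code): if the validation batches are drawn through a shuffled sampler without a fixed seed, the batch composition, and hence `num_atoms` per batch, changes between epochs, exactly as in the two tensors above. The hypothetical `make_batches` helper below shows why disabling shuffling (or fixing the seed) makes the validation batches reproducible.

```python
import random

def make_batches(dataset, batch_size, shuffle, seed=None):
    """Partition dataset indices into batches.

    With shuffle=True and seed=None the batch composition differs
    between calls, mimicking the nondeterministic validation batches
    reported above; shuffle=False (or a fixed seed) is deterministic.
    """
    idx = list(range(len(dataset)))
    if shuffle:
        rng = random.Random(seed)  # seed=None -> fresh, unseeded order
        rng.shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]

data = list(range(100))

# Deterministic validation: same batches every epoch.
epoch_24 = make_batches(data, 32, shuffle=False)
epoch_25 = make_batches(data, 32, shuffle=False)
assert epoch_24 == epoch_25
```

The same reasoning applies to a `torch.utils.data.DataLoader`: for validation, either pass `shuffle=False` or supply a generator with a fixed seed, so that per-batch quantities like `num_atoms` are comparable across epochs.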
Was this closed by 122a50c?
@ilyes319 you seem to have fixed it in the commit above, but maybe without a PR to close this issue? Should we close it manually?
yep I will close