Closed bernstei closed 3 weeks ago
The batches used to evaluate the validation error, at least on the multi-head-finetune branch, are not deterministic.
Here is an example output of `num_atoms` from the first batch passed to `weighted_mean_squared_error_energy` in consecutive epochs (24 and 25) of a run:
```
BOB weighted_mean_squared_error_energy num_atoms tensor([ 16, 60, 54, 7, 48, 19, 56, 16, 88, 23, 36, 4, 28, 19, 14, 12, 40, 8, 12, 100, 6, 56, 40, 52, 46, 36, 5, 144, 4, 16, 19, 100], device='cuda:0')
```

and

```
BOB weighted_mean_squared_error_energy num_atoms tensor([ 36, 19, 40, 16, 56, 54, 8, 36, 46, 100, 24, 4, 1, 4, 36, 22, 128, 8, 7, 28, 12, 14, 100, 19, 56, 37, 144, 8, 16, 8, 50, 40], device='cuda:0')
```
The suspected cause (@RokasEl on Slack) is https://github.com/ACEsuit/mace/blob/346999c8eeefd806d882cb4c978ff52a7fff625d/mace/cli/run_train.py#L437
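To illustrate the failure mode (a sketch only, not the actual mace code): if the validation batches are drawn through a shuffled sampler without a fixed seed, the batch composition, and hence `num_atoms` per batch, changes between epochs, exactly as in the two tensors above. The hypothetical `make_batches` helper below shows why disabling shuffling (or fixing the seed) makes the validation batches reproducible.

```python
import random

def make_batches(dataset, batch_size, shuffle, seed=None):
    """Partition dataset indices into batches.

    With shuffle=True and seed=None the batch composition differs
    between calls, mimicking the nondeterministic validation batches
    reported above; shuffle=False (or a fixed seed) is deterministic.
    """
    idx = list(range(len(dataset)))
    if shuffle:
        rng = random.Random(seed)  # seed=None -> fresh, unseeded order
        rng.shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]

data = list(range(100))

# Deterministic validation: same batches every epoch.
epoch_24 = make_batches(data, 32, shuffle=False)
epoch_25 = make_batches(data, 32, shuffle=False)
assert epoch_24 == epoch_25
```

The same reasoning applies to a `torch.utils.data.DataLoader`: for validation, either pass `shuffle=False` or supply a generator with a fixed seed, so that per-batch quantities like `num_atoms` are comparable across epochs.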
Was this closed by 122a50c?
@ilyes319 you seem to have fixed it in the commit above, but maybe without a PR to close this issue? Should we close it manually?
yep I will close