prckent opened this issue 2 years ago
I would consider changing the current behavior. Each subjob should run as if it is running standalone.
This was my expectation and what I think we should change the code to do as well.
By default, each group should use a different seed (the current behavior already satisfies this; it should remain true even if the user provides a seed, as this is the intended production use).
AFAIK, a given combination of (seed,#mpi) is deterministically reproducible. This clears the minimum bar for deterministic tests IMO (and also adds some entropy which deterministic tests are already short on).
If we want to be able to have each run produce identical results, I request that a new input flag be added (identical_ensemble?) to support this, since the use case is really only for testing and it would mess up production runs for people who have been providing a seed.
If we want each MPI group to be reproducible independent of ensemble size (i.e. g000 always produces the same results regardless of ensemble size), I would recommend the following: 1) have group 0 always use "seed" as provided, 2) have group 0 generate a list of random seeds for groups 1:N-1 and distribute them, 3) have all groups reset their respective seeds and then proceed with the run. This way, two runs with M and N groups, M<=N, would always produce matching results for groups 0:M-1.
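The scheme above can be sketched in a few lines. This is only an illustration of the proposed seed bookkeeping, not QMCPACK code: the function name `assign_group_seeds` and the use of Python's `random.Random` as the stand-in generator are assumptions for the example.

```python
import random

def assign_group_seeds(user_seed, num_groups):
    """Sketch of the proposed scheme: group 0 uses the user-provided seed
    as-is and deterministically derives seeds for groups 1..N-1."""
    rng = random.Random(user_seed)  # group 0's generator, seeded as provided
    # Group 0 draws one seed per remaining group, in group order, so the
    # seed list for M groups is a prefix of the list for any N >= M groups.
    return [user_seed] + [rng.randrange(2**31) for _ in range(num_groups - 1)]

# Runs with M and N groups (M <= N) agree on the first M group seeds:
seeds_4 = assign_group_seeds(42, 4)
seeds_8 = assign_group_seeds(42, 8)
assert seeds_8[:4] == seeds_4
```

The prefix property is what makes g000 (and every group below M) reproducible regardless of how large the ensemble is.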
When the seed is not given, how to initialize the seed for each group can be discussed separately. When the seed is given in the input file, it should be respected by each group, not only the first group.
Considering that some groups may have a seed in their input while others don't, I don't see a good reason to assume that all the groups can collectively decide how to arrange seeds.
The use cases are either twist averaging or an ensemble of similar molecules. In either case, a different seed per process is desired in production runs (priority).
Reproducibility with a provided seed is also sometimes desired in production, and this is also covered by current functionality.
The only benefit I see from making changes is to make testing easier. This can be done without messing other things up that already work (i.e. by not adding statistical correlation in the production ensemble by default).
Making the entire user base provide distinct seeds in the "ensemble" input files (required to get reproducible and statistically correct production results) just to make writing a handful of tests easier is not a good move.
Our efforts would be better spent trying to merge down to a single input file for ensemble runs rather than increasing the divergence between the current multiple input files in use.
As far as testing and the above comments go, I think we can reasonably add some deterministic and statistical tests without any major work and without changing the C++. This would be enough to verify that the feature is nominally working, e.g. checking that the expected number of samples is obtained is easily done deterministically. And we should document the current algorithm to reduce surprises.
=> We don't have to make any changes to the algorithm now. Jaron's comments do remind me that the random number initialization in non-ensemble runs is more of a problem. If we had a better algorithm for this, the concern about introduced correlations would be lessened. (e.g. https://numpy.org/doc/stable/reference/random/parallel.html lists multiple approaches)
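For reference, one of the approaches on that NumPy page is `SeedSequence.spawn`, which derives statistically independent child seeds from a single user-provided seed. The seed value 12345 and the mapping of one child per MPI group are illustrative assumptions, not anything from the QMCPACK code:

```python
import numpy as np

# SeedSequence.spawn derives independent child seed sequences from one
# root seed, avoiding hand-rolled per-group seed arithmetic.
root = np.random.SeedSequence(12345)   # 12345: arbitrary example seed
children = root.spawn(4)               # e.g. one child per MPI group
streams = [np.random.default_rng(s) for s in children]

# Each stream is reproducible from (root seed, group index) alone.
samples = [rng.random() for rng in streams]
```

Repeating the construction with the same root seed reproduces the same per-group streams, which is the property the deterministic tests would rely on.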
The purpose of this issue is to discuss if the following behavior is to be considered as a bug or a feature, and then what to do/not do about it.
While setting up minimal tests for the ensemble / batched run functionality (#4091, #4093), I noticed that the treatment of random number seeds is different in ensemble runs. The merged PRs currently check only for a crash and the statistical results are not yet verified.
The key difference is that in an ensemble run the seeding and/or random number use is different, so that with a fixed seed and the same inputs, every run in the ensemble will do a distinct QMC run. This applies even to the first input, which will give different results from when it is run independently. The results also depend on the size of the ensemble.
This historical choice has the consequence that none of the deterministic tests can be used to check ensemble runs, and more generally that someone using fixed seeds for reproducibility and only using ensembles for HPC throughput reasons will not get the results that they expect.
Hopefully no one has been caught out by this. Clearly the behavior needs to be documented. The question is then whether we should change the behavior and what behavior(s) would best suit different workflows.
The following illustrates the behavior with different ensemble sizes. If the seeds were treated consistently, every energy would be -10.528057.