Open LucasCampos opened 3 years ago
Also worth noting: the point in question had already been suggested in an earlier batch (day 5), as can be seen in the log
Day: 5
(...)
Point for worker 3: {'learning_rate': 0.05191107915156722, 'batch_size': 2, 'learning_decay': 0.8634180716999259, 'learning_decay_steps': 4500} . Simulated loss: 0.2328946307866975
Hi @LucasCampos, thanks for bringing this up
@teytaud NGOpt uses DoubleFastGADiscreteOnePlusOne in this case (dimension 4, discontinuous, non-noisy). Can it happen that it does not mutate anything from time to time? I don't think that is expected.
This is clearly not expected. Thanks a lot @LucasCampos, I'll investigate immediately.
OK, the diagnosis is simple: in the parallel case, nothing prevents two mutations from being exactly equal. We prevent the child from being identical to the parent, but not from being identical to other children. A fix should be made in base.py, along the lines of "if the function is deterministic, then asking for the same point twice is pointless".
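The diagnosis can be illustrated with a small standalone sketch. This is not nevergrad's code: `mutate` and `ask_batch` are hypothetical names, and the mutation is a simplified discrete one assumed purely for illustration. It shows that rejecting a child equal to the parent is not enough when several children are drawn from the same parent; siblings have to be compared with each other as well.

```python
import random


def mutate(parent, arity, rng):
    """Resample each coordinate with probability 1/len(parent).

    Retries until the child differs from the parent (the only check
    described as existing in the diagnosis above).
    """
    while True:
        child = tuple(
            rng.randrange(arity) if rng.random() < 1.0 / len(parent) else x
            for x in parent
        )
        if child != parent:
            return child


def ask_batch(parent, num_workers, arity, rng, dedupe_siblings, max_tries=1000):
    """Draw one child per worker from the same parent.

    With dedupe_siblings=True, a child equal to an earlier sibling is
    redrawn. If max_tries is exhausted, the last child is kept even if
    it is a duplicate, so the loop always terminates.
    """
    batch = []
    seen = set()
    for _worker in range(num_workers):
        for _attempt in range(max_tries):
            child = mutate(parent, arity, rng)
            if not dedupe_siblings or child not in seen:
                break
        seen.add(child)
        batch.append(child)
    return batch


if __name__ == "__main__":
    parent = (0, 0, 0, 0)  # dimension 4, as in this issue
    for flag in (False, True):
        batch = ask_batch(parent, 4, 3, random.Random(0), dedupe_siblings=flag)
        print(flag, len(set(batch)), "distinct points out of", len(batch))
```

With `dedupe_siblings=False`, repeated points show up regularly across batches in a small discrete space, which matches the behaviour reported here. Whether such a check is appropriate for noisy functions is a separate question.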
@LucasCampos just to let you know, I'll take some time to review @teytaud's PR because of the range of problems it could cause (solving this for deterministic functions could break other things for noisy functions, so it's a bit tricky). Also, caching the past values on our side could become too big to handle in some cases, depending on the application, so there's no easy one-size-fits-all solution :s
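To make the memory concern concrete, here is a minimal sketch of one possible compromise: remembering only a bounded number of recently seen points. `RecentPoints` is a hypothetical helper, not part of nevergrad, and it assumes that points are flat dicts with hashable values, like the ones in the log above.

```python
from collections import OrderedDict


class RecentPoints:
    """Remember at most `capacity` recently seen points (hypothetical helper)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._points = OrderedDict()

    def seen(self, point):
        """Return True if `point` is remembered; otherwise record it and return False."""
        key = tuple(sorted(point.items()))  # order-independent, hashable key
        if key in self._points:
            self._points.move_to_end(key)  # refresh: most recently seen goes last
            return True
        self._points[key] = None
        if len(self._points) > self.capacity:
            self._points.popitem(last=False)  # evict the least recently seen point
        return False
```

The trade-off is explicit: memory stays bounded, but a duplicate of a point older than `capacity` entries goes undetected. It also only makes sense for deterministic functions, since re-evaluating a point can be useful on noisy ones.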
@jrapin, this is not a big problem for me. Is there any way I, as a non-dev, can help with solving the issue?
Thanks for taking this issue seriously.
I was simulating an optimization with four workers (GPUs), which run a function (fake_training) asynchronously over 20 days. Each new day, the four workers get four new points to study. However, after a few days, the optimizer starts outputting the same point for two workers. There is no noise in the target function. I am using nevergrad version 0.4.2_post5.
Steps to reproduce
Observed Results
On day 9, the points for workers 0 and 2 are the same, as seen below (full log attached)
Full log: log.txt
Expected Results
I would have expected that each worker receives a completely independent phase-space point.
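This expectation can be checked on each day's batch with a few lines of plain Python. The sketch below is independent of nevergrad; `find_duplicate_workers` is a hypothetical helper name, and it assumes each point is a flat dict of hyperparameters, as printed in the log.

```python
from collections import defaultdict


def find_duplicate_workers(points):
    """Group worker indices that received exactly the same point.

    `points` is a list of dicts, one per worker, in worker order.
    Returns one list of worker indices per point shared by two or more
    workers; an empty list means all points are distinct.
    """
    workers_by_point = defaultdict(list)
    for worker, point in enumerate(points):
        key = tuple(sorted(point.items()))  # order-independent, hashable key
        workers_by_point[key].append(worker)
    return [workers for workers in workers_by_point.values() if len(workers) > 1]
```

Exact equality is used on purpose: the duplicates reported here are bit-for-bit identical, so no tolerance is needed. A return value such as `[[0, 2]]` would mean that workers 0 and 2 were handed the same point.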
Relevant Code