facebookresearch / nevergrad

A Python toolbox for performing gradient-free optimization
https://facebookresearch.github.io/nevergrad/
MIT License

Ask-and-tell interface outputting repeated points #976

Open LucasCampos opened 3 years ago

LucasCampos commented 3 years ago

I was simulating an optimization with four workers (GPUs) that runs a function (fake_training) asynchronously over 20 days. Each day, the four workers get four new points to evaluate. However, after a few days, the optimizer starts outputting the same point for two workers.

There is no noise in the target function. I am using nevergrad version 0.4.2.post5.

Steps to reproduce

  1. Run the code below
  2. Check the 1st and 3rd lines of day 9

Observed Results

On day 9, the results for workers 0 and 2 are the same, as seen below (full log attached):

Day: 9
Point for worker 0: {'learning_rate': 0.05191107915156722, 'batch_size': 2, 'learning_decay': 0.8634180716999259, 'learning_decay_steps': 4500} . Simulated loss: 0.2328946307866975
Point for worker 1: {'learning_rate': 0.05191107915156722, 'batch_size': 2, 'learning_decay': 0.8634180716999259, 'learning_decay_steps': 4242} . Simulated loss: 66564.2328946308
Point for worker 2: {'learning_rate': 0.05191107915156722, 'batch_size': 2, 'learning_decay': 0.8634180716999259, 'learning_decay_steps': 4500} . Simulated loss: 0.2328946307866975
Point for worker 3: {'learning_rate': 0.05191107915156722, 'batch_size': 1, 'learning_decay': 0.8634180716999259, 'learning_decay_steps': 4500} . Simulated loss: 1.2328946307866975

Full log: log.txt

Expected Results

I would have expected each worker to receive a distinct point in phase space.

Relevant Code

#! /usr/bin/env python

import nevergrad as ng
import numpy as np

def fake_training(learning_rate: float, batch_size: int, learning_decay: float, learning_decay_steps: int) -> float:
    # Deterministic toy loss, minimized at (0.2, 2, 0.98, 4500)
    return 10*(learning_rate - 0.2)**2 + (batch_size - 2)**2 + (learning_decay_steps - 4500)**2 + (learning_decay - 0.98)**2

np.random.seed(320)
parametrization = ng.p.Instrumentation(
    learning_rate=ng.p.Log(lower=0.0001, upper=1.0),
    batch_size=ng.p.Scalar(lower=1, upper=2).set_integer_casting(),
    learning_decay=ng.p.Scalar(lower=0.5, upper=1),
    learning_decay_steps=ng.p.Scalar(lower=4000, upper=5000).set_integer_casting(),
)

gpus = 4
budget_days = 20
bug_days = 9
budget = gpus*budget_days
optim = ng.optimizers.NGOpt(parametrization=parametrization, budget=budget)

for day in range(bug_days):
    # Get new points
    inputs = []
    losses = []
    print(f"Day: {day+1}")
    for g in range(gpus):
        x = optim.ask()
        y = fake_training(**x.kwargs)
        inputs.append(x)
        losses.append(y)

        print(f"Point for worker {g}:", x.kwargs, ". Simulated loss:", y)

    # Report all losses only after the full batch was asked, as in async use
    for x, y in zip(inputs, losses):
        optim.tell(x, y)
LucasCampos commented 3 years ago

Also worth noting: the point in question was also suggested in a previous batch (day 5), as can be seen in the log.

Day: 5
(...)
Point for worker 3: {'learning_rate': 0.05191107915156722, 'batch_size': 2, 'learning_decay': 0.8634180716999259, 'learning_decay_steps': 4500} . Simulated loss: 0.2328946307866975
jrapin commented 3 years ago

Hi @LucasCampos, thanks for bringing this up. @teytaud: NGOpt uses DoubleFastGADiscreteOnePlusOne in this case (dimension 4, discontinuous, non-noisy). Can it happen that it does not mutate anything from time to time? I don't think that is expected.
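
For reference, one way to see which algorithm NGOpt delegates to is to inspect the wrapped optimizer. This is a small sketch, assuming the `optim` property on NGOpt (which exposes the underlying optimizer in recent nevergrad versions); it reuses the search space from the repro script above.

import nevergrad as ng

# Same search space as in the repro script above.
parametrization = ng.p.Instrumentation(
    learning_rate=ng.p.Log(lower=0.0001, upper=1.0),
    batch_size=ng.p.Scalar(lower=1, upper=2).set_integer_casting(),
    learning_decay=ng.p.Scalar(lower=0.5, upper=1),
    learning_decay_steps=ng.p.Scalar(lower=4000, upper=5000).set_integer_casting(),
)
optim = ng.optimizers.NGOpt(parametrization=parametrization, budget=80, num_workers=4)
# `optim.optim` is assumed to expose the optimizer NGOpt delegates to,
# e.g. DoubleFastGADiscreteOnePlusOne for this discrete, non-noisy problem.
print(optim.optim)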

teytaud commented 3 years ago

This is clearly not expected. Thanks a lot @LucasCampos, I will investigate immediately.

teytaud commented 3 years ago

OK, the diagnosis is simple: in the parallel case, nothing prevents two mutations from being exactly equal. We ensure that a child differs from its parent, but not that it differs from the other children. A fix should be made in base.py, along the lines of "if the function is deterministic, asking the same point twice is pointless".
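
Until a fix lands, a user-side workaround is possible for deterministic objectives: re-ask whenever a freshly asked candidate duplicates one already in the current batch. Below is a minimal sketch, not nevergrad API; `ask_distinct` and its parameters are illustrative names, and it assumes that discarding un-told candidates is acceptable for the optimizer in use.

def ask_distinct(optim, num_points, max_retries=10):
    # Illustrative helper (not nevergrad API): collect `num_points` candidates,
    # re-asking whenever a candidate duplicates one already in the batch.
    # If max_retries is exhausted, the duplicate is kept anyway.
    batch, seen = [], set()
    for _ in range(num_points):
        candidate = optim.ask()
        retries = 0
        while repr((candidate.args, candidate.kwargs)) in seen and retries < max_retries:
            candidate = optim.ask()  # mutations are stochastic, so a re-ask may differ
            retries += 1
        seen.add(repr((candidate.args, candidate.kwargs)))
        batch.append(candidate)
    return batch

In the repro loop above, the four individual optim.ask() calls would be replaced by something like points = ask_distinct(optim, gpus).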

teytaud commented 3 years ago

https://github.com/facebookresearch/nevergrad/pull/1001/files

jrapin commented 3 years ago

@LucasCampos just to let you know, I'll take some time to review @teytaud's PR because of the range of problems it could itself raise (solving this for deterministic functions could break behavior on noisy functions, so it's a bit tricky). Also, caching past values on our side could become too large to handle in some cases, depending on the application, so there's no easy one-size-fits-all solution :s
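
To make the caching concern concrete: one common compromise is a bounded, LRU-style cache of past evaluations, which keeps deduplication memory under control at the cost of missing duplicates that fall outside the window. A minimal sketch, with all names hypothetical (this is not part of nevergrad):

from collections import OrderedDict

class BoundedCache:
    # LRU-style cache: evicts the least-recently-used entry once full.
    def __init__(self, maxsize=10_000):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict the oldest entry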

LucasCampos commented 3 years ago

@jrapin, this is not a big problem for me. Is there any way that I, as a non-dev, can help solve the issue?

Thanks for taking this issue seriously.