CMA-ES / libcmaes

libcmaes is a multithreaded C++11 library with Python bindings for high-performance blackbox stochastic optimization using the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) algorithm.

sep-(a)CMA-ES does escape the solution basin on fsphere in high dimension with gradient injection #97

Closed beniz closed 9 years ago

beniz commented 9 years ago

First, comparing sep-aCMA-ES with and without gradient injection (GI), GI allows fast convergence: [figure: sepacma_fsphere_1m_comp]. The difference in initial value is due to not showing iteration 0 (will be ticketed and fixed).

However, a closer look shows that while nearing the optimum, sep-aCMA-ES with GI escapes the solution basin and stagnates around 5000 or so in 1M-D. Below is a log-scale plot of the objective function value on the same run as above: [figure: sepacma_fsphere_1m_gradient_pb]. The same behavior is observed with sep-CMA-ES.

To reproduce:

./test_functions -fname fsphere -alg sepacmaes -ftarget 1e-8 -dim 1000000 -with_gradient -no_stagnation -max_hist 1000 -x0 1 -sigma0 0.1

This might be a bug; in any case, this is weird behavior.

nikohansen commented 9 years ago

It must already be a bug in "pure" separable CMA-ES, as it is supposed to converge. I also suggest always plotting the y-axis in log scale. For diagnosis, the standard output with step-size(s), eigenvalues, etc. is advisable.

EDIT: I didn't consider that it was 1M-D, in which case this should be considered expected behavior, if the initial step-size is large.

beniz commented 9 years ago

Yes, sigma moves down by ~1e-5, and the condition number up by ~1e-7 in both cases. I can't see any difference before and after step 60 when GI is activated.

Since this is in 1M-D, I will try to reproduce in lower dimension, which would also simplify plotting of eigenvalues and parameter values.

Sep was tested on BBOB and on neural nets of up to ~400-D without experiencing this behavior.

I will report here as it goes...

loshchil commented 9 years ago

modification of the initial step-size and a longer run should help


beniz commented 9 years ago

modification of the initial step-size and a longer run should help

Playing with the initial step-size certainly allows triggering the problem:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 1 -no_stagnation

immediately diverges, while the following converges:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation

I'm investigating whether this could be due to a bug. I am worried that users might try high dimensions and reject the whole thing when a change in sigma would do.

beniz commented 9 years ago

So, a few more experiments in lower dimensions show that on sufficiently long runs, sigma decreases and the search heads in the right direction again.

So I am thinking it could make sense to have a trick that tries a larger-than-usual change in sigma when the search goes in the wrong direction for a certain number of steps?

nikohansen commented 9 years ago

I see, then "initial elitism" might be a good try, and you are right, additionally decreasing sigma by a factor of 2 until the first success is observed is probably quite effective.

One general problem is that cumulative step-size adaptation has a horizon that depends on the dimension (though it is possible to modify this dependency down to sqrt dimension by setting cs accordingly). The parameters or even the method might not be perfectly suited for 1M-D problems. Two-point adaptation is a promising alternative candidate, where it should also be simpler to adjust the parameters.

To get the full picture: what is the population size?

beniz commented 9 years ago

To get the full picture: what is the population size?

lambda = 45

then "initial elitism" might be a good try

The current implementation of initial elitism only kicks in after a run ends. In high dimension, an earlier trigger could be useful.

nikohansen commented 9 years ago

To me it seems generally useful even without restarts. Actually, it seems particularly useful for a user-supplied initial point to be "elitist" for a single run.

nikohansen commented 9 years ago

It remains to be understood why the gradient approach stagnates in the end.

beniz commented 9 years ago

Yes, I may need to run it longer. I will post here results from a set of experiments.

beniz commented 9 years ago

So, I am able to report on three tiny fronts: 1- the stagnation of the gradient approach after a good dip toward the optimum (i.e. the original problem this ticket started from); 2- an elitist x0 scheme where x0 is reinjected as long as half of the candidates are above f(x0); 3- a sigma step-control scheme (division by half) whenever the search appears to be going in the wrong direction for a number of steps.

First, elitist x0 (scheme 2) does not always work well: in some reproducible cases, the search immediately stalls at the f(x0) value unless lambda is increased. Could it be that the reinjection somehow prevents the covariance matrix from shaping so that a search direction appears?

Second, I have tried scheme 3 on problem 1, with mixed results, which may indicate either a bug in gradient injection and/or that scheme 3 needs more careful control.

Due to the linearity of GitHub comments, I will separate the results into different comments.

beniz commented 9 years ago

1- gradient injection scheme stagnation

beniz commented 9 years ago

2- elitist x0: This is a simple modification of 'initial elitism', available for testing on branch 'sep_97', that applies elitism on the first run to x0, i.e. it reinjects x0 until half of the candidate f-values are < f(x0). It is activated with -elitist 2, while -elitist 1 applies the 'original' behavior, i.e. re-injection of the previous run's best candidate.
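For illustration, here is a rough sketch of that reinjection rule (hypothetical Python code, not the actual libcmaes implementation): if fewer than half of the sampled candidates beat f(x0), a single instance of x0 replaces the worst candidate.

import numpy as np

def maybe_reinject_x0(candidates, fvalues, x0, f_x0):
    # candidates: (lambda, dim) array of sampled offspring, fvalues: (lambda,) array
    n_better = int(np.sum(fvalues < f_x0))
    if n_better < len(fvalues) / 2:          # fewer than half beat f(x0): stay elitist on x0
        worst = int(np.argmax(fvalues))      # replace the single worst candidate with x0
        candidates[worst] = x0
        fvalues[worst] = f_x0
    return candidates, fvalues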

While the scheme sometimes works, it appears to stagnate indefinitely in certain cases. Typically, taking the diverging case of a few comments above, applying the new scheme:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 1 -no_stagnation -elitist 2

which yields:

INFO - iter=245 / evals=7105 / f-value=5000 / sigma=0.628006445685068 / last_iter=8
INFO - iter=246 / evals=7134 / f-value=5000 / sigma=0.62681772456298 / last_iter=8
INFO - iter=247 / evals=7163 / f-value=5000 / sigma=0.62563139059145 / last_iter=8
INFO - iter=248 / evals=7192 / f-value=5000 / sigma=0.62444528509015 / last_iter=8
...

The scheme appears to even block convergence on an otherwise converging example:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation -elitist 2

beniz commented 9 years ago

3- sigma (crude) step-control: This scheme halves sigma if the search goes in the wrong direction for a number of steps. In the following, the number of steps has been arbitrarily set to 10. It is activated with the -sigma_stepc option in 'test_functions'.

I tested two triggers for the scheme: 1- a test of whether f(x_cand) > f(x0): this works but can fail when the test triggers repeatedly and lowers sigma too much, to the point where f(x_cand) stagnates while remaining above f(x0). 2- a test of whether f(x_{t-1}) < f(x_t) for a number of steps: this appears to work better on my tiny set of experiments, though there is always the risk of lowering sigma to the point of stagnation.
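A minimal sketch of trigger 2 (halving sigma after a fixed number of consecutive non-improving iterations); names and structure are illustrative only, not the actual libcmaes code:

class SigmaStepControl:
    # Halve sigma when the best f-value has not improved for `patience`
    # consecutive iterations (trigger 2 above, with patience = 10).
    def __init__(self, patience=10):
        self.patience = patience
        self.prev_best = float("inf")
        self.bad_steps = 0

    def update(self, sigma, best_f):
        self.bad_steps = self.bad_steps + 1 if best_f >= self.prev_best else 0
        self.prev_best = best_f
        if self.bad_steps >= self.patience:
            sigma *= 0.5                     # crude correction: halve the step-size
            self.bad_steps = 0
        return sigma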

Using test 2 above:

to reproduce:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 1 -no_stagnation -sigma_stepc
./test_functions -fname fsphere -alg sepacmaes -ftarget 1e-8 -dim 1000000 -with_gradient -no_stagnation -max_hist 1000 -x0 1 -sigma0 0.1

In any case, the latter behavior is my next point of investigation. Any tips on how to understand this better are welcome.

beniz commented 9 years ago

One general problem is that cumulative step-size adaptation has a horizon that depends on the dimension (though it is possible to modify this dependency down to sqrt dimension by setting cs accordingly). The parameters or even the method might not be perfectly suited for 1M-D problems. Two-point adaptation is a promising alternative candidate, where it should also be simpler to adjust the parameters.

Shall we think of a 'set_high' (much as the 'set_noisy') function that would attempt to set a slightly improved set of parameters for high dimensions ? If this appears feasible, I could open a dedicated ticket and prepare the required range of simulations for selecting the parameter values...

nikohansen commented 9 years ago

1- gradient injection scheme stagnation: Try with different initial step-sizes. My prediction is that the y-location of the spike corresponds to the step-size. This would mean that the reason is that the step-size cannot converge to zero as fast as the gradient can point the algorithm to the optimum. Not a surprise, because the gradient effectively makes 5000 evaluations in one iteration.

The best resolution is probably to apply TPA instead of CSA and use a constant damping parameter instead of a dimension-dependent setting.
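For context, a schematic sketch of a TPA step-size update of the kind referred to here (a simplified, hedged version, not the libcmaes or reference implementation), using the c_sigma=0.3 and d_sigma=sqrt(dim) defaults discussed further down in this thread:

import numpy as np

def tpa_update(sigma, s, rank_plus, rank_minus, lam, dim,
               c_sigma=0.3, d_sigma=None):
    # rank_plus / rank_minus: ranks (0 = best) within the lambda offspring of the
    # two points placed symmetrically along the previous mean shift.
    if d_sigma is None:
        d_sigma = np.sqrt(dim)                  # dimension-dependent damping
    z = (rank_minus - rank_plus) / (lam - 1.0)  # > 0 when the forward point ranks better
    s = (1 - c_sigma) * s + c_sigma * z         # smoothed success indicator
    sigma *= np.exp(s / d_sigma)                # grow sigma on success, shrink otherwise
    return sigma, s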

beniz commented 9 years ago

Not a surprise, because, the gradient makes effectively 5000 evaluations in one iteration.

Understood. Just FTR, I am using the function's derivative to avoid that per-iteration cost.

The best resolution is probably to apply TPA instead of CSA and use a constant damping parameter instead a dimension-dependent setting.

OK, I'll make that ticket higher on my list.

nikohansen commented 9 years ago

3- sigma (crude) step-control: I don't see anything wrong with the test cases. Just the same "problem" as before: the expected convergence rate (steepness of the f-graph) of the default strategy is roughly 1 order of magnitude for 10*dimension evaluations.

In the second example, the step-size is IMHO far too large in the end, because the gradient quickly shifted the x-value much closer to the optimum, which renders any previously suitable step-size too large. Same as the spike effect seen in the other experiments.
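To put the rule of thumb above in numbers (a back-of-the-envelope check using the dimension and population size from this thread):

dim, lam = 50_000, 45
evals_per_decade = 10 * dim                  # ~500,000 evaluations per order of magnitude
iters_per_decade = evals_per_decade / lam    # ~11,000 iterations at lambda = 45
print(evals_per_decade, round(iters_per_decade))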

nikohansen commented 9 years ago

instead of

cs = (mueff + 2) / (N + mueff + 3)

we could use

cs = (sqrt(mueff) + 2) / (sqrt(N) + sqrt(mueff) + 3)

as an alternative setting. It reduces the time horizon to sqrt(dimension), but I don't think it solves the problem of facilitating quicker step-size decrements.
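For comparison, the two settings transcribed as code (illustrative only; the mueff value below is a rough placeholder, not taken from this thread):

from math import sqrt

def cs_default(N, mueff):
    return (mueff + 2) / (N + mueff + 3)

def cs_sqrt(N, mueff):
    return (sqrt(mueff) + 2) / (sqrt(N) + sqrt(mueff) + 3)

N, mueff = 1_000_000, 12.0                    # mueff is a placeholder value
print(1 / cs_default(N, mueff))               # adaptation horizon grows roughly like N
print(1 / cs_sqrt(N, mueff))                  # horizon reduced to roughly sqrt(N)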

nikohansen commented 9 years ago

2-elitist x0: I can't reproduce the "blocking" behavior.

INFO - iter=100 / evals=3600 / f-value=50000 / sigma=0.0976089129404275 / last_iter=92
...
INFO - iter=8100 / evals=291600 / f-value=9053.62436620463 / sigma=0.0435240965572909 / last_iter=106

It seems just to be about as slow as I expect (I didn't check the rate precisely).

beniz commented 9 years ago

2-elitist x0: I can't reproduce the "blocking" behavior.

OK, thanks for pointing this out; my sense of the expected convergence rate is still quite blurry.

At first I thought that elitist-x0 could be a default option, but then I observed that it was slowing (or hampering) convergence on the same run with sigma=0.1:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation -elitist 2

compared to:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation

Initially I thought that increasing lambda in this particular case would limit the impact of the reinjected x0, but it doesn't appear to do so. I still have difficulties grasping this one.

Note that elitist-x0 does re-inject a single instance of the initial candidate.

beniz commented 9 years ago

1- gradient injection scheme stagnation: Try with different initial step-sizes. My prediction is that the y-location of the spike corresponds to the step-size. This would mean, that the reason is that step-size cannot converge as fast to zero as the gradient can point the algorithm to the optimum. Not a surprise, because, the gradient makes effectively 5000 evaluations in one iteration.

I believe you are correct, see below: [figure: sep_97_gi_comp]

beniz commented 9 years ago

FYI, this whole thing started because I had noted from our chat to double-check whether there was a bug in GI in high dimension, as I could not see any difference with and without GI when solving neural nets, for instance.

In practice, my opinion is that if the gradient is available for a high-dimensional problem and can be injected even every x steps, it would make sense to mitigate the problem above.

I can try to see what comes out of injecting the gradient less often, which in practice would also be a computational gain. There are other simple things I am willing to try out, such as reinjecting the best solution so far (instead of waiting until the end of the run).

nikohansen commented 9 years ago

Right, exactly as predicted (the kink at the end of the right flank is supposed to correspond to step-size).

The most useful mitigation is to enable quick decrease of the step-size. I think reinjection of the best will only help if mu=1, probably not even then so much. Of course in any case the graphs for the current best might look better (depending on what exactly is plotted), but this is rather of cosmetic nature.

beniz commented 9 years ago

we could use

cs = (sqrt(mueff) + 2) / (sqrt(N) + sqrt(mueff) + 3)

Interesting. It does appear to work better in general on the initial problem. In practice it seems to shift the low spike in y to the first couple of steps of the algorithm. With initial sigma0=1 this early spike is so acute that it triggers the stopping criteria (e.g. going < 1e-8).

Same experiments as before:

[figure: sep_97_gi_comp_ncs]

EDIT: fixed the log-scale

beniz commented 9 years ago

Of course in any case the graphs for the current best might look better (depending on what exactly is plotted), but this is rather of cosmetic nature.

Yes, expected that as well. The bumps in the curves might simply be replaced by flat (reinjected) values.

nikohansen commented 9 years ago

Very interesting and somewhat surprising; it looks almost like the problem is virtually solved, given we see about 1e4 evaluations in 1M-D. We probably could still improve quite a bit with TPA (I will try to find some time to cross-check this with my implementation, though in smaller dimension).

EDIT: it was 5K-D not 1M-D, so no conclusion can be drawn for the latter case.

beniz commented 9 years ago

very interesting and somewhat surprising, looks almost like problem virtually solved, given we see about 1e4 evaluations in 1M-D.

Hmm, if you are referring to the change of the csigma 'horizon', the last graph unfortunately reports on 5K-D. I've tried 1M-D with the updated csigma, and while the spike is closer to the optimum, the rebound appears to jump much higher. I will post a graph of a re-run. Here: [figure: sepacma_1m_gi_ncs]

With sigma (green) and condition number (red): [figure: sepacma_fsphere_1md_gradient_ncs]

I will focus on the TPA scheme as a next step.

Also, I did run into this yesterday, not that I believe it may be of direct help here, but FTR: http://www.reddit.com/r/MachineLearning/comments/2rzmj4/3_new_adaptive_sgd_papers_in_iclr_2015/

EDIT: added graphs; sigma seems to move faster now.

beniz commented 9 years ago

Below is a run with TPA (see #88 for updates); it is to be compared with the set of runs on fsphere in 5K from two days ago.

[figure: sepcma_fsphere_5k_tpa]

It is reproducible from branch 'tpa_88' with:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation -with_gradient -tpa

beniz commented 9 years ago

And this is the 1M-D run with TPA, to be compared with the last 1M-D run from two days ago: [figure: sepacma_fsphere_1md_gradient_tpa]

nikohansen commented 9 years ago

The convergence speed seems rather independent of dimension. Do I interpret this correctly? Generally: could you add an x-annotation, such that it is clear whether it is evaluations or iterations?

nikohansen commented 9 years ago

I suggest two experiments: (1) setting c_sigma=1 (instead of 0.3), to check whether this is the source of the oscillations; (2) setting d_sigma=10 (instead of sqrt(dimension)); and (2a) maybe both at the same time.

beniz commented 9 years ago

Generally: could you add an x-annotation, such that it is clear whether it is evaluations or iterations?

Yes, I've committed a fix. These are all evaluations, except for the two initial plots the ticket was created from and the multi-plots with various values of sigma. Sorry about that, I am not using the same tools.

EDIT: fix

nikohansen commented 9 years ago

When I run TPA-ES (without CMA, with gradient injection) in dimension 5K with my Python code, I get: [screenshot: 2015-01-14 at 18 24 54], and with d_sigma=5: [screenshot: 2015-01-14 at 18 03 23]. In particular, I don't see these oscillations. Did you check that in small dimension everything is fine?

beniz commented 9 years ago

I am puzzled, in 5K it looks worse (than with CSA and modified damping), in 1M it looks better. Also the convergence speed seems rather independent of dimension. Do I interpret this correctly?

My understanding is that it is much better with TPA in both cases.

For clarity, below are the two plots in 5K-D with sep-CMA-ES:

[figure: sepcma_fsphere_5k_tpa2]

EDIT: reproduce TPA on 5K-D from branch 'tpa_88' with:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation -with_gradient -tpa

nikohansen commented 9 years ago

Agreed. The oscillations are worrying; they might indicate a bug.

beniz commented 9 years ago

In particular I don't see these oscillations.

My code is young; there could be a bug in the current TPA. I was not worried about the oscillations because they looked just like a mitigation of the original problem... I will continue to investigate.

EDIT: also, I am already making sure that the gradient injection does not mix with the points used for the TPA.

beniz commented 9 years ago

Here is TPA / dsigma=5 / 5K-D / sepcma / gradient injection: [figure: sepcma_fsphere_5k_tpa_dsig5]

EDIT: could the oscillations be due to the scaling factor applied to z? I am using E||N(0,I)||.
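For reference, the expected norm E||N(0,I)|| in dimension n is commonly approximated as follows in CMA-ES implementations (added here only for context):

from math import sqrt

def chi_n(n):
    # E||N(0,I_n)|| ~ sqrt(n) * (1 - 1/(4n) + 1/(21n^2))
    return sqrt(n) * (1 - 1.0 / (4 * n) + 1.0 / (21 * n * n))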

nikohansen commented 9 years ago

Cross-check whether they are related to dimension, i.e. run in 20-D. Then see the experiment with c_sigma=1 suggested above. I cannot remember ever having seen them, and I don't see how they could be related to a particular normalization.

nikohansen commented 9 years ago

FYI, I get something very similar as you if I set d_sigma to 1.

beniz commented 9 years ago

FYI, I get something very similar as you if I set d_sigma to 1.

Right to the point. I did fix the initialization of d_sigma. The bug was that it used the CMA value of d_sigma. I can now reproduce your code's results (branch 'tpa_88').

I did investigate several little changes in the scaling of the z points and d_sigma, but did not find anything very exciting. Only that changing d_sigma sometimes strongly affects the convergence speed.

I am wondering if it'd make sense to have a dedicated default value of d_sigma when gradient injection is active.

Also, I believe I will be able to close this ticket. For now the default with CMA is to activate the updated csigma value above 1000-D.

Finally, I can run tests in high dimension for TPA to see how best to set d_sigma in this case. I will report about this in ticket #88.

beniz commented 9 years ago

FTR, sepacmaes / 5K-D/ TPA / Gradient injection:

beniz commented 9 years ago

I'm bringing the new csigma value (active above 1000-D) to the 'dev' branch (readying it for the next release).

I will close once TPA makes it to the 'dev' branch as well.

nikohansen commented 9 years ago

I did a few experiments and

d_sigma = 4 - 3.6 / sqrt(dim) 

seems a valid choice for TPA, which (a) reflects the experiments in the original paper up to 200-D about as well as sqrt(dim) and (b) is a more sensible choice with popsize=dim or with gradient injections, because it reflects the possible convergence rate in this case.
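Evaluating this damping for a few dimensions (an illustrative check only) shows that, unlike sqrt(dim), it stays bounded and approaches 4 as the dimension grows:

from math import sqrt

def d_sigma(dim):
    return 4 - 3.6 / sqrt(dim)

for dim in (20, 200, 5_000, 1_000_000):
    print(dim, round(d_sigma(dim), 3))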

nikohansen commented 9 years ago

For the record, I don't see anything wrong with making

cs = (sqrt(mueff) + 2) / (sqrt(N) + sqrt(mueff) + 3)

the general default setting. AFAIK there is no empirical or theoretical evidence that ~ N^-1/2 is a better or worse choice than ~ N^-1 other than what we have found here.

beniz commented 9 years ago

Thanks. FYI I have a set of experiments still running to determine the best d_sigma in high dimensions. I will post the plots in the TPA ticket for cma/sepcma with/without gradient injection. This will also let us make sure we get similar plots.

beniz commented 9 years ago

AFAIK there is no empirical or theoretical evidence that ~ N^-1/2 is a better or worse choice than ~ N^-1 other than what we have found here.

OK, well noted. I noticed the other day that it is the value already used in VD-CMA.

nikohansen commented 9 years ago

Good point. The underlying reasoning is that in default CMA we have c_mu ~ N^-2, while in VD-CMA and sep-CMA we have c_mu ~ N^-1, which suggests also reducing the time horizon for the cumulation (cc and cs) to keep a gap. I use

cc = (1 + 1 / N + mueff / N) / (N**0.5 + 1 / N + 2 * mueff / N)

in the sepCMA case in the Python code. Unfortunately, I can't give a reasoning for the different mueff-corrections for cc and cs off the top of my head.

beniz commented 9 years ago

I'm trying to wrap this ticket up, and my understanding is that this issue is solved by TPA.

Taking the initial problem of the sphere in 1M-D with sep-aCMA-ES and gradient injection, and given the change to cmu made in #88 for sep without TPA as well as the new default for TPA's dsigma (https://github.com/beniz/libcmaes/issues/97#issuecomment-70637205), here are the current results:

./test_functions -fname fsphere -alg sepacmaes -ftarget 1e-8 -dim 1000000 -with_gradient -no_stagnation -max_hist 1000 -x0 1 -sigma0 0.1
./test_functions -fname fsphere -alg sepacmaes -ftarget 1e-8 -dim 1000000 -with_gradient -no_stagnation -max_hist 1000 -x0 1 -sigma0 0.1 -tpa 2

The oscillations in the latter (TPA) are observed here in the case of 1M-D but not in the case of 5K-D (I've checked again just now). Not sure if this qualifies as an issue.

EDIT: added mention of gradient injection

beniz commented 9 years ago

Solved with TPA, closing for now.