Two-points adaptation for CMA-ES

beniz commented 9 years ago

See:

'How to Assess Step-Size Adaptation Mechanisms in Randomised Search': https://hal.inria.fr/hal-00997294v3/document
CMA-ES with Two-Point Step-Size Adaptation: http://arxiv.org/abs/0805.0231

beniz commented 9 years ago

FTR, could be useful for #97

beniz commented 9 years ago

TPA from paper https://hal.inria.fr/hal-00997294v3/document is now available on branch 'tpa_88'. It is compatible with both active CMA and gradient injection. It has been implemented within the ask/eval/tell framework.

Preliminary results (under an early 'no-bug' assumption) are very good, especially convergence, and I will report more thoroughly shortly.

nikohansen commented 9 years ago

FTR, as it belongs here,

d_sigma = 4 - 3.6 / sqrt(dim)

seems a valid choice for TPA, which (a) reflects the experiments in the original paper up to 200-D about as well as sqrt(dim) and (b) is a more sensible choice with popsize=dim or with gradient injections, because it reflects the possible convergence rate in this case.

beniz commented 9 years ago

New dsigma value now in code.

I see several choices for importing TPA into production code (these are OR bullet points):

make TPA default over CMA
make TPA default for sep-CMA and VD-CMA with and/or without gradient injection (in order to fix #97 in the later case)
leave TPA as an option for now

And btw, there's an option to specify dsigma for TPA by hand.

Runs on BBOB will allow to assess overall performances.

nikohansen commented 9 years ago

My current favorite would be to make it default for sep-CMA, for VD-CMA, and when gradient injection is applied.

EDIT: and when (full) elitism is used, which is not implemented in the lib yet.

beniz commented 9 years ago

I do witness a problem with TPA + active CMA-ES (without and with gradient). It does not converge anymore to the optima on the standard test functions such as Rosenbrock and rotated Elli in all dimensions (i.e. starting with 2-D).

I've traced the source of the problem to the customized (opposite) points 'injected' in place of two (originally sampled) candidates. The problem goes away with a significant increase of lambda (>> +2).

To reproduce:

./test_functions -fname rosenbrock -dim 10 -ftarget 1e-8 -alg acmaes -no_stagnation -tpa

which yields something similar to: acmaes_tpa_10d

beniz commented 9 years ago

More testing reveals that TPA is affecting the C_mu^- update of active CMA (using (1)), which appears to be the culprit here: canceling the update in C_mu^- removes the oddity. Will continue to investigate.

(1) "Benchmarking a Weighted Negative Covariance Matrix Update on the BBOB-2010 Noiseless Testbed"

nikohansen commented 9 years ago

This is in contradiction to my results (it works fine for me). We see that the variance in the covariance matrix goes quickly downwards, while the step-size does not, while it should rather be the other way around. My hunch is that the length normalization of the TPA samples is not correct (it's quite tricky to accomplish correctly), the steps are too long and therefore ranked low and therefore produce a strong negative update.

beniz commented 9 years ago

Very useful thank you! Will report here, as usual.

nikohansen commented 9 years ago

FTR, I use

y = mean - mean_old
y *= sum(self.randn(len(y))**2)**0.5 / self.mahalanobis_norm(y)
injected_x = mean + y

with

def mahalanobis_norm(self, dx):
    """compute the Mahalanobis norm that is induced by the adapted
    sample distribution, covariance matrix ``C`` times ``sigma**2``.
    The expected Mahalanobis distance to
    the sample mean is about ``sqrt(dimension)``.
    """
    return sqrt(sum((self.D**-1. * np.dot(self.B.T, dx))**2)) / self.sigma

i.e., the new Mahalanobis norm is approximately sqrt(dimension), as for any other sample.

My bad, as this has not been explicitly documented anywhere so far (it is probably the only way to make it work though).

beniz commented 9 years ago

Thanks, got it. I've implemented it both for full and diagonal cov matrix. I've tried several approximations as well, but sqrt(dim) does not fare well with sep-CMA, nor does ~dim. Therefore for now this is using the full mahalonobis computation.

In order to uncover any bug and be sure to set TPA as default where it fares the best, I've run a slightly more thorough examination (which can be extended to higher dimensions).

Below are f-evaluations/successes for various algorithms (the number of successful runs over 50 runs is reported in brackets) with/without gradient and with/without TPA (f-target is 1e-8, stagnation is deactivated).

In summary, TPA dominates in low-dim for fsphere with gradient. From more limited runs, I know that results are different in higher dimensions (e.g. 5000-D on rosenbrock), but these experiments take much more time to run.

(The misalignment is due to some strange behavior of markdown... trying to clean it up...)

active-CMA on fsphere:

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   728.16 (50)     1056.8 (50)     583.52 (50)     646.88(50)
10  1532.6 (50)     2179.6 (50)     1055.6 (50)     865.8(50)
20  2996.16 (50)    2764.56 (50)    1745.52 (50)    788.4(50)
40  5797.5 (50)     3478.5 (50)     3032.1 (50)     1064.4(50)

active-CMA on rosenbrock:

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   2177.74 (46)        2902.3 (47)     2130.73 (41)        2459.91(45)
10  5859.38 (48)        13378.8 (33)    5533.1 (42)        7595.78(45)
20  20477 (43)          65544.6 (21)    15736.5 (45)       27155.7(43)
40  76284.4 (40)        137940 (39)     49167.8 (48)        87623.8(46)

active-CMA on elli:

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   1761.12 (50)        2644.8 (50)     1486.56 (50)        1878.4(50)
10  4461.4 (50)      7380.4 (50)        3546.4 (50)     4709.4(50)
20  13180.6 (50)        22375.4 (50)        9866.88 (50)        12596.4(50)
40  44848.5 (50)        72888.9 (50)        32805 (50)      37630.8(50)

sep-active-CMA on fsphere

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   657.44 (50)            864.16 (50)      517.28 (50)     608.96(50)
10  1384.6 (50)     1933 (50)               913.6 (50)      1166(50)
20  2687.28 (50)    3786 (50)               1516.56 (50)        1703.04(50)
40  5327.7 (50)               6632.4 (50)       2669.1 (50)     2015.7(50)
100 12495.7 (50)    10894.6 (50)        5687.86 (50)        1354.56(50)

sep-active-CMA on rosenbrock

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   inf (0)                  inf (0)         inf (0)           inf(0)
10  inf (0)                  inf (0)         inf (0)           inf(0)
20  inf (0)                  inf (0)        78946.9 (43)            inf(0)
40  210740 (45)          inf (0)        181238 (38)         inf(0)

sep-active-CMA on elli

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   1298.24 (50)        1938.4 (50)     1136.8 (50)     1491.84(50)
10  2941.8 (50)     4698.4 (50)     2336.2 (50)     3179.4(50)
20  6595.92 (50)        11927.5 (50)        4612.8 (50)     7191.6(50)
40  15636.3 (50)        30640.8 (50)        10174.2 (50)        17035.5(50)

nikohansen commented 9 years ago

The results still don't look correct. My TPA implementation takes about the same number of evaluations on the 20 and 40-D ellipsoid as CSA (around 14000 / 50000 vs 14000 / 45000), ditto on Rosenbrock (where however results are more sensitive to the initial conditions).

It might be useful to cross-check the result on the Rosenbrock function. With x0=0.1, sigma0=0.1 I see only around 60000 evaluations in 40D on successful runs.

nikohansen commented 9 years ago

Just in case this was not so clear: The Mahalanobis length of both TPA-samples is determined by a single drawing of

sum(np.random.randn(dimension)**2)**0.5

In larger dimension it shouldn't make a big difference whether it is only a single instantiation.

beniz commented 9 years ago

Nailed it, thanks!

active-CMA on elli

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   1726.88 (50)        1659.84 (50)        1470.24 (50)        1369.92(50)
10  4472.8 (50)     4432.8 (50)     3580.6 (50)     3165.4(50)
20  13200.7 (50)        14059.4 (50)        9869.76 (50)        8974.32(50)
40  45009 (50)      51042.6 (50)        32723.4 (50)        32014.8(50)

active-CMA on rosenbrock

beniz@haumea:~/research/siminole/dev/libcmaes/tests$ ./tpa_tests -dims 5,10,20,40 -runs 50 -alg acmaes -fname rosenbrock
D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   2128.68 (47)        1995.57 (47)        2012.38 (42)        2048.59(41)
10  5779.59 (49)        5818.48 (46)        6116.5 (40)     5718.95(38)
20  18750.6 (47)        19136.8 (45)        19519.8 (37)        14428.2(48)
40  67591 (45)      69697.8 (44)        58658.4 (41)        61445(42)

beniz commented 9 years ago

In larger dimension it shouldn't make a big difference whether it is only a single instantiation.

Yes, understood (also, I am already using a single drawing for both samples). This is one among a handful of computational improvements on my list then.

FYI, another important one is to keep rank into Candidate object, which requires a structural change that goes beyond just this ticket I believe. Typically, it would allow for faster check of the rank of the two samples of interest here, and it relates to the use of the lib with f-rankings only.

beniz commented 9 years ago

Last commits bring TPA as default to gradient, sep and vd, if at least one is activated. I believe we're getting ready for a merge to 'dev', then into next production release.

beniz commented 9 years ago

TPA is now merged into 'dev', which means it is in line for next release. I am closing this issue, and it can be reopened for any subsequent developments.

nikohansen commented 9 years ago

Do you have the complete results after the bug fix? In my experiments TPA tends to be a little worse than CSA, e.g. also with sep-CMA, unless it is only about realizing fast convergence speed. If you observe the same, maybe it should only become default under gradient injection.

beniz commented 9 years ago

Here is the complement, only up to 40-D. To push further requires a bit of time, let me know if you need 100-D and beyond. Note that gradient here uses the functional form, so there's no impact on the number of fevals with and without tpa. This is using ftarget=1e-8. Runs are reproducible with, for example:

./tests/tpa_tests -dims 5,10,20,40 -runs 50 -alg acmaes -fname fsphere

Functions fsphere, elli and rosenbrock are supported.

active-CMA on fsphere:

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   730.4 (50)      668.48 (50)     586.4 (50)      445.12(50)
10  1525 (50)       1205.6 (50)     1048.4 (50)     644.4(50)
20  2988.48 (50)        1964.88 (50)        1731.84 (50)        815.76(50)
40  5779.8 (50)     3412.8 (50)     3034.8 (50)     1036.5(50)
100     13147.8 (10)            7403.5 (10)             6237.3 (10)             1293.7(10)

sep-active-CMA on fsphere:

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   666.24 (50)     601.92 (50)     520.64 (50)     403.68(50)
10  1356.6 (50)     1079.6 (50)     926.8 (50)      598.4(50)
20  2709.12 (50)        1859.52 (50)        1525.68 (50)        771.84(50)
40  5300.1 (50)     3325.2 (50)     2668.8 (50)     1006.8(50)
100 12480.4 (50)        7555.14 (50)        5687.86 (50)        1229.78(50)

sep-active-CMA on elli:

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   1307.84 (50)        1283.84 (50)        1127.68 (50)        1111.36(50)
10  3008.6 (50)     2810 (50)       2304.6 (50)     2299(50)
20  6627.36 (50)        6550.32 (50)        4686.48 (50)        5567.76(50)
40  15414.9 (50)        16490.1 (50)        10166.1 (50)        23457(50)
100 44242.5 (50)        52542.6 (50)        26145 (50)      inf(0)

sep-active-CMA on rosenbrock:

D   fevals_avg      fevals_avg_tpa      fevals_avg_gi       fevals_avg_gi_tpa
5   inf (0)     inf (0)     inf (0)     inf(0)
10  inf (0)     inf (0)     inf (0)     inf(0)
20  inf (0)     inf (0)     69486.6 (47)        inf(0)
40  194963 (48)     inf (0)     135850 (47)     inf(0)

EDIT: added 100-D elli for sep-active-CMA and I can reproduce the oddity for TPA + gradient (tolX).

nikohansen commented 9 years ago

I just tried to replicate the suspicious entry with sep-active-CMA on felli with my Python implementation. I get roughly (from at least two runs each)

D    fevals_avg_gi fevals_avg_gi_tpa
40       9500        12000
100     27000        60000
200     58000        inf

Two observations: our implementations differ (this might be a small difference though, not necessarily a bug), and there is a relevant problem with TPA, which you observe already in smaller dimension than I do.

screen shot 2015-01-26 at 16 38 15 Figure: Run on felli, sep-CMA with TPA and gradient injection in 200-D. If the code would allow arbitrary numerical precision, it would probably not fail, but the large condition number is undesirable in any case.

The failure seems only to occur with both, gradient and TPA being active.

On Rosenbrock, result inf is rather misleading, because I can make it work simply by removing the termination criteria tolupsigma and maxiter.

I consider the failure to converge to the optimum on the axis-parallel ellipsoid function a major flaw, not as bad but comparable to the failure on the sphere function.

Conclusion: TPA should not become default in production code, before this problem is more deeply investigated.

beniz commented 9 years ago

TPA should not become default in production code, before this problem is more deeply investigated.

Defaults are now deactivated on 'dev' and 'tpa_88' branches. Checking on the difference between implementations is next.

beniz commented 9 years ago

Two observations: our implementations differ (this might be a small difference though, not necessarily a bug)

I have found where the difference comes from: I was using

cmu = min(1 - c1, 2 * (mueff - 2 + 1/mueff) / ( (N+2)^2 + mueff)

instead of (value in your last Python implementation):

cmu = min(1 - c1, (0.3 + mueff - 2 + 1/mueff) / (N + 4 * N^0.5 + mueff/2)

So now elli in 100-D with gradient and TPA converges fine, while the problem is observed on some runs in 200-D. Maybe this can help shed light on the deeper flaw.

nikohansen commented 9 years ago

Excellent, that's another bug-fix then which should improve the performance of sep-CMA-ES in larger dimension remarkably.

The main problem remains to be investigated, I am looking into it.

beniz commented 9 years ago

TPA is working fine AFAIK, so I'm closing this until new details or research has to be merged in.

CMA-ES / libcmaes

Two-points adaptation for CMA-ES #88