tlienart opened this issue 4 years ago
Am I using it wrong?
```julia
function rosenbrock2d(x)
    return (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
end

function obj(p::Dict)
    return rosenbrock2d([p[:x1], p[:x2]])
end

space = Dict(:x1 => hp_uniform(:x1, -5, 5), :x2 => hp_uniform(:x2, -5, 5));

fmin(obj, space, 50)
```
returns
```
FMin.run!: 50 / 50 trials carried out
Dict{Symbol,Float64} with 2 entries:
  :x2 => 3.29693
  :x1 => 1.79932
```
whereas BlackBoxOptim
```julia
res = bboptimize(rosenbrock2d; SearchRange = (-5.0, 5.0), NumDimensions = 2)
best_candidate(res)
```
returns the (accurate)
```
2-element Array{Float64,1}:
 0.9999999999961342
 0.9999999999925864
```
Any thoughts? Even with 500 trials we're quite far off, and this is a function for which plain random search gives good results within a few thousand steps.
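For reference, the random-search claim is easy to sanity-check with a throwaway sketch (plain Python with a fixed seed; nothing here comes from either package, the function and seed are my own choices):

```python
import random

def rosen(x1, x2):
    return (1.0 - x1)**2 + 100.0 * (x2 - x1**2)**2

def random_search(f, n, lo=-5.0, hi=5.0, seed=0):
    # Uniform random search over the same [-5, 5]^2 box used above.
    rng = random.Random(seed)
    best, best_x = float("inf"), None
    for _ in range(n):
        x = (rng.uniform(lo, hi), rng.uniform(lo, hi))
        v = f(*x)
        if v < best:
            best, best_x = v, x
    return best, best_x

best5000, _ = random_search(rosen, 5000)
```

With a few thousand draws the best loss typically lands well under 1, i.e. inside the valley, which is the baseline any model-based optimiser should beat.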
Here are some more trials
```
FMin.run!: 50 / 50 trials carried out
50 Dict(:x2 => 0.799543567919683,:x1 => 0.8538950940564665)
FMin.run!: 100 / 100 trials carried out
100 Dict(:x2 => 0.1643004860758952,:x1 => -0.41535766294995335)
FMin.run!: 200 / 200 trials carried out
200 Dict(:x2 => 0.1411307201249039,:x1 => 0.29090448536707836)
FMin.run!: 500 / 500 trials carried out
500 Dict(:x2 => 1.4765354269218587,:x1 => 1.2189874229177695)
FMin.run!: 1000 / 1000 trials carried out
1000 Dict(:x2 => 0.8353650505167384,:x1 => 0.915959493150407)
FMin.run!: 2000 / 2000 trials carried out
2000 Dict(:x2 => 0.9558778608402061,:x1 => 0.9772786544185521)
```
Not great. I'm going to put the same through the original hyperopt and see what happens.
Results from Python
```yaml
# environment.yml
name: rosenbrock
channels:
  - defaults
dependencies:
  - pip
  - pip:
    - hyperopt ==0.1.2
```
The code
```python
from hyperopt import hp, fmin, tpe

def rosenbrock2d(x1, x2):
    return (1.0 - x1) ** 2 + 100.0 * (x2 - x1 ** 2) ** 2

def obj(p):
    return rosenbrock2d(p['x1'], p['x2'])

space = {'x1': hp.uniform('x1', -5, 5), 'x2': hp.uniform('x2', -5, 5)}

result50 = fmin(obj, space, algo=tpe.suggest, max_evals=50)
print('50 ', result50)
result100 = fmin(obj, space, algo=tpe.suggest, max_evals=100)
print('100 ', result100)
result200 = fmin(obj, space, algo=tpe.suggest, max_evals=200)
print('200 ', result200)
result500 = fmin(obj, space, algo=tpe.suggest, max_evals=500)
print('500 ', result500)
result1000 = fmin(obj, space, algo=tpe.suggest, max_evals=1_000)
print('1000 ', result1000)
result2000 = fmin(obj, space, algo=tpe.suggest, max_evals=2_000)
print('2000 ', result2000)
```
The results
```
50/50 [00:00<00:00, 122.36it/s, best loss: 9.713418601918795]
50 {'x1': 0.45865419152303843, 'x2': 0.5172897711719768}
100/100 [00:00<00:00, 108.41it/s, best loss: 0.07758009764149705]
100 {'x1': 0.9408197663038045, 'x2': 0.8579245959270194}
200/200 [00:02<00:00, 97.25it/s, best loss: 0.016855271351971703]
200 {'x1': 1.1258791364608522, 'x2': 1.270781430200818}
500/500 [00:06<00:00, 72.80it/s, best loss: 0.004552196135317346]
500 {'x1': 1.0504683012406382, 'x2': 1.1079615383606018}
1000/1000 [00:18<00:00, 54.92it/s, best loss: 0.02637746060378691]
1000 {'x1': 0.9136235711797501, 'x2': 0.8484617831749768}
2000/2000 [00:56<00:00, 35.45it/s, best loss: 0.006896342055002183]
2000 {'x1': 0.9359273194623061, 'x2': 0.8706769238787768}
```
So our behaviour looks similar to the original.
Now the question is: is this a problem with hyperopt that we've copied over, or is this Rosenbrock task simply not a good demo of the tree-parzen technique?
It could be Rosenbrock but it's a very simple example (especially in 2D). Either way the fact that the behaviour in Python is the same is reassuring :+1:
Before releasing the package though, it would be really useful to define a few examples in more than 1D where TreeParzen does well (and, at the very least, does better than random search). Maybe the original package has non-trivial examples where this is the case?
Edit: upon closer inspection, the Python version actually seems to do better? Is this just a random artifact, i.e. TreeParzen.jl needs > 1000 rounds to produce a decent estimate whereas the original library is already "ok-ish" after ~100?
Rosenbrock is a great acid test for non-convex optimisers. Still, it might be worth taking a look at higher-dimensional variants, just in case this is a weird artifact of TP in low-dimensional spaces (I would find that odd, though: we know random search cannot do well in larger spaces, so anything sensible should beat it there). It might be time to go back to the original paper. I don't put it past the implementation in hyperopt to simply be wrong (or even a reference implementation provided by the TP authors, if such a thing exists).
It's also worth bearing in mind that we have an ordinary R^N space here, but TP isn't designed to handle only R^N spaces. Perhaps we should stack it up against something with a space containing some discrete axes and see what happens. (I appreciate that coming up with a useful space of this kind is actually non-trivial, which is also why it's worth doing: it could give us a useful acid test going forward.)
https://facebookresearch.github.io/nevergrad/benchmarking.html
^ We might be able to use this to get some acid tests, and also have more things to compare with. The library of optimisers in this package is impressive
Started reading the TreeParzen paper (it turns out hyperopt is indeed the reference implementation, if we consider that the primary paper author and the primary package author are the same person).
I want to point at a certain element of the paper I came across so far:
> We keep Estimation of Distribution (EDA) approach on the discrete part of our input space, where we sample candidate points according to binomial distributions, while we use the Covariance Matrix Adaptation - Evolutionary Strategy (CMAES) for the remaining part of our space (continuous hyper parameters)
Ok, cool.
Let's go find a CMAES implementation in Julia ... https://github.com/Staross/JuliaCMAES
Can it be trusted? I don't know. But the README has a few things which immediately grab one's attention:
Test on the Rosenbrock function:

```
...
iter: 100 fcount: 600 fval: 3.69e-08 axis-ratio: 2.22e+01
...
pmin:
[1.0000052051864698,1.0000108150577227]
fmin:
4.346875072347545e-11
```
Hey! This didn't struggle with a real-valued space -- specifically, Rosenbrock ... hmmmm. To me this does strongly suggest that ... hyperopt is wrong. Somewhere.
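For intuition without a CMAES dependency: even a bare-bones (1+1)-ES with the 1/5th-success rule, a much simpler relative of CMA-ES with no covariance adaptation at all, has little trouble descending the Rosenbrock valley. A minimal sketch (all names, constants, and the seed are my own choices):

```python
import random

def rosen(x1, x2):
    return (1.0 - x1)**2 + 100.0 * (x2 - x1**2)**2

def one_plus_one_es(f, x, sigma=0.5, iters=10_000, seed=0):
    """Elitist (1+1)-ES with 1/5th-success-rule step-size adaptation.

    Not CMA-ES: there is no covariance matrix here, just an isotropic
    Gaussian mutation whose step size adapts to the success rate.
    """
    rng = random.Random(seed)
    fx = f(*x)
    for _ in range(iters):
        cand = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        fc = f(*cand)
        if fc <= fx:          # elitist: keep the better point
            x, fx = cand, fc
            sigma *= 1.1      # success -> widen the search
        else:
            sigma *= 0.98     # failure -> shrink (balances near a 1/5 success rate)
    return x, fx

xbest, fbest = one_plus_one_es(rosen, [-1.0, 1.0])
```

If even this toy strategy gets deep into the valley in a few thousand evaluations, a full CMAES doing so is unsurprising; it sharpens the question of why the TPE results above stall.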
Interesting. Yes I don't recall hearing of CMAES anywhere in the hyperopt code.
After a chat with Mikey, he has convinced me this is a red herring: i.e. the paper was not saying "TPE works this way" but rather "this is how GP-style optimisers might work".
However, we've agreed that even in the absence of a tree structure over the hyperparameters, the probabilistic modelling shouldn't result in such poor optimisation... still more pontificating to be done.
https://en.wikipedia.org/wiki/Test_functions_for_optimization
^ There's a lot of fun to be had here
@yalwan-iqvia Please can you write up your previous findings into a new issue on https://github.com/IQVIA-ML/TreeParzen.jl
An attempt to summarise the findings so far:
Although not explicitly stated, the modelling system from the paper appears to assume conditional independence of the variables being optimised.
The implementation fits distributions to each variable marginally (i.e. univariate fitting); even siblings of the same parent node in the tree structure (if any) are fit this way. Take a 2-D example (such as Rosenbrock or a parabola): we fit a distribution to our X component and immediately sample from it. Rather than fitting a distribution to Y as well, sampling from each, and picking the pair with the best EI, we first select the X with the best EI and then, given this X, select the Y with the best EI. Note that the only way to actually capture the dependence would be to model X,Y jointly and sample from that distribution (which, admittedly, is nontrivial).
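A quick way to see why the marginal treatment hurts on Rosenbrock (an illustrative sketch, not TreeParzen's actual code; the sample count and elite fraction are arbitrary): take the lowest-loss fraction of uniform random samples, which hug the valley y ≈ x², and compare their mean loss against the same coordinates with the X-Y pairing shuffled away, i.e. treated as independent marginals:

```python
import random

def rosen(x1, x2):
    return (1.0 - x1)**2 + 100.0 * (x2 - x1**2)**2

def marginal_vs_joint(n=20_000, top=0.05, seed=0):
    rng = random.Random(seed)
    pts = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(n)]
    pts.sort(key=lambda p: rosen(*p))
    elite = pts[: int(n * top)]          # lowest-loss points hug the valley y ~ x^2
    xs = [p[0] for p in elite]
    ys = [p[1] for p in elite]
    joint = sum(rosen(x, y) for x, y in elite) / len(elite)
    rng.shuffle(ys)                      # break the pairing: independent marginals
    indep = sum(rosen(x, y) for x, y in zip(xs, ys)) / len(elite)
    return joint, indep
```

The shuffled (marginal) recombination lands far off the curved valley, so its mean loss is much worse than that of the jointly kept pairs, which is exactly the information a per-variable fit throws away.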
It seems that this is known: https://medium.com/criteo-labs/hyper-parameter-optimization-algorithms-2fe447525903 — "One of the great drawbacks of tree-structured Parzen estimators is that they do not model interactions between the hyper-parameters. It is thus much less able than Gaussian processes when two hyper-parameters strongly interact."
Rosenbrock is a good example of a function where the parameters interact (not necessarily strongly; it depends on the region of the space).
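The "depends on the region" point can be made precise: for f(x1, x2) = (1 - x1)² + 100(x2 - x1²)², the mixed partial is ∂²f/∂x1∂x2 = -400·x1, so the interaction vanishes at x1 = 0 and grows linearly away from it. A finite-difference check (sketch):

```python
def rosen(x1, x2):
    return (1.0 - x1)**2 + 100.0 * (x2 - x1**2)**2

def cross_partial(x1, x2, h=1e-4):
    # Central finite difference for d^2 f / (dx1 dx2); analytically -400 * x1.
    return (rosen(x1 + h, x2 + h) - rosen(x1 + h, x2 - h)
            - rosen(x1 - h, x2 + h) + rosen(x1 - h, x2 - h)) / (4 * h * h)
```

Near the origin the function is locally almost separable, while out at x1 = 2 the cross term is of magnitude 800, dwarfing the diagonal curvature in x2 (which is constant at 200).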
In terms of the implementation vs the claims in the paper, apart from the ambiguity about conditional independence of variables (which isn't explicitly stated but @mikey-iqml suggests is implied), this seems to be behaving as intended, both in hyperopt and here. It is worth noting that the paper doesn't claim to find a global minimum, so, for example, merely landing within the valley of the Rosenbrock function is likely acceptable (and in the context of hyperparameter search, this is likely fine).
There is still some investigation to be done on some other strange behaviours (I will attempt to summarise these separately, since they date from a while ago and may no longer be reproducible).
For those wishing to investigate the behaviour with a variety of non-convex functions, use the following (the expressions might be slightly wrong):
```julia
using TreeParzen

space = Dict(:x1 => HP.Uniform(:x1, -5., 5.), :x2 => HP.Uniform(:x2, -5., 5.));

sphere(;x1=0, x2=0) = x1^2 + x2^2
sphere(d) = sphere(;d...)

rosen(;x1=1, x2=1) = 100 * (x2 - x1^2)^2 + (1 - x1)^2
rosen(d) = rosen(;d...)

rastrigin(;x1=0, x2=0) = 20 + x1^2 - 10 * cos(2*pi*x1) + x2^2 - 10 * cos(2*pi*x2)
rastrigin(d) = rastrigin(;d...)

# Note: the standard Ackley function has a 0.5 factor inside the sqrt.
ackley(;x1=0, x2=0) = -20 * exp(-0.2 * sqrt(0.5 * (x1^2 + x2^2))) -
    exp(0.5 * (cos(2*pi*x1) + cos(2*pi*x2))) + exp(1) + 20
ackley(d) = ackley(;d...)

fmin(ackley, space, 50)
fmin(sphere, space, 50)
fmin(rosen, space, 50)
fmin(rastrigin, space, 50)
```
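For cross-checking against the Python side, here are sketch equivalents using the standard textbook definitions (including the 0.5 factor inside Ackley's square root); each has known minimum value 0 at its usual optimum, which makes them easy to sanity-check:

```python
import math

def sphere(x1, x2):
    # Minimum 0 at (0, 0).
    return x1**2 + x2**2

def rosen(x1, x2):
    # Minimum 0 at (1, 1).
    return 100 * (x2 - x1**2)**2 + (1 - x1)**2

def rastrigin(x1, x2):
    # Minimum 0 at (0, 0); heavily multimodal.
    return (20 + x1**2 - 10 * math.cos(2 * math.pi * x1)
               + x2**2 - 10 * math.cos(2 * math.pi * x2))

def ackley(x1, x2):
    # Minimum 0 at (0, 0); note the 0.5 inside the sqrt (standard form).
    return (-20 * math.exp(-0.2 * math.sqrt(0.5 * (x1**2 + x2**2)))
            - math.exp(0.5 * (math.cos(2 * math.pi * x1)
                              + math.cos(2 * math.pi * x2)))
            + math.e + 20)
```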
I tried a couple of extra things with Rosenbrock, including seeing how badly it does with just 50 trials over a few thousand fmin attempts:

```julia
evaluator(result) = max(abs(1 - result[:x1]), abs(1 - result[:x2]))

function findworst(trials, attempts)
    worst = 0.
    worstres = Dict(:x1 => 1., :x2 => 1.)
    for i in 1:attempts
        fmr = fmin(rosen, space, trials)
        r = evaluator(fmr)
        if r > worst
            worst = r  # track the worst score, not just the worst result
            worstres = fmr
        end
    end
    return worstres
end

println(findworst(50, 5000))
```
Gives:

```
Dict{Symbol,Any}(:x2 => 0.3669191434041903,:x1 => 0.6124080254584413)
```
If we try with 100 optimisation trials:

```julia
println(findworst(100, 5000))
```

```
Dict{Symbol,Any}(:x2 => 0.6500374886802355,:x1 => 0.8044138837632724)
```
and if we try with 250 optimisation trials:

```julia
println(findworst(250, 5000))
```

```
Dict{Symbol,Any}(:x2 => 1.012649888929622,:x1 => 1.0049042342109367)
```
So we can see it is definitely different and better than before.
@tlienart, what do you think?
On the test functions given above, it's well worth noting that, while they are standard examples for non-convex optimisers, apart from Rosenbrock they mostly lack strong X,Y dependence. It might be worth hunting a bit more for test functions which aren't just non-convex.
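One cheap candidate (my own suggestion, not from any benchmark suite) is a rotated ellipsoid, which has strong linear X-Y dependence: its low-loss region is a narrow valley along the x1 = x2 diagonal, so elite samples show a strong Pearson correlation between the coordinates, unlike the sphere. A sketch of that probe:

```python
import random

def sphere(x1, x2):
    return x1**2 + x2**2

def rotated_ellipsoid(x1, x2):
    # Narrow valley along the x1 == x2 diagonal, centred near (1, 1).
    return (x1 + x2 - 2)**2 + 100 * (x1 - x2)**2

def elite_corr(f, n=20_000, top=0.05, seed=0):
    # Pearson correlation of (x1, x2) over the lowest-loss fraction of
    # uniform samples: a crude probe for parameter interaction.
    rng = random.Random(seed)
    pts = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(n)]
    pts.sort(key=lambda p: f(*p))
    elite = pts[: int(n * top)]
    xs = [p[0] for p in elite]
    ys = [p[1] for p in elite]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx)**2 for a in xs)
    vy = sum((b - my)**2 for b in ys)
    return cov / (vx * vy)**0.5
```

A test function whose elite correlation is near 1 should separate optimisers that model variable interactions from those (like a marginal TPE) that cannot.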
It would be nice to see how it compares to BlackBoxOptim.jl.
I'll have a try and report