IQVIA-ML / TreeParzen.jl

TreeParzen.jl, a pure Julia hyperparameter optimiser with MLJ integration

Compare with BlackBoxOptim.jl #29

Open tlienart opened 4 years ago

tlienart commented 4 years ago

It would be nice to see how it compares to BlackBoxOptim.jl.

I'll have a try and report back.

tlienart commented 4 years ago

Am I using it wrong?

using TreeParzen

# 2D Rosenbrock: global minimum of 0 at (1, 1)
function rosenbrock2d(x)
    return (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
end

function obj(p::Dict)
    return rosenbrock2d([p[:x1], p[:x2]])
end

space = Dict(:x1 => hp_uniform(:x1, -5, 5), :x2 => hp_uniform(:x2, -5, 5));

fmin(obj, space, 50)

returns

FMin.run!: 50 / 50 trials carried out
Dict{Symbol,Float64} with 2 entries:
  :x2 => 3.29693
  :x1 => 1.79932

whereas BlackBoxOptim

using BlackBoxOptim

res = bboptimize(rosenbrock2d; SearchRange = (-5.0, 5.0), NumDimensions = 2)
best_candidate(res)

returns the (accurate)

2-element Array{Float64,1}:
 0.9999999999961342
 0.9999999999925864

Any thoughts? Even with 500 trials we're quite far off, and this is a function for which random search gives good results within a few thousand steps.
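
For reference, a minimal random-search baseline along these lines might look like the following (a hypothetical helper, not part of TreeParzen.jl or BlackBoxOptim.jl; it reuses rosenbrock2d from above):

# Hypothetical baseline: uniform random search over [-5, 5]^2,
# keeping the best point seen so far.
function random_search(f, n; lo = -5.0, hi = 5.0)
    best_x, best_f = [0.0, 0.0], Inf
    for _ in 1:n
        x = lo .+ (hi - lo) .* rand(2)
        fx = f(x)
        if fx < best_f
            best_x, best_f = x, fx
        end
    end
    return best_x, best_f
end

random_search(rosenbrock2d, 5_000)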

ghost commented 4 years ago

Here are some more trials

FMin.run!: 50 / 50 trials carried out
50 Dict(:x2 => 0.799543567919683,:x1 => 0.8538950940564665)
FMin.run!: 100 / 100 trials carried out
100 Dict(:x2 => 0.1643004860758952,:x1 => -0.41535766294995335)
FMin.run!: 200 / 200 trials carried out
200 Dict(:x2 => 0.1411307201249039,:x1 => 0.29090448536707836)
FMin.run!: 500 / 500 trials carried out
500 Dict(:x2 => 1.4765354269218587,:x1 => 1.2189874229177695)
FMin.run!: 1000 / 1000 trials carried out
1000 Dict(:x2 => 0.8353650505167384,:x1 => 0.915959493150407)
FMin.run!: 2000 / 2000 trials carried out
2000 Dict(:x2 => 0.9558778608402061,:x1 => 0.9772786544185521)

Not great. I'm going to put the same through the original hyperopt and see what happens.

ghost commented 4 years ago

Results from Python

# environment.yml

name: rosenbrock
channels:
  - defaults
dependencies:
  - pip
  - pip:
    - hyperopt ==0.1.2

The code

from hyperopt import hp, fmin, tpe

def rosenbrock2d(x1, x2):
    return (1.0 - x1) ** 2 + 100.0 * (x2 - x1 ** 2) ** 2

def obj(p):
    return rosenbrock2d(p['x1'], p['x2'])

space = {'x1': hp.uniform('x1', -5, 5), 'x2': hp.uniform('x2', -5, 5)}

result50 = fmin(obj, space, algo=tpe.suggest, max_evals=50)
print('50 ', result50)

result100 = fmin(obj, space, algo=tpe.suggest, max_evals=100)
print('100 ', result100)

result200 = fmin(obj, space, algo=tpe.suggest, max_evals=200)
print('200 ', result200)

result500 = fmin(obj, space, algo=tpe.suggest, max_evals=500)
print('500 ', result500)

result1000 = fmin(obj, space, algo=tpe.suggest, max_evals=1_000)
print('1000 ', result1000)

result2000 = fmin(obj, space, algo=tpe.suggest, max_evals=2_000)
print('2000 ', result2000)

The results

50/50 [00:00<00:00, 122.36it/s, best loss: 9.713418601918795]
50  {'x1': 0.45865419152303843, 'x2': 0.5172897711719768}

100/100 [00:00<00:00, 108.41it/s, best loss: 0.07758009764149705]
100  {'x1': 0.9408197663038045, 'x2': 0.8579245959270194}

200/200 [00:02<00:00, 97.25it/s, best loss: 0.016855271351971703]
200  {'x1': 1.1258791364608522, 'x2': 1.270781430200818}

500/500 [00:06<00:00, 72.80it/s, best loss: 0.004552196135317346]
500  {'x1': 1.0504683012406382, 'x2': 1.1079615383606018}

1000/1000 [00:18<00:00, 54.92it/s, best loss: 0.02637746060378691]
1000  {'x1': 0.9136235711797501, 'x2': 0.8484617831749768}

2000/2000 [00:56<00:00, 35.45it/s, best loss: 0.006896342055002183]
2000  {'x1': 0.9359273194623061, 'x2': 0.8706769238787768}

So our behaviour looks similar to the original.

Now the question is: is this a problem with hyperopt that we've copied over, or is the Rosenbrock task simply not a good demo of the tree-structured Parzen estimator (TPE) technique?

tlienart commented 4 years ago

It could be Rosenbrock, but it's a very simple example (especially in 2D). Either way, the fact that the behaviour in Python is the same is reassuring :+1:

Before releasing the package, though, it would be really useful to define a few examples in more than 1D where TreeParzen does well (and, at the least, does better than random search). Maybe the original package has non-trivial examples where this is the case?

Edit: upon closer inspection, it seems the Python version actually does better? Is it just a random artifact that TreeParzen.jl needs > 1000 rounds to produce a good estimate, whereas the original library is "ok-ish" after ~100?

yalwan-iqvia commented 4 years ago

Rosenbrock is a great acid test for non-convex optimisers, but it might be worth taking a look at higher-dimensional variants, just in case this is a weird artifact of TPE in low-dimensional spaces. (I would still find this odd, since random search cannot do well in higher-dimensional spaces, so anything doing something sensible should find better results.) It might be time to go back to the original paper. I don't put it past the implementation in hyperopt to simply be wrong (nor even a reference implementation provided by the TPE authors, if such a thing exists).

yalwan-iqvia commented 4 years ago

It's also worth bearing in mind that we have an ordinary R^N space here, but TPE isn't designed to handle only R^N spaces. Perhaps we should stack it up against something whose space contains some discrete axes and see what happens. (I appreciate that coming up with a useful space here is actually a non-trivial task, which is also why it would be useful: it could give us a good acid test going forward.)
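
As a rough sketch of what such a mixed space could look like (assuming HP.Choice mirrors hyperopt's hp.choice; the toy objective here is made up purely for illustration):

using TreeParzen

# Sketch only: one continuous axis plus one discrete axis.
mixed_space = Dict(
    :x => HP.Uniform(:x, -5.0, 5.0),
    :k => HP.Choice(:k, [1.0, 2.0, 4.0]),
)

# Minimised at x = 0 with the smallest scale k = 1.
mixed_obj(p::Dict) = p[:k] * p[:x]^2 + (p[:k] - 1.0)^2

fmin(mixed_obj, mixed_space, 100)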

yalwan-iqvia commented 4 years ago

https://facebookresearch.github.io/nevergrad/benchmarking.html

^ We might be able to use this to get some acid tests, and also have more things to compare with. The library of optimisers in this package is impressive

yalwan-iqvia commented 4 years ago

Started reading the TPE paper (it turns out hyperopt is indeed the reference implementation, given that the primary paper author and the primary package author are the same person).

I want to point out a passage of the paper I've come across so far:

"We keep Estimation of Distribution (EDA) approach on the discrete part of our input space, where we sample candidate points according to binomial distributions, while we use the Covariance Matrix Adaptation - Evolutionary Strategy (CMAES) for the remaining part of our space (continuous hyper parameters)"

Ok, cool.

Let's go find a CMAES implementation in Julia ... https://github.com/Staross/JuliaCMAES

OK, can it be trusted? I don't know. But the README has a few things which immediately grab one's attention:

Test on the Rosenbrock function:
...
iter: 100    fcount: 600     fval: 3.69e-08      axis-ratio: 2.22e+01
...
pmin:
[1.0000052051864698,1.0000108150577227]

fmin:
4.346875072347545e-11

Hey! This didn't struggle with a real-valued space -- specifically, Rosenbrock ... hmmmm. To me this does strongly suggest that hyperopt is wrong. Somewhere.

ghost commented 4 years ago

Interesting. Yes, I don't recall seeing CMAES mentioned anywhere in the hyperopt code.

yalwan-iqvia commented 4 years ago

After having had a chat with Mikey, he has convinced me that this is a red herring: i.e. the paper's authors were not describing TPE itself ("TPE works this way") but how GP-style optimisers might work.

However, we've agreed that even in the absence of a tree structure over the hyperparameters, the probabilistic modelling shouldn't result in such poor optimisation... still more pontificating to be done.
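
For reference, the per-variable criterion TPE optimises (from Bergstra et al., 2011, "Algorithms for Hyper-Parameter Optimization") is, up to a constant,

EI(x) \propto \left( \gamma + \frac{g(x)}{\ell(x)} (1 - \gamma) \right)^{-1}

where \ell models the density of the best \gamma-fraction of observed points and g models the rest, so candidates are ranked by the ratio \ell(x)/g(x). Nothing in this criterion ties one variable's sample to another's, which is relevant to the discussion below.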

yalwan-iqvia commented 4 years ago

https://en.wikipedia.org/wiki/Test_functions_for_optimization

^ There's a lot of fun to be had here

ghost commented 4 years ago

@yalwan-iqvia Please can you write up your previous findings into a new issue on https://github.com/IQVIA-ML/TreeParzen.jl

yalwan-iqvia commented 4 years ago

An attempt to summarise the findings so far:

It looks like, although it is not explicitly stated, the modelling system from the paper assumes conditional independence of the variables being optimised.

The implementation fits distributions to each variable in a marginal fashion (i.e. univariate fitting). Even siblings from the same parent node in the tree structure (if any) are fit this way. Take a 2D objective (such as Rosenbrock or a parabola) as an example: we fit a distribution to our X component and immediately sample from it. Rather than fitting distributions to X and Y, sampling candidate pairs, and keeping the (X, Y) pair with the best EI, we first select the X with the best EI and then, separately, select the Y with the best EI. Note that the only way to actually capture the interaction would be to model X and Y jointly and sample from that joint distribution (which, I don't deny, is non-trivial).
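
This failure mode can be demonstrated without any TPE machinery at all. The following plain-Julia sketch (illustrative only) keeps the best 1% of uniform random Rosenbrock samples, then recombines their X and Y coordinates independently, the way a purely marginal model would:

using Statistics

rosen2d(x1, x2) = 100 * (x2 - x1^2)^2 + (1 - x1)^2

# Best 1% of uniform random samples: these lie along x2 ≈ x1^2,
# so each coordinate's marginal is individually broad.
pts = [(10 * rand() - 5, 10 * rand() - 5) for _ in 1:10_000]
good = sort(pts; by = p -> rosen2d(p...))[1:100]

# Resample each coordinate from the "good" marginal independently,
# discarding the pairing between X and Y.
xs = [p[1] for p in good]
ys = [p[2] for p in good]
recombined = [(rand(xs), rand(ys)) for _ in 1:100]

println(mean(rosen2d(p...) for p in good))        # low: pairing kept
println(mean(rosen2d(p...) for p in recombined))  # typically far worse: pairing lost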

It seems that this is known; from https://medium.com/criteo-labs/hyper-parameter-optimization-algorithms-2fe447525903: "One of the great drawbacks of tree-structured Parzen estimators is that they do not model interactions between the hyper-parameters. It is thus much less able than Gaussian processes when two hyper-parameters strongly interact."

Rosenbrock is a good example of a function where the parameters interact (not necessarily strongly; it depends on the region of the space): the 100 * (x2 - x1^2)^2 term couples the two variables, and the mixed partial derivative d^2f/dx1dx2 = -400 * x1 grows with |x1|.

In terms of the implementation vs the claims in the paper, apart from the ambiguity about conditional independence of the variables (which isn't explicitly stated, but @mikey-iqml suggests is implied), this seems to be working as intended, both in hyperopt and here. It is worth noting that the paper doesn't seem to claim that a global minimum will be found, so, for example, landing within the valley of the Rosenbrock function is likely to be acceptable (and in the context of hyperparameter search, that is probably good enough).

There is some investigation to be done into some other strange behaviours (I will attempt to summarise those separately, since they were observed a while ago and may no longer be reproducible).

For those wishing to investigate the behaviour on a variety of non-convex functions, the following can be used:

using TreeParzen

space = Dict(:x1 => HP.Uniform(:x1, -5., 5.), :x2 => HP.Uniform(:x2, -5., 5.));

# Each function also has a one-argument Dict method, so that the
# sampled point from fmin can be splatted into keyword arguments.
sphere(;x1=0., x2=0.) = x1^2 + x2^2
sphere(d) = sphere(;d...)
rosen(;x1=1., x2=1.) = 100 * (x2 - x1^2)^2 + (1 - x1)^2
rosen(d) = rosen(;d...)
rastrigin(;x1=0., x2=0.) = 20 + x1^2 - 10 * cos(2 * pi * x1) + x2^2 - 10 * cos(2 * pi * x2)
rastrigin(d) = rastrigin(;d...)
ackley(;x1=0., x2=0.) = -20 * exp(-0.2 * sqrt(0.5 * (x1^2 + x2^2))) - exp(0.5 * (cos(2 * pi * x1) + cos(2 * pi * x2))) + exp(1) + 20
ackley(d) = ackley(;d...)

fmin(ackley, space, 50)
fmin(sphere, space, 50)
fmin(rosen, space, 50)
fmin(rastrigin, space, 50)

I tried a couple of extra things with Rosenbrock, including seeing how badly it can do with just 50 trials over a few thousand fmin attempts:

evaluator(result) = max(abs(1 - result[:x1]), abs(1 - result[:x2]))

function findworst(trials, attempts)

    worst = 0.
    worstres = Dict(:x1 => 1., :x2 => 1.)

    for i in 1:attempts
        fmr = fmin(rosen, space, trials)
        r = evaluator(fmr)
        if r > worst
            worst = r        # track the worst score seen so far
            worstres = fmr
        end
    end

    return worstres

end

println(findworst(50, 5000))

Gives:

Dict{Symbol,Any}(:x2 => 0.3669191434041903,:x1 => 0.6124080254584413)

If we try with 100 optimisation trials:

println(findworst(100, 5000))
Dict{Symbol,Any}(:x2 => 0.6500374886802355,:x1 => 0.8044138837632724)

and if we try with 250 optimisation trials:

println(findworst(250, 5000))
Dict{Symbol,Any}(:x2 => 1.012649888929622,:x1 => 1.0049042342109367)

So we can see that the worst case definitely improves as the number of trials increases.

@tlienart, what do you think?

yalwan-iqvia commented 4 years ago

On the test functions given above, it's well worth noting that although they are standard examples for non-convex optimisers, apart from Rosenbrock they mostly lack strong X,Y dependences. It might be worth hunting a bit more for test functions which aren't just non-convex, e.g. the candidates sketched below.
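
Two cheap candidates to try (the three-hump camel is from the Wikipedia page linked above; the rotated quadratic is an ad-hoc suggestion of mine whose optimum requires coordinated moves along the diagonal):

using TreeParzen

space = Dict(:x1 => HP.Uniform(:x1, -5., 5.), :x2 => HP.Uniform(:x2, -5., 5.));

# Three-hump camel: non-convex, with an explicit x1*x2 coupling term.
camel(;x1=0., x2=0.) = 2 * x1^2 - 1.05 * x1^4 + x1^6 / 6 + x1 * x2 + x2^2
camel(d) = camel(;d...)

# Ad-hoc rotated quadratic: convex, but narrow along x1 = x2, so
# progress requires moving both variables together.
rotated(;x1=0., x2=0.) = (x1 + x2 - 1)^2 + 100 * (x1 - x2)^2
rotated(d) = rotated(;d...)

fmin(camel, space, 100)
fmin(rotated, space, 100)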