bayesian-optimization / BayesianOptimization

A Python implementation of global optimization with gaussian processes.
https://bayesian-optimization.github.io/BayesianOptimization/index.html
MIT License

How to deal with categorical parameters #191

Open jiwidi opened 4 years ago

jiwidi commented 4 years ago

Hi! I was reading the README where it says:

Take a look at the advanced tour notebook to learn how to make the package more flexible, how to deal with categorical parameters, how to use observers, and more.

But I can't find any reference to how to use categorical parameters in the notebook. Could someone shed some light on this?

Thanks

raff7 commented 4 years ago

Hi, the advanced notebook suggests a very basic approach for discrete variables that doesn't work very well. Dealing with categorical variables is even more complicated. I'm currently implementing it myself and found a way that works extremely well. Basically, for every categorical variable you create n new parameters (where n is the number of possible categories for that variable). You then use these parameters as a one-hot encoding: when evaluating the function, you check which of the n new parameters is highest and use the corresponding category.

To make the GP work properly, you will also have to create a custom kernel that inherits from the one used by default. The only difference is that when computing the covariance of two points, for every categorical variable you set the dummy variable with the maximum value to 1 and all other dummy variables to 0.

A similar approach can be used for discrete numerical variables: by rounding the number within the kernel function you get a step-shaped Gaussian process with much better properties than the approach suggested in the advanced notebook.

amapic commented 4 years ago

@raff7 Hi, thanks for sharing. Can you share your code for that? I tried the method from the advanced notebook and it doesn't seem to work. From what I saw, the algorithm fails to pick the right continuous values when some variables are categorical.

leifvan commented 4 years ago

Hi @amapic, I experimented with @raff7's suggestion myself and am glad to share a code example. If I understand it correctly, the goal is to give the kernel function the actual values that the function-to-optimize will use, so the correct covariances are calculated. It should suffice to subclass the kernel function and wrap it with the corresponding code.

The BayesianOptimization class uses the sklearn.gaussian_process.kernels.Matern kernel by default, so you could wrap it like this:

from collections import defaultdict

import numpy as np
from sklearn.gaussian_process.kernels import Matern


class WrappedMatern(Matern):
  def __init__(self, param_types, length_scale=1.0, length_scale_bounds=(1e-5, 1e5), nu=1.5):
    self.param_types = param_types
    super(WrappedMatern, self).__init__(length_scale, length_scale_bounds, nu)

  def __call__(self, X, Y=None, eval_gradient=False):
    X = X.copy()  # copy X, just to be sure

    # we collect the positions of the categorical variables in this dict
    categorical_group_indices = defaultdict(list)

    for i, ptype in enumerate(self.param_types):
      if ptype == 'continuous':
        pass  # do nothing
      elif ptype == 'discrete':
        X[:, i] = np.round(X[:, i])
      else:
        categorical_group_indices[ptype].append(i)

    # within each categorical group, set the largest dummy variable to 1 and the rest to 0
    for indices in categorical_group_indices.values():
      max_col = np.argmax(X[:, indices], axis=1)
      X[:, indices] = 0
      X[np.arange(X.shape[0]), np.asarray(indices)[max_col]] = 1

    return super(WrappedMatern, self).__call__(X, Y, eval_gradient)

When creating the optimizer instance, you have to replace the _gp variable with a new sklearn.gaussian_process.GaussianProcessRegressor and, while you're at it, increase the alpha value a bit so the process handles discontinuities better. Here is a toy example with a simple target function:

from bayes_opt import BayesianOptimization
from sklearn.gaussian_process import GaussianProcessRegressor


def f(select_a, select_b, x):
  x = int(x)
  if select_a > select_b:
    return -(x**2 + 2*x + 2)  # has maximum -1 at x=-1
  else:
    return -x**2  # has maximum 0 at x=0


modified_kernel = WrappedMatern(param_types=(0, 0, 'discrete'), nu=2.5)

bo = BayesianOptimization(f, pbounds=dict(select_a=(0, 1),
                                          select_b=(0, 1),
                                          x=(-100, 100)))

bo._gp = GaussianProcessRegressor(kernel=modified_kernel,
                                  alpha=1e-2,  # raised from the default 1e-6
                                  normalize_y=True,
                                  n_restarts_optimizer=5,
                                  random_state=bo._random_state)

Here, select_a and select_b form a one-hot encoded categorical variable selecting between two different output functions (for more than two values you would have to replace the simple if clause with something more sophisticated; see the sketch below). The WrappedMatern constructor gets a tuple containing the type of each parameter (positionally): 'discrete', 'continuous', or an identifier that is equal for all parameters belonging to the same categorical variable. In this case, to group select_a and select_b together, the first two values of the tuple are 0.
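
For more than two categories, the decoding inside the target function could roughly look like this (a sketch with made-up category names and branch functions; the matching kernel would get something like param_types=('c', 'c', 'c', 'discrete')):

import numpy as np

def g(cat_a, cat_b, cat_c, x):
  # decode the one-hot helper parameters: the largest one selects the category
  category = ('A', 'B', 'C')[int(np.argmax([cat_a, cat_b, cat_c]))]
  x = int(x)
  if category == 'A':
    return -(x - 1)**2
  elif category == 'B':
    return -(x + 1)**2 - 1
  else:
    return -x**2 - 2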

Now you can run bo.maximize() and enjoy using categorical and discrete parameters!

ben-arnao commented 4 years ago

I posted about this in another repo

https://github.com/keras-team/keras-tuner/issues/287

I think there is an issue with simply rounding to discrete values due to the nature of BO. Since categorical parameters are unordered, there can be an unwanted bias in parameter selection simply due to the order we supply our parameters in.

When we fit the GPR, the prior will be higher for values next to a maximum. While this is fine for an ordered space, for an unordered parameter space this means that some categorical values will by default be seen as higher-priority search areas, all else being equal.

To solve this, we simply randomize the parameter order before each fit, and then use the current iteration's randomized order when selecting from the acquisition function. This ensures that we can still use categorical variables in BO, but no value has an unearned bias in priority.
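
To make the idea concrete, here is a minimal, stripped-down sketch (not code from keras-tuner or this repo): a single categorical parameter, a plain sklearn GP, and a greedy mean-based stand-in for the acquisition function; the category names and scores are made up.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

categories = ['adam', 'sgd', 'rmsprop']             # hypothetical choices
scores = {'adam': 1.0, 'sgd': 0.3, 'rmsprop': 0.6}  # hypothetical objective values

observed_cats, observed_y = [], []
rng = np.random.default_rng(0)

for it in range(10):
    # re-shuffle the category -> index mapping before every fit
    perm = rng.permutation(len(categories))
    index_of = {categories[perm[i]]: i for i in range(len(categories))}

    gp = GaussianProcessRegressor()
    if observed_cats:
        X = np.array([[index_of[c]] for c in observed_cats], dtype=float)
        gp.fit(X, np.array(observed_y))

    # toy "acquisition": pick the candidate index with the highest posterior mean
    candidates = np.arange(len(categories), dtype=float).reshape(-1, 1)
    mean = gp.predict(candidates) if observed_cats else np.zeros(len(categories))
    best_idx = int(np.argmax(mean))

    # translate the chosen index back through *this* iteration's ordering
    chosen = categories[perm[best_idx]]
    observed_cats.append(chosen)
    observed_y.append(scores[chosen])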

leifvan commented 4 years ago

I think there is an issue with simply rounding to discrete values due to the nature of BO. Since categorical parameters are unordered, there can be an unwanted bias in parameter selection simply due to the order we supply our parameters in.

The rounding above only applies to numerical variables that are limited to integer values. The idea for categorical variables is to one-hot encode them, meaning if you have a variable with categories {'A', 'B', 'C'}, you create 3 helper variables each with values in the real interval [0,1] and set the highest one to 1 and the others to 0. If you think of it as a vector, the possible states (1,0,0), (0,1,0), (0,0,1) are all equidistant and no ordering is implied. This holds true for any number of categories.
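
A quick way to see the equidistance numerically (plain numpy/scipy, not code from this package):

import numpy as np
from scipy.spatial.distance import pdist

corners = np.eye(3)    # the one-hot states (1,0,0), (0,1,0), (0,0,1)
print(pdist(corners))  # all three pairwise distances are sqrt(2), i.e. equal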

Here is a paper by Garrido-Merchán and Hernández-Lobato supporting the approach @raff7 described. @ben-arnao, I am not sure if/how your approach is affected by the problems described in this paper. Isn't the randomization just causing noise in the acquisition function, s.t. the BO cannot make any prediction w.r.t. the categorical parameters?

ben-arnao commented 4 years ago

@leifvan Hmm, I see. Are the helper variables separate dimensions when it comes to fitting the GPR? They would need to be in order to avoid the ordering bias I am discussing.

I think this approach could work, but how would it affect BO from a dimensionality standpoint? It is known that BO breaks down in higher-dimensional spaces, and this technique could quickly push the problem into more dimensions than BO is recommended for.

amapic commented 4 years ago

Hi @leifvan @ben-arnao, thanks for your answers, it helped me a lot.

amapic commented 4 years ago

@leifvan Just to be sure, do you think this approach has downsides? I don't have enough knowledge myself to judge.

raff7 commented 4 years ago

@amapic I can't think of any downside of rounding the variable within the kernel, and from my experiments I can say with a decent amount of confidence that this method is the best way to deal with discrete numerical variables. I see only one problem in @leifvan's code: it assumes there is only one categorical variable, as I don't see any mechanism to distinguish two different categorical variables. I can share my code, but keep in mind that I modified the source code quite a bit, so some of it might not make much sense relative to the original code. This is my kernel:

from bisect import bisect_left

import GPy
import numpy as np


class augKernel(GPy.kern.Matern52):
    def __init__(self, input_dim, variance=1., ARD=False, space=None):
        self.space = space
        super(augKernel, self).__init__(input_dim=input_dim, variance=variance, ARD=ARD)

    def K(self, X, X2=None):
        # work on copies so the arrays held by the model are not modified in place
        X = X.copy()
        if X2 is not None:
            X2 = X2.copy()

        # normalize categorical groups: set the largest dummy variable to 1, all others to 0
        catIDs = self.space.get_categorical_index_in_model()
        for cat in catIDs:
            am = np.argmax(X[:, cat], axis=1)
            X[:, cat] = np.zeros(X[:, cat].shape)
            for i, a in enumerate(am):
                n = np.zeros(X[i, cat].shape)
                n[a] = 1
                X[i, cat] = n

        if X2 is not None:
            for cat in catIDs:
                am = np.argmax(X2[:, cat], axis=1)
                X2[:, cat] = np.zeros(X2[:, cat].shape)
                for i, a in enumerate(am):
                    n = np.zeros(X2[i, cat].shape)
                    n[a] = 1
                    X2[i, cat] = n

        # snap discrete variables to the closest allowed value
        discIDs, values = self.space.get_discrete_index_in_model()
        for disc, val in zip(discIDs, values):
            for Xid in range(len(X)):
                X[Xid][disc] = self.take_closest(val, X[Xid][disc])
        if X2 is not None:
            for disc, val in zip(discIDs, values):
                for Xid in range(len(X2)):
                    X2[Xid][disc] = self.take_closest(val, X2[Xid][disc])

        return super().K(X, X2)

    def Gram_matrix(self, F, F1, F2, F3, lower, upper):
        raise NotImplementedError()

    def take_closest(self, myList, myNumber):
        """
        Assumes myList is sorted. Returns closest value to myNumber.

        If two numbers are equally close, return the smallest number.
        """
        pos = bisect_left(myList, myNumber)
        if pos == 0:
            return myList[0]
        if pos == len(myList):
            return myList[-1]
        before = myList[pos - 1]
        after = myList[pos]
        if after - myNumber < myNumber - before:
            return after
        else:
            return before

P.S.

I don't use round because sometimes the discrete numerical variables are not as simple as consecutive integers: say you have to decide between a batch size of 8, 16 and 32. Instead, I just snap to the closest of all the possible values the variable can take.

raff7 commented 4 years ago

@leifvan Hmm, I see. Are the helper variables separate dimensions when it comes to fitting the GPR? They would need to be in order to avoid the ordering bias I am discussing.

I think this approach could work, but how would it affect BO from a dimensionality standpoint? It is known that BO breaks down in higher-dimensional spaces, and this technique could quickly push the problem into more dimensions than BO is recommended for.

@ben-arnao Yes, the limitation of this method is indeed the extra dimensions it creates. However, it is the only method that avoids an order bias in categorical variables. As @leifvan explained, your randomization method would probably not help much, and it would add a lot of noise for the categorical variables, making them basically useless.

ben-arnao commented 4 years ago

@leifvan Hmm, I see. Are the helper variables separate dimensions when it comes to fitting the GPR? They would need to be in order to avoid the ordering bias I am discussing. I think this approach could work, but how would it affect BO from a dimensionality standpoint? It is known that BO breaks down in higher-dimensional spaces, and this technique could quickly push the problem into more dimensions than BO is recommended for.

@ben-arnao Yes, the limitation of this method is indeed the extra dimensions it creates. However, it is the only method that avoids an order bias in categorical variables. As @leifvan explained, your randomization method would probably not help much, and it would add a lot of noise for the categorical variables, making them basically useless.

I wouldn't say it would make them useless, but the performance would probably be affected to some degree. It would be interesting to run some tests to see how much the performance degrades, with random optimization and one-hot encoded params as baselines.

In an ideal world, one-hot encoding categoricals would be best, but with many real-life BO problems dimensionality is a serious consideration.

I do think it is a tradeoff, and there is a point at which one will outweigh the other.

raff7 commented 4 years ago

@leifvan Hmm, I see. Are the helper variables separate dimensions when it comes to fitting the GPR? They would need to be in order to avoid the ordering bias I am discussing. I think this approach could work, but how would it affect BO from a dimensionality standpoint? It is known that BO breaks down in higher-dimensional spaces, and this technique could quickly push the problem into more dimensions than BO is recommended for.

@ben-arnao Yes, the limitation of this method is indeed the extra dimensions it creates. However, it is the only method that avoids an order bias in categorical variables. As @leifvan explained, your randomization method would probably not help much, and it would add a lot of noise for the categorical variables, making them basically useless.

I wouldn't say it would make them useless, but the performance would probably be affected to some degree. It would be interesting to run some tests to see how much the performance degrades, with random optimization and one-hot encoded params as baselines.

In an ideal world, one-hot encoding categoricals would be best, but with many real-life BO problems dimensionality is a serious consideration.

I do think it is a tradeoff, and there is a point at which one will outweigh the other.

@ben-arnao Yes, you are right, these are just speculations as I didn't test it; the only way to know for sure is to try. However, I would expect the randomization to degrade performance even more than the "order bias" would, so I would suggest comparing your randomization method with the simple baseline of treating categorical variables as plain numerical variables, just to make sure it isn't worse than doing nothing... Let us know if you do end up experimenting with this because it's very interesting!

leifvan commented 4 years ago

Multiple categorical parameters

I see only one problem in @leifvan's code: it assumes there is only one categorical variable, as I don't see any mechanism to distinguish two different categorical variables

I glossed over that part as my code was only intended as a proof-of-concept, but there is a mechanism for multiple categorical parameters. They are identified by some hashable value, e.g. in the following line from my example above

modified_kernel = WrappedMatern(param_types=(0,0,'discrete'), nu=2.5)

the param_types parameter tells the kernel that the first two parameters of the target function f belong together, as they have the same identifier 0. Any hashable value except 'discrete' and 'continuous' can be used; e.g., if we have two categorical parameters with 3 choices each, we could write

WrappedMatern(param_types=('a','a','a',7,7,7))

I tested my code again yesterday and there is a breaking error in there. If provided, the Y parameter has to be transformed, too. Otherwise the calculated covariances do not make much sense. @amapic, in case you actually use it: I put the updated code in this gist.
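
Roughly, the fix amounts to factoring the transformation into a helper and applying it to Y as well whenever it is given. Here is a sketch of that (the gist is the authoritative version):

from collections import defaultdict

import numpy as np
from sklearn.gaussian_process.kernels import Matern


class WrappedMatern(Matern):
  def __init__(self, param_types, length_scale=1.0, length_scale_bounds=(1e-5, 1e5), nu=1.5):
    self.param_types = param_types
    super(WrappedMatern, self).__init__(length_scale, length_scale_bounds, nu)

  def _transform(self, X):
    # round discrete columns and one-hot the categorical groups, as before
    X = X.copy()
    groups = defaultdict(list)
    for i, ptype in enumerate(self.param_types):
      if ptype == 'discrete':
        X[:, i] = np.round(X[:, i])
      elif ptype != 'continuous':
        groups[ptype].append(i)
    for indices in groups.values():
      max_col = np.argmax(X[:, indices], axis=1)
      X[:, indices] = 0
      X[np.arange(X.shape[0]), np.asarray(indices)[max_col]] = 1
    return X

  def __call__(self, X, Y=None, eval_gradient=False):
    X = self._transform(X)
    if Y is not None:
      Y = self._transform(Y)  # the missing piece: Y has to be transformed, too
    return super(WrappedMatern, self).__call__(X, Y, eval_gradient)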

Limitations of the one-hot encoding

@ben-arnao Yes, the limitation of this method is indeed the extra dimensions it creates. However, it is the only method that avoids an order bias in categorical variables. As @leifvan explained, your randomization method would probably not help much, and it would add a lot of noise for the categorical variables, making them basically useless.

I agree with you, @raff7. One additional problem that Nguyen et al. point out in their recent work is that both rounding and setting the covariances of categorical variables to 0 or 1

makes the resulting acquisition function step-wise, which is hard to optimize.

Regarding the discussion of the shuffling approach

I'm not sure if I 100% understood how your approach works, @ben-arnao, but I will try to explain by example why I don't think it provides a benefit. Suppose we have a categorical variable with three choices [A,B,C] with f(A)=100, f(B)=10 and f(C)=0. We represent the variable by a float x with values between 0 and 3 and floor it to select one of the choices.

The BO chooses the next value as the one with the highest utility, based on past observations. In the beginning, all choices have equal utility, so let's say it picks x1=1.3, which results in selecting B. It measures the target value y1=f(1.3)=10. Now we shuffle the categories and e.g. get [B,A,C]. The acquisition function still expects y1 at the location x1. Say it selects x2=1.8 and gets a different value y2=f(1.8)=100, which translates to selecting A. The acquisition function now expects a high variance of values between x1 and x2 and might be inclined to choose from there again. But there is no point in knowing that, as a value between x1 and x2 can lead to any of the categories in the next step.

My intuition tells me that this leads to a completely random choice of x, and I don't see how the BO can make use of the shuffling. That is not to say that this isn't a sound approach - maybe a randomized parameter search works well for categorical variables. But if your approach is meant as I described in the example, I think we don't need to include that in the BO but can randomize the value in another place.

Let us know if you do end up experimenting with this because it's very interesting!

+1 for that! In the meantime, the paper by Nguyen et al. I referenced above contains experiments using synthetic & real-world data where they compare a naive one-hot encoding with the kernel rounding method (called "Merchan-Lobato" in the paper). The approach they propose also sounds promising, so if we really start benchmarking our suggestions it might be interesting to see how their approach stacks up.