fgerzer / apsis


NominalParamDef and BayOpt #117

Closed - craffel closed this 8 years ago

craffel commented 9 years ago

If I'm understanding correctly, NominalParamDef instances are currently not compatible with the 'BayOpt' optimizer. I initially found this out from the following error:

Traceback (most recent call last):
  File "apsis/assistants/experiment_assistant.py", line 128, in get_next_candidate
    self.optimizer.get_next_candidates(self.experiment))
  File "apsis/optimizers/bayesian_optimization.py", line 127, in get_next_candidates
    self._refit(experiment)
  File "apsis/optimizers/bayesian_optimization.py", line 158, in _refit
    warped_in = experiment.warp_pt_in(c.params)
  File "apsis/models/experiment.py", line 278, in warp_pt_in
    warped_in[name] = self.parameter_definitions[name].warp_in(value)
AttributeError: 'NominalParamDef' object has no attribute 'warp_in'

I think this could be handled better - e.g., by checking on initialization of the experiment that all parameters have the warp_in method, and raising an appropriate error if not. Alternatively, if possible, it seems like it'd be worthwhile to provide a ParamDef which implements the simple hack of quantizing a floating-point value.
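For the first suggestion, a minimal sketch (check_warping_support is a hypothetical name; apsis's actual validation may look different):

def check_warping_support(parameter_definitions):
    """Fail fast at experiment creation if any parameter definition
    cannot be warped, instead of an AttributeError mid-optimization."""
    for name, param_def in parameter_definitions.items():
        if not hasattr(param_def, 'warp_in'):
            raise ValueError(
                'Parameter %r (%s) has no warp_in method and is not '
                'supported by the BayOpt optimizer.'
                % (name, type(param_def).__name__))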

craffel commented 9 years ago

Here's a hack for a "quantize-to-list" parameter definition:

import apsis.models.parameter_definition

class QuantizedListParamDef(apsis.models.parameter_definition.NumericParamDef):
    '''
    Defines a parameter which can take one of a finite number of
    enumerable values by quantizing the interval [0, 1] and mapping it
    to indices of the values.
    '''
    def __init__(self, values):
        '''
        Initializes the possible values the parameter can take.

        Parameters
        ----------
        values : list
            List of permissible values.
        '''
        try:
            self.values = list(values)
        except TypeError:
            raise ValueError('values must be a list of possible values.')

    def warp_in(self, value_in):
        # Map the value's index onto [0, 1].
        return self.values.index(value_in)/float(max(len(self.values) - 1, 1))

    def warp_out(self, value_out):
        # Quantize [0, 1] back to an index; the small epsilon keeps
        # warp_out(1.0) from indexing one past the end of the list.
        return self.values[int((value_out - 1e-10)*len(self.values))]

    def is_in_parameter_domain(self, value):
        return value in self.values
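For example, round-tripping a small list with the class above:

act = QuantizedListParamDef(['sigmoid', 'rectifiers', 'tanh'])
print(act.warp_in('rectifiers'))           # 0.5
print(act.warp_out(0.5))                   # 'rectifiers'
print(act.is_in_parameter_domain('tanh'))  # True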

With this hack, I get the following warning:

    WARNING - L-BFGS-B Optimization produced only parameter values thatdo not match the defined parameter range. This indicates that increasing the parameter definition range mightdeliver better results.Using results from RandomSearch.

It appears that the optimizer uses the bounds [0, 1] for all parameters and then tests whether the resulting parameter values are valid. Given that those are the bounds (this appears to be true even for MinMaxNumericParamDef instances whose range is not [0, 1]), it seems like the resulting values should be warped out before being used, but I'm likely just misunderstanding the intended usage. Any tips are appreciated.

fgerzer commented 9 years ago

First of all, thanks for the issue!

Your first issue is a limitation of Bayesian Optimization coupled with an error message we overlooked. This is similar to #111. The function to check for it exists; it just wasn't used. The branch i111_exp_check has a fix for it, and we'll include it in the next release.

It's pretty difficult to warp a categorical parameter definition to something made up of floats; to my knowledge, apart from warping it into an n-dimensional [0, 1] hypercube for n different possible values of the categorical parameter, there's no paper describing a way to do it. The hypercube model is apparently comparatively inefficient, which is why we haven't implemented it. I'll do so for the next release, but I'm currently working on support for multiple workers/clusters, which will still take some time.

As to your hack: there is already a class of parameter definitions you can adapt for this, namely PositionParamDef. PositionParamDef maintains two parallel lists, one of values and one of positions for those values; the warping is done automatically. By defining something like this:

from apsis.models.parameter_definition import PositionParamDef

class EquidistantPositionParamDef(PositionParamDef):
    """
    Extension of PositionParamDef in which the position of each value is
    equidistant from its neighbours, and their order is determined by
    their order in values.
    """
    def __init__(self, values):
        positions = []
        for i, v in enumerate(values):
            # Spread positions evenly over [0, 1]; the max() guards
            # against a division by zero for a single-element list.
            pos = float(i) / max(len(values) - 1, 1)
            positions.append(pos)
        super(EquidistantPositionParamDef, self).__init__(values, positions)

you can then instantiate it with your categorical values. The above should work and is pushed to the branch i_117_equidistant_fix. However, I'm not sure it will actually help you (much) with categorical parameters, because it implies a certain structure in these parameters which may or may not exist. For example, if you use [sigmoid, rectifiers, tanh] as activations for a neural network layer, you will get different results depending on the order in which you define them. This is because the GP will assume that, in the example above, the result of a network with only the activation changed from sigmoid to rectifiers will be closer to the sigmoid result than if you had changed it to tanh. Of course, it's probably still better than random search :-)
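For instance (a small illustration, assuming PositionParamDef stores the two lists as passed):

act = EquidistantPositionParamDef(['sigmoid', 'rectifiers', 'tanh'])
# positions are [0.0, 0.5, 1.0]; reordering values reorders the
# positions, and therefore the distances the GP sees between them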

I hope that the above helped you, and I'd be happy if you could tell us about your experiences with it.

craffel commented 9 years ago

First of all, thanks for the issue!

Thanks for your thorough response!

Your first issue is a limitation of Bayesian Optimization coupled with an error message we overlooked. This is similar to #111. The function to check for it exists; it just wasn't used. The branch i111_exp_check has a fix for it, and we'll include it in the next release.

Great, thanks.

It's pretty difficult to warp a categorical parameter definition to something made up of floats; to my knowledge, apart from warping it into an n-dimensional [0, 1] hypercube for n different possible values of the categorical parameter, there's no paper describing a way to do it. The hypercube model is apparently comparatively inefficient, which is why we haven't implemented it. I'll do so for the next release, but I'm currently working on support for multiple workers/clusters, which will still take some time.

Wouldn't this still require quantization; specifically, quantization around .5 for each hypercube dimension (or something)?

As to your hack: there is already a class of parameter definitions you can adapt for this, namely PositionParamDef. PositionParamDef maintains two parallel lists, one of values and one of positions for those values; the warping is done automatically. ... The above should work and is pushed to the branch i_117_equidistant_fix.

Cool, I will try it now. As a side note, from a development standpoint, it would be cool if all of the "stable, to be included in the next release" branches were merged into a single "development" branch, so it was easy to use e.g. i111_exp_check and i_117_equidistant_fix at the same time.

However, I'm not sure it will actually help you (much) with categorical parameters, because it implies a certain structure in these parameters which may or may not exist. For example, if you use [sigmoid, rectifiers, tanh] as activations for a neural network layer, you will get different results depending on the order in which you define them. This is because the GP will assume that, in the example above, the result of a network with only the activation changed from sigmoid to rectifiers will be closer to the sigmoid result than if you had changed it to tanh. Of course, it's probably still better than random search :-)

From a high-level view, I'm not clear on why order is important - it seems to me that the GP wouldn't care about order, as it's not assuming any sort of consistent correlation between a change in hyperparameter values and objective values, right?

Thanks for your responses!

craffel commented 9 years ago

For what it's worth, I still get the warning

    WARNING - L-BFGS-B Optimization produced only parameter values thatdo not match the defined parameter range. This indicates that increasing the parameter definition range mightdeliver better results.Using results from RandomSearch.

when using EquidistantPositionParamDef. As before, L-BFGS-B is constrained to the interval [0., 1.] for all parameters, and the is_in_parameter_domain check for EquidistantPositionParamDef is just the one inherited from NominalParamDef:

    def is_in_parameter_domain(self, value):
        """
        Tests whether value is in self.values as defined during the init
        function.
        """
        return value in self.values

which is typically false, because self.values in EquidistantPositionParamDef is just a list of arbitrary values which may or may not contain the warped values from [0., 1.]. Any ideas?

fgerzer commented 9 years ago

Wouldn't this still require quantization; specifically, quantization around .5 for each hypercube dimension (or something)?

Not quite. If you are using Automatic Relevance Determination (ARD; see, for example, Rasmussen and Williams, Gaussian Processes for Machine Learning, p. 106, an online version of which is available) - which apsis does by default - you can learn different length scales for the different parameters.

Suppose, for example, that you have three categorical values, A, B and C. A is encoded by [1, 0, 0], B by [0, 1, 0] and C by [0, 0, 1]. Gaussian processes define similarity via a distance metric; let's just use Manhattan distance here. Without modification of the distances, all distances are 2.

Using ARD, you can now learn a distance factor per dimension. In the above, maybe this factor is 0.25, 0.5, 3 for the three dimensions. This means our learned distances are now AB = 0.75, BC = 3.5 and AC = 3.25 - and suddenly, you have learned that exchanging A for B keeps the result closer to the original than exchanging C for B. There are - apparently - still some shortcomings to this, mostly due to the difficulty the GP has in adequately exploring the corners of the hypercube (which is where most of the interesting stuff happens with this method).
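To make the arithmetic concrete (plain numpy, reproducing the numbers above):

import numpy as np

A = np.array([1., 0., 0.])
B = np.array([0., 1., 0.])
C = np.array([0., 0., 1.])
w = np.array([0.25, 0.5, 3.0])   # learned per-dimension factors

def weighted_manhattan(x, y):
    # Manhattan distance with one ARD factor per dimension
    return float(np.sum(w * np.abs(x - y)))

print(weighted_manhattan(A, B))  # 0.75
print(weighted_manhattan(B, C))  # 3.5
print(weighted_manhattan(A, C))  # 3.25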

From a high-level view, I'm not clear on why order is important - it seems to me that the GP wouldn't care about order, as it's not assuming any sort of consistent correlation between a change in hyperparameter values and objective values, right?

Correct, but it does care about the distance between the points. For example, assume you have three points X, Y and Z. You know the results for X and Z, and want to predict the result of Y. If the distance between X and Y is less than the distance between Y and Z, the predicted result for Y is going to be closer to the result of X than to the result of Z. To visualize this, see the figure bay_opt_single2, which you can also find on page 2 of the paper: the red point's prediction depends more on the results of the three points to its right than on the three points to its left. (I just noticed that this might pose problems for red/green colour blindness; the red point is the middle one, at ca. 0.7.)

This is also why projecting categorical data onto a single [0, 1] axis may pose a problem. If you project the classes from our example above as [A, B, C] --> [0, 0.5, 1], you implicitly tell the GP that AB = BC = 0.5 * AC, whereas the distances learned above give BC = 4.66 * AB and AC = 4.33 * AB. And had you changed the order to [A, C, B] --> [0, 0.5, 1], you'd have AC = CB = 0.5 * AB, which is pretty much the opposite of the distances learned above.
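The same arithmetic for the single-axis projection:

pos = {'A': 0.0, 'B': 0.5, 'C': 1.0}   # [A, B, C] --> [0, 0.5, 1]
print(abs(pos['A'] - pos['B']))  # AB = 0.5
print(abs(pos['B'] - pos['C']))  # BC = 0.5
print(abs(pos['A'] - pos['C']))  # AC = 1.0, i.e. AB = BC = 0.5 * AC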

Cool, I will try it now. As a side note, from a development standpoint, it would be cool if all of the "stable, to be included in the next release" branches were merged into a single "development" branch, so it was easy to use e.g. i111_exp_check and i_117_equidistant_fix at the same time.

That's a very good point. I've created it.

For what it's worth, I still get the warning

    WARNING - L-BFGS-B Optimization produced only parameter values thatdo not match the defined parameter range. This indicates that increasing the parameter definition range mightdeliver better results.Using results from RandomSearch.

when using EquidistantPositionParamDef. As before, L-BFGS-B is constrained to the interval [0., 1.] for all parameters, and the is_in_parameter_domain check for EquidistantPositionParamDef is just the one inherited from NominalParamDef:

    def is_in_parameter_domain(self, value):
        """
        Tests whether value is in self.values as defined during the init
        function.
        """
        return value in self.values

which is typically false, because self.values in EquidistantPositionParamDef is just a list of arbitrary values which may or may not contain the warped values from [0., 1.]. Any ideas?

I've looked through the code again, and this is a pretty relevant bug you've detected. I've corrected it.

Some more details, if you're interested. Currently, the optimization of the acquisition function works like this: it generates a number of random points equal to initial_random_runs, then (unless you've chosen a random optimizer) optimizes with these as its starting points. Remember that we can sample the acquisition function several orders of magnitude faster than the real objective function (significantly less than a second, compared to minutes, hours or days for the bigger machine learning problems).

Afterwards, for each of the generated samples, we check whether it's a correct sample. If it isn't, we add the random sample from which the optimization originated instead.
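A rough sketch of that scheme (illustrative only, with a toy acquisition function; the actual apsis code differs in detail):

import numpy as np
from scipy.optimize import minimize

def optimize_acquisition(acq, dim, initial_random_runs=10, rng=np.random):
    candidates = []
    for _ in range(initial_random_runs):
        start = rng.uniform(0.0, 1.0, dim)        # random starting point
        res = minimize(lambda x: -acq(x), start,  # maximize acq
                       method='L-BFGS-B', bounds=[(0.0, 1.0)] * dim)
        # Keep the optimized point if it is a valid sample; otherwise
        # fall back to the random start it originated from.
        if np.all(res.x >= 0.0) and np.all(res.x <= 1.0):
            candidates.append(res.x)
        else:
            candidates.append(start)
    return candidates

# e.g. with a toy acquisition function peaked at 0.3:
print(optimize_acquisition(lambda x: -np.sum((x - 0.3) ** 2), dim=2))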

This check was wrongly implemented. Instead of checking whether the generated parameter vector was in [0, 1]^N (we work only with a [0, 1] hypercube in AcquisitionFunction), we checked whether each of the generated parameter values was an acceptable value for its parameter. So, for example, for a MinMaxNumericParamDef instance whose value could be between [0, 0.5], we'd have checked whether the internal hypercube value was between [0, 0.5], which means we'd only have used L-BFGS-optimized values when the parameter (after warping out) was between [0, 0.25].

This would have resulted (especially with things like DistanceBasedParamDef) in the optimizer only using random optimization on the acquisition function instead of L-BFGS. That's worse than using a real optimizer, but probably still better than RandomSearch, which is why we only logged a warning instead of raising an exception.

Instead, we're now checking whether the vector is actually in [0, 1]^N, and I've clarified the is_in_parameter_domain docstring to clearly state that it should only work with original, non-warped parameter values.
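A self-contained sketch of the corrected check (in_unit_hypercube is an illustrative name):

import numpy as np

def in_unit_hypercube(warped_vector):
    # Validate the *warped* vector against [0, 1]^N directly, instead of
    # passing warped floats to is_in_parameter_domain, which expects
    # original, unwarped parameter values.
    v = np.asarray(warped_vector, dtype=float)
    return bool(np.all(v >= 0.0) and np.all(v <= 1.0))

print(in_unit_hypercube([0.2, 0.9]))  # True
print(in_unit_hypercube([0.2, 1.3]))  # False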

craffel commented 9 years ago

Correct, but it does care about the distance between the points.

Ah, thanks for your thorough explanations - I see now why the hypercube method makes a lot more sense in this setting. In that case, I may take a hack at implementing that, as all of my discrete parameters only have 3 or 4 values.

That's a very good point. I've created it.

Thanks!

I've looked through the code again, and this is a pretty relevant bug you've detected. I've corrected it. ... Instead, we're now checking whether the vector is actually in [0, 1]^N, and I've clarified the is_in_parameter_domain docstring to clearly state that it should only work with original, non-warped parameter values.

Thanks - this is what I had suspected was happening. Will give everything a go again.

fgerzer commented 9 years ago

In that case, I may take a hack at implementing that, as all of my discrete parameters only have 3 or 4 values.

You've given me the motivation to implement it myself. Look into the i_117_nominal_hypercube branch. I've not yet merged it into dev because I had no chance to test it on nominal parameters, but all the other tests still pass. If it works for you, I'll write a few more tests and then merge it to dev.

The main difference is that now, instead of warping every parameter to just [0, 1], we warp it to [0, 1]^N, where N is given in the param_def by the warped_size function. This is (usually) 1, except for, for example, NominalParamDef.
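Conceptually, the hypercube warping for a nominal parameter looks something like this (a minimal standalone sketch of the scheme described above, not apsis's actual code):

class OneHotNominalWarp(object):
    """Sketch: warp a nominal parameter into [0, 1]^N via one-hot encoding."""
    def __init__(self, values):
        self.values = list(values)

    def warped_size(self):
        # One hypercube dimension per possible value.
        return len(self.values)

    def warp_in(self, value):
        # One-hot encode: the chosen value's dimension is 1, all others 0.
        vec = [0.0] * len(self.values)
        vec[self.values.index(value)] = 1.0
        return vec

    def warp_out(self, warped):
        # Decode by picking the dimension with the largest value.
        warped = list(warped)
        return self.values[warped.index(max(warped))]

act = OneHotNominalWarp(['sigmoid', 'rectifiers', 'tanh'])
print(act.warp_in('tanh'))            # [0.0, 0.0, 1.0]
print(act.warp_out([0.2, 0.7, 0.1]))  # 'rectifiers'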

craffel commented 9 years ago

Sorry, I'm not seeing a ParamDef in the i_117_nominal_hypercube branch which warps onto the [0, 1]^N hypercube - can you point me to the right one to try? Thanks.

fgerzer commented 9 years ago

Any ParamDef does so now. Granted, most of them still map to [0, 1]^1 as before.

All classes inheriting from ParamDef must implement the warp_in, warp_out and warped_size functions (see lines 47-89). The main change in practice is to NominalParamDef and all classes inheriting from it - in your case, simply using NominalParamDef should work; the new warping is done internally.

craffel commented 9 years ago

Ok, trying it out, and I'm getting an error because NominalParamDef calls warped_value.index in warp_out, and warp_out is called by RandomSearch's _gen_param_val method with a np.ndarray. A simple fix would be to always cast warped_value to a list in warp_out (a sketch follows the traceback below).

  File "apsis/assistants/experiment_assistant.py", line 128, in get_next_candidate
    self.optimizer.get_next_candidates(self.experiment))
  File "apsis/optimizers/bayesian_optimization.py", line 130, in get_next_candidates
    return self.random_searcher.get_next_candidates(experiment, num_candidates)
  File "apsis/optimizers/random_search.py", line 50, in get_next_candidates
    list.append(self._get_one_candidate(experiment))
  File "apsis/optimizers/random_search.py", line 57, in _get_one_candidate
    value_dict[key] = self._gen_param_val(param_def)
  File "apsis/optimizers/random_search.py", line 77, in _gen_param_val
    return param_def.warp_out(self.random_state.uniform(0, 1, param_def.warped_size()))
  File "apsis/models/parameter_definition.py", line 181, in warp_out
    return self.values[warped_value.index(max(warped_value))]
AttributeError: 'numpy.ndarray' object has no attribute 'index'
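
A minimal sketch of that cast, as a free function (warp_out_fixed is illustrative, not the actual patch):

import numpy as np

def warp_out_fixed(values, warped_value):
    # Cast to list first: np.ndarray has no .index method, but the
    # list conversion works for both lists and arrays.
    warped_value = list(warped_value)
    return values[warped_value.index(max(warped_value))]

print(warp_out_fixed(['sigmoid', 'rectifiers', 'tanh'],
                     np.array([0.1, 0.8, 0.1])))  # 'rectifiers'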
fgerzer commented 9 years ago

I've both changed random search's call and added the conversion in NominalParamDef.

Also, since I've now written some tests for it, I've fixed the creation of the L-BFGS-B bounds, which had a bug occurring only with parameter definitions whose warped_size was not 1, and fixed AsymptoticParamDef's warping.

Sorry for not testing for those before; I wanted to get you the changes yesterday. All unit tests are working now, and include NominalParamDef for BayOpt.

craffel commented 9 years ago

Seems like it's working to me! Thanks for all your help!

fgerzer commented 9 years ago

Awesome! How's your experience with it?

craffel commented 9 years ago

I've been running random hyperparameter search for a while and now have your BayOpt version working, so I'll be switching over to it soon. I had previously been using Spearmint, and although it worked better than random search on this problem, using it was a huge pain, and it seems unwise to rely on software which isn't actively developed and has a non-permissive license... so, I look forward to re-running with your BayOpt version!