GPflow / GPflowOpt

Bayesian Optimization using GPflow

Batch Optimization #71

Open nrontsis opened 7 years ago

nrontsis commented 7 years ago

Hi,

I am really interested in the project and, since I am working on Batch Bayesian Optimisation, I would be keen on extending GPflowOpt for it.

Would you be interested in discussion and pull requests for this topic?

Thanks!

Nikitas

gpfins commented 7 years ago

We have been talking about batch optimization internally and think it would certainly be interesting to have (although not for the first version) and of course you are welcome to participate. Did you have any specific acquisition function in mind?

javdrher commented 7 years ago

Hi @nrontsis thank you for your interest in GPflowOpt!

Batch optimization is definitely on our list. In fact, I have some local code and some preliminary tests, but it is not suited for a PR at all, and I wasn't entirely sure all the choices I made were the right ones. So I would be very interested to get a discussion going and work towards extending the framework; if you are willing to contribute, that's great. Depending on the outcome of the discussion, the code could be used as a starting point if we decide that is the right way to go. If it is OK with you, I'll initiate the discussion, but feel free to steer it in the right direction.

A fundamental philosophy for GPflowOpt is that we want to provide a framework that allows easy implementation of BO strategies, rather than implementing a broad range of algorithms. When it comes to batch optimization, there seems to be a fundamental split between algorithms that select a batch of N points sequentially (e.g. LP) and those that optimize the entire batch at once (e.g., q-KG, batch PES, etc.). Both approaches have quite different requirements in terms of implementation, so I have been in doubt about what would be the best road for GPflowOpt. The latter seems closer to optimal, but I don't know how feasible it is, as the optimization of the acquisition function can become high-dimensional. I would be very interested to hear your thoughts on this matter.

nrontsis commented 7 years ago

Thanks @nknudde @javdrher for the quick replies!

I believe that both approaches can be easily incorporated in the current framework.

Sequential optimisation is generally easier and would be the vanilla choice for the practitioner. Moreover, the ideas of e.g. qEI-CL or LP can be applied to any acquisition function, allowing for a general solution. As a result, we could start by implementing sequential optimisation first and then move to the "true" batch case, which has the challenges @javdrher mentioned (e.g. it is hard to optimise).

But either way, the first step would be to extend the framework to consider the acquisition function f(X) to be multipoint, i.e. accept multiple points at the same time. I think this is easy and would require minimal, clean changes.

What do you think? :)

javdrher commented 7 years ago

I agree that supporting both is possible and should fit within the design.

Regarding your last remark about acquisition objects becoming multi-point: I believe this is actually a requirement for the joint batch optimization rather than for the sequential optimization? I think I already covered that extension a while ago locally; are the changes in batch_support what you were thinking of? It replicates the domain for the optimizer, such that the candidate dimension becomes batch_size * dimension; the candidate is then split again so it is easier to use in Acquisition. If so, we could proceed by deciding on an acquisition to add to the framework itself and start with batch optimization itself.
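To make that concrete, the shape bookkeeping looks roughly like this (illustrative only, not the actual batch_support code; the TF1-style placeholder is just for the example):

import tensorflow as tf

batch_size, dim = 3, 2

# the optimizer works on the replicated domain: one candidate row has batch_size * dim entries
Xcand = tf.placeholder(tf.float64, shape=[None, batch_size * dim])

# before reaching build_acquisition, each row is split back into batch_size blocks
# of shape [n_candidates, dim], one block per point in the batch
X_blocks = tf.split(Xcand, num_or_size_splits=batch_size, axis=1)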

For sequential optimization, I don't think this belongs in the optimizer, for a couple of reasons. Right now an optimizer is simply something that optimizes some function over a domain and nothing more. Making it accept only acquisition objects would change that completely and make optimizers acquisition-specific. It would also make them less widely applicable (e.g., in LP you can currently use an optimizer to obtain the Lipschitz estimate by optimizing the gradient norm over the domain). Additionally, what about the other optimizers (MC, candidates, BayesianOptimizer itself)? Finally, together with some of the changes we are planning on the domain to support discrete parameters, I believe this would mix up several requirements in a single class.

The way I see it (though I might be missing something), in sequential mode the acquisition is optimized batch_size times over the same domain, but between each step something changes in the acquisition (a penalizer, updating the model with pseudo points, etc.). So the optimizer and its domain stay the same, but the acquisition is transformed, and only the acquisition knows exactly how to transform itself for a specific algorithm. We could add this logic partially in BayesianOptimizer by issuing a batch_update in an extra batch loop, but this morning I thought of something different:

with acquisition.batch(steps=batch_size) as batch:
    result = batch.optimize(self.optimizer)

The batch object yielded by the context can be generic: it simply alternates between optimizing and calling a batch_update method on the acquisition function (which defaults to raising NotImplementedError and should be implemented in order to support sequential batch optimization). For generic approaches like LP, the LPAcquisition wraps another acquisition and implements batch_update itself. The return value of batch.optimize is the batch of candidates for evaluation. Finally, when the context ends, the batch object should revert the acquisition to its original state so it can be updated the normal way with the obtained observations, ready for another round. This snippet would then go in BayesianOptimizer. Let me know what you think of this; I'd be happy to help out with its implementation if needed.
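Concretely, I picture the batch object along these lines (just a sketch: SequentialBatch, batch_update and the free-state snapshot are illustrative choices rather than settled API, and I'm assuming the optimizer returns a scipy-style result with an x attribute):

from contextlib import contextmanager
import numpy as np

class SequentialBatch(object):
    # generic helper yielded by acquisition.batch(...): alternates optimize / batch_update
    def __init__(self, acquisition, steps):
        self.acquisition = acquisition
        self.steps = steps

    def optimize(self, optimizer):
        points = []
        for _ in range(self.steps):
            result = optimizer.optimize(self.acquisition.inverse_acquisition)
            points.append(result.x)
            # acquisition-specific transformation: penalizer, pseudo points, ...
            self.acquisition.batch_update(np.vstack(points))
        return np.vstack(points)

@contextmanager
def batch(acquisition, steps):
    # would live on Acquisition, so it can be called as acquisition.batch(steps=batch_size)
    state = acquisition.get_free_state()   # snapshot, one possible way to revert later
    try:
        yield SequentialBatch(acquisition, steps)
    finally:
        acquisition.set_state(state)       # revert to the original acquisition state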

Last but not least, while browsing through the papers on sequential batch algorithms, there is often some GP-specific machinery to cope with the fact that no observations are available yet for the pending points. This would imply properly checking that the model is actually a GPR, but it also means that the scaling which currently happens between the acquisition and the model will surface there. We have been talking about doing the scaling in Acquisition instead, so that code in build_acquisition always operates on the same scale as the models, but we have been in doubt about it. This might be an extra argument to move the scaling. Don't worry about this for now.

nrontsis commented 7 years ago

@javdrher Thanks, I think I agree with all of your points. Sorry I didn't notice batch_support before. I will proceed with implementing the "sequential" batch case in a context-manager fashion.

Will come back for more discussion.

nrontsis commented 7 years ago

I believe that this line should change to:

acq = self.build_acquisition(*tf.split(Xcand, num_or_size_splits=self.batch_size, axis=1))

But, more generally, why do we require the input X to be at least 2d? My understanding is that 1d would work even in the batch case, in the current implementation of batch_support.

javdrher commented 7 years ago

I pushed that branch this morning; it was a prototype on my laptop. You are right about the missing axis=1 parameter. I just found a few more required changes in my stash; I'll update the code.

The reason we currently take 2D input is to allow evaluation of many candidates at the same time in parallel (e.g., evaluate 500 MC points on EI at once, then further optimize the best candidate so far). This makes sense when the dimensionality grows and a large starting grid has to be evaluated: because the acquisition is coded in TensorFlow, this computation is multi-core and can happen on GPU. Although we could use different semantics for sequential batch support, I'd like to avoid using the same construction in a different manner for different use cases. An extra consideration here is that build_acquisition is a TensorFlow method; depending on what kind of update is required on the model, this can be achieved more easily outside of TF mode.
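As an illustration of that 2D convention (a sketch only; I'm assuming the usual Acquisition.evaluate call here):

import numpy as np

def best_mc_candidate(acquisition, lower, upper, n_mc=500):
    # one candidate per row: shape [n_mc, dim]
    Xcand = np.random.uniform(lower, upper, size=(n_mc, len(lower)))
    # a single vectorized pass through the TensorFlow graph scores all rows at once
    scores = acquisition.evaluate(Xcand)
    # the best MC point then serves as starting point for the gradient-based optimizer
    return Xcand[np.argmax(scores.ravel())]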

nrontsis commented 7 years ago

Hey, here are some thoughts after working a bit with the code:

I don't see a benefit of using a context manager for optimising a batch. It would require additional code, and I cannot see any use for __enter__ and __exit__. I think that using a function get_suggestion would be cleaner. In the base Acquisition class it would be something like:

def get_suggestion(self, optimizer):
    assert self.batch_size == 1
    return optimizer.optimize(self.inverse_acquisition)

while batch acquisition functions would override this, for example along the lines of the sketch below.
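For instance, an LP-style wrapper could override it roughly like this (a sketch: set_batch and reset_batch are hypothetical names for the per-step transformation and the cleanup, and I assume the optimizer returns a scipy-style result with an x attribute):

import numpy as np
from gpflowopt.acquisition import Acquisition

class SequentialBatchAcquisition(Acquisition):
    def get_suggestion(self, optimizer):
        points = []
        for _ in range(self.batch_size):
            result = optimizer.optimize(self.inverse_acquisition)
            points.append(result.x)
            self.set_batch(np.vstack(points))   # e.g. place a local penalizer around each selected point
        self.reset_batch()                       # restore the acquisition for the next BO iteration
        return np.vstack(points)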

Moreover, I think it would be good to make the input tensor X rank 3: one axis for the dimensionality of f, one for the batch, and one for parallel evaluations. There are a couple of reasons for this:

What do you think? Should I go on implementing it as I suggest? Thanks!

javdrher commented 7 years ago

I discussed this today with @icouckuy, and I'd like to make sure we are all on the same page. I'm sorry if this seems a bit slow, but as the changes affect the framework quite a bit, we want to proceed carefully and have the requirements clear for everyone. So here is a rewind, with some remarks on your suggestions. Two general remarks:

Taking a small step away from the code, let's decompose both problems and map them to the architecture, following the same principles we applied to come up with the design of the other parts of the framework.

I think this plan clearly reflects that both approaches are distinct: they can even be developed independently, as there are no feature dependencies. Let us know what you think; we highly appreciate your input in the discussion, as in only a couple of posts the contours of how this should look have become much sharper. If this seems like a good plan, I can write the tests for the batch_support branch and fix its shortcomings, and you could work in parallel on the sequential case. Then we should proceed to discuss which methods to provide as part of GPflowOpt itself.

nrontsis commented 7 years ago

Okay, that sounds like a solid plan, and I definitely understand you wanting to keep the framework clean.

So, if I understand you correctly:

Although it makes sense to treat the batch and sequential cases as separate, I believe we should think about how they will merge on the same branch [1]. This could be implemented as follows: when batch_size > 1, the constructor of BayesianOptimizer checks which type of batch scenario applies. Something like:

if inspect.getfullargspec(self.acquisition.build_acquisition).varargs is not None:
    # Set flag/initialise for true batch optimization
elif hasattr(self.acquisition, 'set_batch'):
    # Set flag/initialise for sequential batch optimization
else:
    # Throw an exception

For example, this would allow us to set self.domain to domain or to batch_domain, and to determine if set_batch should be called in BayesianOptimizer._optimize.

If all that sounds good, I could work on the sequential case on a new branch forked from master, with the following plan:

Should I proceed with implementing the above and come back with a first PR for further discussion? As you suggested, you could work on batch_support independently.

[1]: The truth is that I'm biased on this: My work is on "true" batch optimisation, but I need to compare against the "vanilla" choice (i.e. "sequential" batch). So I prefer a framework that does both at the same time.

javdrher commented 7 years ago

Let's continue the design discussion in the PR, thanks a lot. I agree on [1]; right now we are in the process of releasing GPflowOpt 0.1, so we cannot merge into master yet. It does make sense to merge our two branches first and come up with a single batch branch, which is then prepared for merging into master.