facebook / Ax

Adaptive Experimentation Platform
https://ax.dev
MIT License

Restrict RangeParameter to a certain stepsize (or grid) #849

Closed ograsdijk closed 2 years ago

ograsdijk commented 2 years ago

For optimizing experiments in the lab I'd want to constrain my range parameters to have a certain stepsize (i.e. effectively do a grid optimization), since we know from experience that some experimental parameters need to be varied more to have an effect on whatever it is we're measuring. Measurements can also take a long time, so limiting the points we can optimize over also helps in that regard. After the coarser grid optimization we might want to zero in on a region of interest and do a finer grid optimization there.

I realize that restricting myself to a grid for optimization kind of defeats the purpose of letting the model learn that some parameters aren't very sensitive, but I couldn't find another way to put this information in a priori (using the service API).

Before continuing with how I currently achieved this, my question is basically: Is there some more efficient method of restricting the search space to a grid of points with a RangeParameter, or am I limited to either maximizing the acquisition function over a grid of points or using a ChoiceParameter? The ChoiceParameter option doesn't give access to get_contour_plot or other plot methods. I can get predictions from the trained model, but these predictions (even for already-measured points) don't match the measured values.

My current procedure

First I initialize the model with some pre-calculated points.

From #771 I was able to figure out how to do this by generating the grid and then running in a loop:

```python
import copy

import numpy as np
from ax.core.observation import ObservationFeatures
from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models
from ax.models.torch.botorch_modular.surrogate import Surrogate
from ax.service.ax_client import AxClient
from ax.utils.measurement.synthetic_functions import branin
from botorch.acquisition import qNoisyExpectedImprovement
from botorch.models import SingleTaskGP


def evaluate(parameters):
    x = np.array(list(parameters.values()))
    return {"branin": (branin(x), None), "l2norm": (np.sqrt((x ** 2).sum()), None)}


gs = GenerationStrategy(
    steps=[
        GenerationStep(  # BayesOpt step
            model=Models.BOTORCH_MODULAR,
            # No limit on how many generator runs will be produced
            num_trials=-1,
            model_kwargs={  # Kwargs to pass to `BoTorchModel.__init__`
                "surrogate": Surrogate(SingleTaskGP),
                "botorch_acqf_class": qNoisyExpectedImprovement,
            },
        ),
    ]
)

ax_client = AxClient(generation_strategy=gs, verbose_logging=False)

ax_client.create_experiment(
    name="branin_test_experiment",
    parameters=[
        {"name": "x1", "type": "range", "bounds": [-5.0, 10.0], "value_type": "float"},
        {"name": "x2", "type": "range", "bounds": [0.0, 15.0], "value_type": "float"},
    ],
    objective_name="branin",
    minimize=True,
    tracking_metric_names=["l2norm"],
)

# seeding the model with an initial coarse grid search
nscan = 5
params_initial = [
    np.linspace(p.lower, p.upper, nscan)
    for p in ax_client.experiment.search_space.parameters.values()
]
params_initial = np.array(np.meshgrid(*params_initial)).T.reshape(-1, len(params_initial))

n_train = params_initial.shape[0]
for i in range(n_train):
    p = dict(zip(ax_client.experiment.parameters.keys(), params_initial[i]))
    ax_client.attach_trial(p)
    ax_client.complete_trial(trial_index=i, raw_data=evaluate(p))

# fitting the model; no access yet to a fit-model interface, so using the
# trial-failure workaround
_, trial_index = ax_client.get_next_trial()
ax_client.log_trial_failure(trial_index)

# generating the optimization grid
nscan = 15
values = [
    np.linspace(p.lower, p.upper, nscan)
    for p in ax_client.experiment.search_space.parameters.values()
]
params = np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

observation_features = [
    ObservationFeatures(dict(zip(ax_client.experiment.parameters.keys(), val)))
    for val in params
]

for _ in range(25):
    model_bridge = copy.deepcopy(ax_client.generation_strategy.model)

    acqf_values = [
        model_bridge.evaluate_acquisition_function(
            observation_features=[obs_feat],
            search_space=copy.deepcopy(ax_client.experiment.search_space),
        )
        for obs_feat in observation_features
    ]

    idx_max = np.argmax(acqf_values)
    p = observation_features[idx_max].parameters

    trial_params = ax_client.get_trials_data_frame()[["x1", "x2"]].values.tolist()
    # stopping if the suggested optimum was already checked
    if list(p.values()) in trial_params:
        opt = model_bridge.predict([observation_features[idx_max]])[0]["branin"][0]
        print("already checked the new optimum", p, opt)
        break

    ax_client.attach_trial(p)
    ax_client.complete_trial(trial_index=ax_client.experiment.num_trials - 1, raw_data=evaluate(p))
    # fit the model with the trial-failure workaround
    _, trial_index = ax_client.get_next_trial()
    ax_client.log_trial_failure(trial_index)
```
EugenHotaj commented 2 years ago

Would modeling your less sensitive parameters in log space be useful here? I would recommend that you do not artificially discretize your continuous parameters as this will generally lead to a much worse model fit.
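For reference, a minimal sketch of a log-scale range parameter in the Service API; the parameter name and bounds are illustrative, and `log_scale` requires strictly positive bounds:

```python
from ax.service.ax_client import AxClient

ax_client = AxClient()
ax_client.create_experiment(
    name="log_scale_example",
    parameters=[
        {
            "name": "laser_power",   # illustrative name
            "type": "range",
            "bounds": [1e-3, 1e1],   # must be strictly positive for log_scale
            "value_type": "float",
            "log_scale": True,       # model this parameter in log space
        },
    ],
    objective_name="branin",
    minimize=True,
)
```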

Looking at your code, you can replace your manual grid sampling setup with Sobol sampling by including it as the first generation step in your GenerationStrategy. Sobol will approximately sample a lattice over your search space and you can increase/decrease the number of steps to make the lattice finer/coarser.
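A minimal sketch of such a GenerationStrategy, assuming 25 Sobol initialization trials (the count is arbitrary and controls how fine the quasi-random lattice is):

```python
from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models

gs = GenerationStrategy(
    steps=[
        # Quasi-random Sobol initialization instead of a manual grid
        GenerationStep(model=Models.SOBOL, num_trials=25),
        # BayesOpt for all subsequent trials
        GenerationStep(model=Models.BOTORCH_MODULAR, num_trials=-1),
    ]
)
```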

Also, it looks like in the second step you're manually trying to maximize the acquisition function on a grid of points. This is very suboptimal as the best acquisition value might lie somewhere between these points. I would suggest letting Ax optimize the acquisition function instead, e.g. by calling get_next_trial(), which will attempt to return the best point over your entire search space.
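A sketch of the standard Service API loop this refers to, reusing the `evaluate` function and `ax_client` from above:

```python
for _ in range(25):
    # Ax optimizes the acquisition function over the full continuous search space
    parameters, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=evaluate(parameters))
```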

Finally, it's also possible that using a different model or acquisition function would work better for what you're looking for. cc @dme65 in case he has any modeling suggestions.

ograsdijk commented 2 years ago

Log space won't work for us as the less sensitive parameters aren't less sensitive in log space unfortunately.

We sometimes already have data sampled over a coarse grid, hence my initializing on a grid instead of using Sobol. This is also something we might need for publication of results, where reviewers sometimes like to see more points of the sample space instead of just the optimization. I'll check out Sobol in more detail for the first generation step to see if it would suffice for this purpose; for pure optimization I'd definitely just start with Sobol.

As to why I (perhaps foolishly) wanted to use a grid to optimize over: I realize that the actual optimum probably lies between grid points and that it would be more efficient to just let the GPEI (or another model) choose the points. However, I didn't find another method to put in a priori information about how sensitive the objective is to changes in certain parameters, and I figured a grid would be a decent approximation. In principle the model should just learn this by itself after a few iterations, but our measurements can be relatively time-consuming (~1 hour per trial), with multiple people having to monitor systems in the lab, so I wanted to find some way to cut down on the total number of measurements required for both optimization and exploration around the optimum. Maybe I should change the model to favor exploration a bit more to accomplish this without a grid?

I could of course just let it optimize until the improvement is below some threshold for some n trials in a row and then stop the optimization. And if I need more points around the optimum for publication purposes just generate a grid and measure those afterwards.

EugenHotaj commented 2 years ago

We sometimes already have some data sampled over a coarse grid, hence my use of initializing on a grid instead of using Sobol.

Makes sense. If you already have access to this data without needing to evaluate your expensive function, attaching it directly is a good idea.

Maybe I should change the model to favor exploration a bit more to accomplish this without a grid?

Note that Expected Improvement (EI) does trade off between exploration and exploitation. In practice, EI will explore regions in the objective function which have high uncertainty before honing in on regions with high objective value. Optimizing the acquisition function on the grid might actually take more iterations to find the optimum since you're selecting a suboptimal point in each iteration (although given that your discretization is so coarse, this might not be the case).

I could of course just let it optimize until the improvement is below some threshold for some n trials in a row and then stop the optimization.

It looks like you're running 25 optimization iterations (+ the extra starting data), which for a 2D problem should give you relatively good results. Out of curiosity, have you compared the best point found by default Ax vs your grid approach after 25 trials each? Did you find Ax was significantly worse?

Finally, I believe it is possible to set custom GP kernel parameters, priors, etc, but you'd need to drop down to lower level APIs.

ograsdijk commented 2 years ago

It looks like you're running 25 optimization iterations (+ the extra starting data), which for a 2D problem should give you relatively good results. Out of curiosity, have you compared the best point found by default Ax vs your grid approach after 25 trials each? Did you find Ax was significantly worse?

Letting GPEI just run from the initialized grid in continuous parameter space for 25 iterations does give slightly better results in the same number of iterations. My issue (which of course has nothing to do with finding the optimum itself) was just the trade-off of then having to separately sample a grid around the optimum, which was partially already accomplished by restricting the sample space to a grid. We are fine with a marginally less optimal parameter set.

However, given your comments I've decided to do the following:

Finally, regarding the model parameters: in BoTorch an acquisition function like qUpperConfidenceBound has a beta parameter which controls how much exploration is done. However, in the service API there is (as far as I can tell) no way to actually set this parameter. I can use the acquisition function with:

```python
gs = GenerationStrategy(
    steps=[
        GenerationStep(  # BayesOpt step
            model=Models.BOTORCH_MODULAR,
            num_trials=15,
            model_kwargs={  # Kwargs to pass to `BoTorchModel.__init__`
                "surrogate": Surrogate(SingleTaskGP),
                "botorch_acqf_class": qUpperConfidenceBound,
            },
        )
    ]
)
```

and I thought I could then add a value for beta within model_kwargs, but that doesn't seem to work. Is there some way to accomplish this with the service API?

EugenHotaj commented 2 years ago

having to then sample separately a grid to the area around the optimum a bit more

Curious why you want to do this? Are you interested in (1) visualizing what the objective function looks like around the optimum, or (2) just squeezing out the last bit of performance?

For (1), you could use the model predictions directly and should get a reasonable visualization assuming the model fit is decent (in fact, this is how our contour plots work). For (2) it's probably a better idea to just use that extra budget to run the optimization for more iterations. You'll likely get better results as the model will eventually start exploiting and only try points around the optimum.

Another thing you could do is to first run your normal 25-trial optimization. After that, you can shrink the search space around the best point found so far and kick off another search with this smaller search space. You can warm start this search by including the points from the first run which fall within the new search space, as in the sketch below. This will likely perform much better than the grid around the optimum.
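A rough sketch of this shrink-and-warm-start idea, assuming a ±1.0 window around the best point (window size and metric handling are illustrative) and that all trials in the first run completed successfully:

```python
from ax.service.ax_client import AxClient

best_parameters, _ = ax_client.get_best_parameters()
x1_best, x2_best = best_parameters["x1"], best_parameters["x2"]

refine_client = AxClient()
refine_client.create_experiment(
    name="branin_refine",
    parameters=[
        {"name": "x1", "type": "range", "bounds": [x1_best - 1.0, x1_best + 1.0], "value_type": "float"},
        {"name": "x2", "type": "range", "bounds": [x2_best - 1.0, x2_best + 1.0], "value_type": "float"},
    ],
    objective_name="branin",
    minimize=True,
)

# Warm start: re-attach previously evaluated points that fall inside the new bounds
df = ax_client.get_trials_data_frame()
for _, row in df.iterrows():
    p = {"x1": float(row["x1"]), "x2": float(row["x2"])}
    if refine_client.experiment.search_space.check_membership(p):
        _, idx = refine_client.attach_trial(p)
        refine_client.complete_trial(trial_index=idx, raw_data={"branin": (float(row["branin"]), None)})
```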

and I thought I could then add a value for beta within model_kwargs

I believe what you want is actually acquisition_options, e.g.

```python
model_kwargs={
    "surrogate": Surrogate(SingleTaskGP),
    "botorch_acqf_class": qUpperConfidenceBound,
    "acquisition_options": {"beta": <my-value>},
}
```
ograsdijk commented 2 years ago

Curious why you want to do this? Are you interested in (1) visualizing what the objective function looks like around the optimum or (2) just to squeeze out the last bit of performance.

Purely for visualizing. I know the performance will be better if the model just iterates on the continuous parameter space. Personally I would just use the model predictions to show the objective function around the optimum (as in the contour plots), as you suggested, because with enough iterations the model should indeed be decent enough. However, it will probably be less of a hassle (with collaborators, reviewers, etc.) to have sufficient actual measured data around the optimum instead of predicted values. The model starting to exploit around the optimum might actually be enough for this, though; I'll look into that.

For (2) it's probably a better idea to just use that extra budget to run the optimization for more iterations. You'll likely get better results as the model will eventually start exploiting and only try points around the optimum.

Yeah, I totally agree. Aside from restricting the parameter choice for less sensitive parameters, the grid idea was mostly to strike a balance between the number of trials needed for optimization and the number of extra measurements required for the "real" data visualization. If it were purely about finding the optimum I'd just let the model iterate. (And perhaps dive deeper into the lower-level APIs to specify kernel parameters and priors.)

I believe what you want is actually acquisition_options, e.g.

```python
model_kwargs={
    "surrogate": Surrogate(SingleTaskGP),
    "botorch_acqf_class": qUpperConfidenceBound,
    "acquisition_options": {"beta": <my-value>},
}
```

That makes sense.

I think that's solved my initial question and subsequent follow ups, so thanks a lot!

Just for completeness, our real-life use cases are:

EugenHotaj commented 2 years ago

However, it will probably be less of a hassle (with collaborators, reviewers, etc.)

Makes sense to go with the grid then if that's what your collaborators/reviewers will want.

I think that's solved my initial question and subsequent follow ups, so thanks a lot!

Glad I could help! Going to close this out. Please feel free to reopen if necessary.

sgbaird commented 2 years ago

@ograsdijk would love to get the link to the preprint/publication when it's available!

eytan commented 2 years ago

Just for posterity: if you did want to search over a continuous space with a certain step size, I’d recommend using an integer-valued parameter in Ax, and then applying the appropriate multiplier / intercept in your function evaluation code. If you use the Choice data, Ax won’t be able to do the modeling as effectively. We also have ways of accounting for the rounding in the acquisition function which should work pretty well, and will work even better in the future.
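A hedged sketch of this integer-parameter pattern, assuming a physical parameter that should step in increments of 0.5 between 0 and 10 (the names, step size, and bounds are illustrative):

```python
# Search over integers 0..20 and map to the physical grid 0.0, 0.5, ..., 10.0
# inside the evaluation function.
parameters = [
    {"name": "x_steps", "type": "range", "bounds": [0, 20], "value_type": "int"},
]

def evaluate(params):
    x_physical = 0.0 + 0.5 * params["x_steps"]  # intercept + multiplier * integer value
    return {"objective": (measure_in_lab(x_physical), None)}  # measure_in_lab is hypothetical
```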

@lena-kashtelyan maybe we can have a wishlist item for adding a step size to integer parameters. In the past I was not such a fan of this idea but it’s useful to keep tabs on user demand :) cc @sdaulton.

dme65 commented 2 years ago

if you did want to search over a continuous space with a certain step size, I’d recommend using an integer-valued parameter in Ax, and then applying the appropriate multiplier / intercept in your function evaluation code. If you use the Choice data, Ax won’t be able to do the modeling as effectively

These two options are equivalent with the standard Ax transforms. Choice parameters are transformed using an OrderedChoiceEncode transform if they are specified as ordered, and this transform maps the specified choices to 0, 1, ..., k-1. In the case of evenly spaced integers like {0, 4, 8, 12, 16}, the options will just be transformed to {0, 1, 2, 3, 4}.
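In Service API terms, the two equivalent ways of expressing the {0, 4, 8, 12, 16} example might look like this sketch:

```python
# Option A: ordered choice parameter (internally encoded as 0, 1, 2, 3, 4)
{"name": "x", "type": "choice", "values": [0, 4, 8, 12, 16], "value_type": "int", "is_ordered": True}

# Option B: integer range parameter 0..4, multiplied by 4 in the evaluation code
{"name": "x", "type": "range", "bounds": [0, 4], "value_type": "int"}
```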

eytan commented 2 years ago

Indeed! Did anybody mention OrderedChoice? I might have missed that. This could be a good option as well!


Abrikosoff commented 5 months ago

I have a very similar (if not exactly the same!) problem in my use case: there could be several parameters in my search space which should only take values at certain stepsizes. Setting these to Choice, as suggested in #2371, is not feasible for me, as that would preclude applying other constraints which are only available for RangeParameters. As an initial test I tried to run the code quoted in the original question, but it does not seem to achieve what it is claimed to do (returning specific parameters with discrete stepsizes). On the other hand, Eytan's comment

Just for posterity: if you did want to search over a continuous space with a certain step size, I’d recommend using an integer-valued parameter in Ax, and then applying the appropriate multiplier / intercept in your function evaluation code. If you use the Choice data, Ax won’t be able to do the modeling as effectively. We also have ways of accounting for the rounding in the acquisition function which should work pretty well, and will work even better in the future.

@lena-kashtelyan maybe we can have a wishlist item for adding a step size to integer parameters. In the past I was not such a fan of this idea but it’s useful to keep tabs on user demand :) cc @sdaulton.

seems to closely match my requirements, but I am not sure what he means by applying the multiplier / intercept in the evaluation code (since, naively, I'd expect we want to transform the suggested parameter values).

My question is: is there a recommended way of ensuring suggestions only in discrete stepsizes (putting aside issues of optimizer performance and so on for the moment)? Thanks very much in advance!

sgbaird commented 5 months ago

@Abrikosoff fly-by comment: my understanding of that suggestion is that if you want 0 to 1 with a step size of 0.1, you would use an integer parameter with a range of 0 to 10 and then "behind the scenes" divide that parameter by 10 before running your objective function.
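A small sketch of that mapping, assuming an integer range parameter named "x_int" with bounds [0, 10] (the objective name and measurement function are placeholders):

```python
def evaluate(parameters):
    x = parameters["x_int"] / 10.0  # integer 0..10 -> float 0.0..1.0 in steps of 0.1
    return {"objective": (expensive_measurement(x), None)}  # expensive_measurement is hypothetical
```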

Balandat commented 5 months ago

@sgbaird that is correct.

Alternatively you can use a ChoiceParameter of the appropriate type and enumerate the choices (and make sure to indicate that is_ordered=True, which should be chosen by default if your parameter type is numerical and you have more than two choices).

Abrikosoff commented 5 months ago

@sgbaird Thank you Prof. Baird for your answer! Just to be sure, what this means is that I still classify this as a RangeParameter, but set it to take integer values, and then perform this "behind the scenes" operation before I feed it back into the optimizer?