facebook / Ax

Adaptive Experimentation Platform
https://ax.dev
MIT License

[Sobol fallback needed] Repeated trials in experiment (and numerical errors they sometimes cause: `RuntimeError: cholesky_cpu: U(63,63) is zero, singular U.`) #228

Closed. covrig closed this issue 4 months ago.

covrig commented 4 years ago

I have a question regarding repeating trials. My code does not add any options beyond the _Service API Example on Hartmann6_ tutorial; I only changed the objective function and the parameters. Checking the results, I keep noticing that some runs produce a lot of repeated trials. Am I missing something? Did it converge? If so, how do I stop the repetition? By breaking the loop when the trial is identical to the previous one? Thanks.

```python
for i in range(20):
    parameters, trial_index = ax.get_next_trial()
    ax.complete_trial(trial_index=trial_index, raw_data=evaluate(parameters))
```
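
For reference, a minimal sketch of the setup this loop assumes (the objective body below is a stand-in, not my real function, and the `create_experiment` kwargs `objective_name`/`minimize` follow the Ax Service API of the time):

```python
from ax.service.ax_client import AxClient

def evaluate(parameters):
    # Stand-in objective; the real one computes a model's MAE.
    return {"MAE": (abs(parameters["x1"] - 20) / 40.0, 0.0)}

ax = AxClient()
ax.create_experiment(
    name="repeated_trials_example",
    parameters=[
        {"name": "x2", "type": "range", "bounds": [5, 100]},
        {"name": "x3", "type": "range", "bounds": [7, 75]},
        {"name": "x4", "type": "choice", "values": ["a", "b"]},
        {"name": "x1", "type": "range", "bounds": [1, 40]},
    ],
    objective_name="MAE",
    minimize=True,
)
```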

Results of `ax.get_trials_data_frame().sort_values('trial_index')`:

```
    arm_name       MAE  trial_index  x1  x2  x3  x4
0        0_0  0.354344            0   3  29  56   a
3        1_0  0.392026            1  21  26  34   b
12       2_0  0.366922            2  15  88  67   a
13       3_0  0.395405            3  40  83  24   b
14       4_0  0.360699            4   7   8  60   a
15       5_0   0.36654            5   1  27  66   b
16       6_0  0.360878            6   1  35  47   b
17       7_0  0.354756            7   4  21  54   b
18       8_0  0.352988            8   4  29  55   b
19       9_0  0.355494            9   7  37  56   b
1       10_0   0.35465           10   5  31  54   b
2       11_0  0.366325           11   6  38  66   b
4       12_0  0.359888           12   5  27  57   b
5       13_0  0.351413           13   1  43  55   b
6       13_0  0.351413           14   1  43  55   b
7       13_0  0.351413           15   1  43  55   b
8       13_0  0.351413           16   1  43  55   b
9       13_0  0.351413           17   1  43  55   b
10      13_0  0.351413           18   1  43  55   b
11      13_0  0.351413           19   1  43  55   b
```

Note: x4 is not used in the function.

winf-hsos commented 4 years ago

It looks similar to an issue I just reported. My guess is that the arms are in fact only identical because your parameters are discrete and are therefore rounded to the nearest integer. Behind the scenes, BO uses real values, and they are different for each arm.

I am also interested in a solution to this problem.

Thanks Nicolas

covrig commented 4 years ago

Not sure about that. These are my parameters (`"type": "range"`); the choice parameter is not used in the optimization.

```python
parameters=[
    {
        "name": "x2",
        "type": "range",
        "bounds": [5, 100],
    },
    {
        "name": "x3",
        "type": "range",
        "bounds": [7, 75],
    },
    {
        "name": "x4",
        "type": "choice",
        "values": ["a", "b"],
    },
    {
        "name": "x1",
        "type": "range",
        "bounds": [1, 40],
        # "value_type": "float",
    },
],
```
winf-hsos commented 4 years ago

I think int bounds indicate a discrete parameter. If you use [7.0, 75.0] instead, it'll be continuous.

showgood163 commented 4 years ago

Can confirm. I'm on Python 3.8 with Ax 0.1.6 and PyTorch 1.3.1. I also tried to abandon trials that seem to be duplicates of previously suggested parameters and to generate new trials with `Models.SOBOL`. The experiment then quickly fails after several trials with the error `NaNs encounterd when trying to perform matrix-vector multiplication`.

winf-hsos commented 4 years ago

> Can confirm. I also tried to abandon trials that seem to be duplicates of previously suggested parameters and to generate new trials with `Models.SOBOL`. The experiment then quickly fails after several trials with the error `NaNs encounterd when trying to perform matrix-vector multiplication`.

Can you post the code showing how you sneaked in random trials with Sobol? I tried but could not get it to work.

However, this is also not an optimal solution. It would be better not to generate duplicates in the first place. Is there a way to achieve that efficiently?

Great platform btw, best for BO and Python I know of 💪👍

Thx

winf-hsos commented 4 years ago

An update from my side. I let my model run overnight to see whether it would stop suggesting the same points (which, by the way, quite often but not always also violate a constraint I defined, so they shouldn't even be suggested). While BO did stop suggesting the same points after 4, 14, and 54 repetitions, it finally crashed after 17 suggestions of another point, and I get this message, which is the same as you get, @showgood163:

`NaNs encounterd when trying to perform matrix-vector multiplication`

NOTE: I am using Ax with a local webserver implemented in Flask. I need this because my objective function is a simulation tool that does not understand any Python.
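
Roughly, the server looks like this (a simplified sketch with hypothetical route names, not my exact code):

```python
from flask import Flask, jsonify, request
from ax.service.ax_client import AxClient

app = Flask(__name__)
ax = AxClient()  # the experiment is created at startup (omitted here)

@app.route("/suggest_point")
def suggest_point():
    # The external simulation tool polls this endpoint for new parameters.
    parameters, current_trial_index = ax.get_next_trial()
    return jsonify({"trial_index": current_trial_index, "parameters": parameters})

@app.route("/report_result", methods=["POST"])
def report_result():
    # The simulation tool posts its objective value back when done.
    payload = request.get_json()
    ax.complete_trial(trial_index=payload["trial_index"], raw_data=payload["objective"])
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=5000)
```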

The full stack trace if that helps:

... [ some Flask errors stack before that ]
  File "server.py", line 146, in suggest_point
    parameters, current_trial_index = ax.get_next_trial()
  File "C:\code\BOFlaskAx\env\lib\site-packages\ax\service\ax_client.py", line 240, in get_next_trial
    trial = self.experiment.new_trial(generator_run=self._gen_new_generator_run())
  File "C:\code\BOFlaskAx\env\lib\site-packages\ax\service\ax_client.py", line 715, in _gen_new_generator_run
    experiment=self.experiment
  File "C:\code\BOFlaskAx\env\lib\site-packages\ax\modelbridge\generation_strategy.py", line 229, in gen
    self._model.update(experiment=experiment, data=new_data)
  File "C:\code\BOFlaskAx\env\lib\site-packages\ax\modelbridge\base.py", line 455, in update
    self._update(observation_features=obs_feats, observation_data=obs_data)
  File "C:\code\BOFlaskAx\env\lib\site-packages\ax\modelbridge\array.py", line 123, in _update
    self._model_update(Xs=Xs_array, Ys=Ys_array, Yvars=Yvars_array)
  File "C:\code\BOFlaskAx\env\lib\site-packages\ax\modelbridge\torch.py", line 120, in _model_update
    self.model.update(Xs=Xs, Ys=Ys, Yvars=Yvars)
  File "C:\code\BOFlaskAx\env\lib\site-packages\ax\models\torch\botorch.py", line 399, in update
    refit_model=self.refit_on_update,
  File "C:\code\BOFlaskAx\env\lib\site-packages\ax\models\torch\botorch_defaults.py", line 123, in get_and_fit_model
    mll = fit_gpytorch_model(mll, bounds=bounds)
  File "C:\code\BOFlaskAx\env\lib\site-packages\botorch\fit.py", line 98, in fit_gpytorch_model
    mll, _ = optimizer(mll, track_iterations=False, **kwargs)
  File "C:\code\BOFlaskAx\env\lib\site-packages\botorch\optim\fit.py", line 210, in fit_gpytorch_scipy
    callback=cb,
  File "C:\code\BOFlaskAx\env\lib\site-packages\scipy\optimize\_minimize.py", line 600, in minimize
    callback=callback, **options)
  File "C:\code\BOFlaskAx\env\lib\site-packages\scipy\optimize\lbfgsb.py", line 335, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "C:\code\BOFlaskAx\env\lib\site-packages\scipy\optimize\lbfgsb.py", line 285, in func_and_grad
    f = fun(x, *args)
  File "C:\code\BOFlaskAx\env\lib\site-packages\scipy\optimize\optimize.py", line 327, in function_wrapper
    return function(*(wrapper_args + args))
  File "C:\code\BOFlaskAx\env\lib\site-packages\scipy\optimize\optimize.py", line 65, in __call__
    fg = self.fun(x, *args)
  File "C:\code\BOFlaskAx\env\lib\site-packages\botorch\optim\fit.py", line 268, in _scipy_objective_and_grad
    raise e  # pragma: nocover
  File "C:\code\BOFlaskAx\env\lib\site-packages\botorch\optim\fit.py", line 263, in _scipy_objective_and_grad
    loss = -mll(*args).sum()
  File "C:\code\BOFlaskAx\env\lib\site-packages\gpytorch\module.py", line 22, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "C:\code\BOFlaskAx\env\lib\site-packages\gpytorch\mlls\exact_marginal_log_likelihood.py", line 27, in forward
    res = output.log_prob(target)
  File "C:\code\BOFlaskAx\env\lib\site-packages\gpytorch\distributions\multivariate_normal.py", line 128, in log_prob
    inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
  File "C:\code\BOFlaskAx\env\lib\site-packages\gpytorch\lazy\lazy_tensor.py", line 1052, in inv_quad_logdet
    *args,
  File "C:\code\BOFlaskAx\env\lib\site-packages\gpytorch\functions\_inv_quad_log_det.py", line 112, in forward
    solves, t_mat = lazy_tsr._solve(rhs, preconditioner, num_tridiag=num_random_probes)
  File "C:\code\BOFlaskAx\env\lib\site-packages\gpytorch\lazy\lazy_tensor.py", line 641, in _solve
    preconditioner=preconditioner,
  File "C:\code\BOFlaskAx\env\lib\site-packages\gpytorch\utils\linear_cg.py", line 162, in linear_cg
    raise RuntimeError("NaNs encounterd when trying to perform matrix-vector multiplication")
RuntimeError: NaNs encounterd when trying to perform matrix-vector multiplication
showgood163 commented 4 years ago

@winf-hsos

> Can you post the code showing how you sneaked in random trials with Sobol? I tried but could not get it to work.

Here's a dirty hack of ax.client: I simply added a Sobol generation strategy to the AxClient. You can run the code directly to see the effect.
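
In spirit, the hack amounts to passing a custom generation strategy to AxClient (a sketch, not the exact hack; the step kwarg is `num_trials` in recent Ax versions, `num_arms` in older ones):

```python
from ax.modelbridge.generation_strategy import GenerationStrategy, GenerationStep
from ax.modelbridge.registry import Models
from ax.service.ax_client import AxClient

gs = GenerationStrategy(
    steps=[
        # Quasi-random Sobol trials first...
        GenerationStep(model=Models.SOBOL, num_trials=8),
        # ...then Bayesian optimization for all remaining trials.
        GenerationStep(model=Models.GPEI, num_trials=-1),
    ]
)
ax = AxClient(generation_strategy=gs)
```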

> it finally crashed after 17 suggestions of another point

Same here. However, I don't know how to dig into this problem, which is why I want to reuse previous experiment results.

lena-kashtelyan commented 4 years ago

@covrig, @winf-hsos: when defining a parameter like so, without providing `value_type`:

```python
{
    "name": "x1",
    "type": "range",
    "bounds": [1, 40],
    # "value_type": "float",  # need to specify this if using int bounds but wanting float values
},
```

the value type is inferred from the type of the bounds, which in your case is int, which makes your parameters de facto discrete. However, Bayesian optimization operates on continuous domains, and when the configurations it suggests are rounded to ints, different configurations end up identical, as @winf-hsos correctly explained.
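
Concretely, a sketch of two equivalent fixes:

```python
# Either make the bounds floats...
{"name": "x1", "type": "range", "bounds": [1.0, 40.0]}
# ...or keep int bounds but declare the value type explicitly.
{"name": "x1", "type": "range", "bounds": [1, 40], "value_type": "float"}
```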

I will be back shortly with a proposed solution to the issue!

P.S.: @showgood163, if you are just looking to use a Sobol generation strategy instead of Bayesian optimization, you can pass `choose_generation_strategy_kwargs={"no_bayesian_optimization": True}` to `AxClient.create_experiment`, which will force the generation strategy for your optimization to be quasi-random.
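
For example (a sketch; `create_experiment` kwargs as in the Ax version discussed here):

```python
from ax.service.ax_client import AxClient

ax = AxClient()
ax.create_experiment(
    name="sobol_only",
    parameters=[{"name": "x1", "type": "range", "bounds": [1.0, 40.0]}],
    objective_name="MAE",
    minimize=True,
    # Forces a quasi-random (Sobol-only) generation strategy.
    choose_generation_strategy_kwargs={"no_bayesian_optimization": True},
)
```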

covrig commented 4 years ago

@lena-kashtelyan: Thanks. The documentation is clear about `"value_type": "float"`; I should have removed that commented-out line before posting (I was using it for debugging). Going back to my results table: starting with trial_index=13, all the parameters are identical. From what I read, they are not true ints, they are just rounded. Is that right?

I am stopping the repetition using the code below. Is it a bad idea?

```python
# Run optimization loop
old_parameters = None
for i in range(no_trials):
    parameters, trial_index = ax.get_next_trial()
    print(f"Running trial {trial_index + 1}/{no_trials}...")
    if old_parameters != parameters:
        # New suggestion: evaluate and report it.
        ax.complete_trial(trial_index=trial_index, raw_data=evaluate(parameters))
    else:
        # Same suggestion as last time: abandon it instead of re-evaluating.
        print(f"Trial {trial_index + 1}/{no_trials} is abandoned due to repetition!")
        ax.experiment.trials[trial_index].mark_abandoned()
    old_parameters = parameters
```
lena-kashtelyan commented 4 years ago

@covrig, the answer to whether your stopping logic will work for your use case is, unfortunately, "it depends." If 1) we treat the observations obtained from trial evaluation as noiseless, and 2) you are okay with the risk of stopping at a local rather than a global optimum, then that stopping logic can work.

However, if the observations are noisy and you can afford to continue running the optimization for more trials, then the right thing to do would be to keep going. At the moment, though, once you end up with a lot of repeated trials, numerical issues arise (the `NaNs encountered when trying to perform matrix-vector multiplication` error). We are looking into how to better avoid that.

A better alternative might be not to stop when you get a repeated trial, but to continue getting new trials (without completing them) until you get a new one (with some limit, of course, at which point you can just stop the whole optimization).
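
Sketched out (illustrative only; this reuses `ax`, `evaluate`, and `no_trials` from the snippet above, and the retry limit is arbitrary):

```python
MAX_REPEATS = 5  # give up once the model insists on a repeated point this often

seen_parameters = set()
for i in range(no_trials):
    for attempt in range(MAX_REPEATS):
        parameters, trial_index = ax.get_next_trial()
        key = tuple(sorted(parameters.items()))
        if key not in seen_parameters:
            break
        # Repeated suggestion: abandon it and ask for another candidate.
        ax.experiment.trials[trial_index].mark_abandoned()
    else:
        print("Only repeated suggestions left; stopping the optimization.")
        break
    seen_parameters.add(key)
    ax.complete_trial(trial_index=trial_index, raw_data=evaluate(parameters))
```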

P.S.: If you can afford to just exhaust your search space or get close to doing so (which in your case you probably cannot, since the discrete grid of 40 × 96 × 69 × 2 parameter combinations comes to roughly half a million trials), it would be reasonable to just use the Sobol generation strategy instead of Bayesian optimization –– I described how to do so in https://github.com/facebook/Ax/issues/230.

covrig commented 4 years ago

Thanks for the reply. You gave several options to investigate, and #230 looks interesting. I am not sure why you keep the ints around given the continuous domains; it's a bit confusing.

covrig commented 4 years ago

@lena-kashtelyan Just a silly question: the Service API seems much faster than the Loop API for identical optimizations. I didn't notice anything running in parallel in the Service API with default parameters. Am I missing something? Which API would you recommend as a default? I would tend to say the Loop API, but then I have no explanation for its longer running times, since both APIs are described as synchronous (with default options, as in the standard tutorials).

showgood163 commented 4 years ago

@lena-kashtelyan Thank you for pointing that out! However, for the code snippet above, what I need is a conditional generation strategy that falls back to random sampling when BO generates a seemingly identical trial, so I think an init parameter like that may not solve the problem here.

lena-kashtelyan commented 4 years ago

@covrig, re: keeping ints around for continuous ranges –– simply because in some cases folks need the parameters to take integer values only. We might reconsider if enough people find it confusing, though!

Re: the Service API being faster –– that is unexpected. I would be curious to learn how you measured the runtimes (since the Loop API runs the evaluation function within the optimization and the Service API does not). If you'd like us to look into it, please open a separate issue with your code snippets. Thank you : )

Re: default API –– it really depends on the use case; there is no inherent reason to prefer one over the other. Not sure I can put it much better than our APIs doc does.

@showgood163, gotcha! It seemed like you were doing something fancier than just forcing Sobol, but I just wanted to show the easier way of forcing it anyway, in case it comes in handy. Thank you for being a power user of Ax and providing us with helpful feedback!

lena-kashtelyan commented 4 years ago

Update: we had a team discussion around this, and here are the outcomes:

1) We are working on making our modeling layer more robust to the numerical errors caused by data logged for many repeated trials, and will update this issue when the fix for that is in.

2) We are also considering adding an option to specify experiment-level trial deduplication (currently we only offer deduplication for Sobol, but not for BayesOpt).

3) In the meantime, when encountering identical or very similar trials, you can skip over repeated trials and eventually use them as a stopping criterion, as described in my comment above:

> A better alternative might be not to stop when you get a repeated trial, but to continue getting new trials (without completing them) until you get a new one (with some limit, of course, at which point you can just stop the whole optimization).

4) If standard errors on observations are known (or the observations are known to be noiseless, in which case standard errors are 0.0), it's better to pass those values to Ax, since then Ax will not need to infer the noise level and numerical errors will be less likely to crop up.
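
For example, in the Service API a known standard error can be attached by passing raw_data as a (mean, SEM) tuple; a sketch for a noiseless objective:

```python
# An SEM of 0.0 tells Ax the observation is noiseless, so the model
# does not have to infer a noise level.
mean_mae = evaluate(parameters)  # the user's objective, returning a float here
ax.complete_trial(trial_index=trial_index, raw_data=(mean_mae, 0.0))
```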

cc @Balandat, @eytan, @bletham, @ldworkin

covrig commented 4 years ago

@lena-kashtelyan

> We might reconsider if enough people find it confusing, though!

It's just a matter of clarifying the online documentation, since it is confusing (e.g., from the Core docs: "If the parameter is specified as an int, newly generated points are rounded to the nearest integer by default.").

> Re: the Service API being faster –– that is unexpected. I would be curious to learn how you measured the runtimes (since the Loop API runs the evaluation function within the optimization and the Service API does not). If you'd like us to look into it, please open a separate issue with your code snippets. Thank you : )

This was a mistake on my side. I combine Ax with Facebook Prophet for forecast optimization, and the Loop API was simply unlucky with the initial Sobol steps.

~A question: I have a CPU with 44 threads. Using the Service API, I could evaluate 44 Sobol steps in parallel (multiprocessing), then continue with another 10-15 sequential steps of Bayesian optimization. Do you think the large number of Sobol steps is somehow skewing my result?~

lena-kashtelyan commented 4 years ago

@covrig, please open a separate issue for the question about the large number of Sobol steps, since we do want these issues to be easily discoverable by others with similar questions!

Noted regarding the int-type documentation.

lena-kashtelyan commented 4 years ago

Discussion in https://github.com/facebook/Ax/issues/381 is relevant for this issue, too.

sdaulton commented 4 years ago

We have some improved methods in the works for better treatment of integer parameters (in the next couple weeks), which should resolve this issue. cc: @Balandat

ghost commented 3 years ago

hi, any updates on this?

ldworkin commented 3 years ago

Hi @blenderben2 ! I don't think there's anything new that's been shipped yet, but we should be able to help you figure out a workaround based on your particular use case -- can you provide more details about what kind of optimization you're running, and when you're seeing this error?

Without knowing more, the best solution is probably what @lena-kashtelyan suggested above:

> In the meantime, when encountering identical or very similar trials, you can skip over repeated trials and eventually use them as a stopping criterion.

In other words, if you're hitting this error because we're generating many repeated trials, it probably means the optimization is complete, and we've found the best point that we can.

lena-kashtelyan commented 1 year ago

This should also be resolved by the Sobol fallback that @saitcakmak is planning to work on soon, so I will assign this to him as well.

saitcakmak commented 4 months ago

Closing this as inactive. In the current version, AxClient deduplicates candidates by default, so repeated candidates should not happen (except for FAILED trials).
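
For reference, a sketch of the knob behind this behavior (in recent Ax versions, deduplication is configured per generation step via `should_deduplicate`; kwarg availability depends on your Ax version):

```python
from ax.modelbridge.generation_strategy import GenerationStep
from ax.modelbridge.registry import Models

# Re-draw candidates that duplicate previously suggested arms.
step = GenerationStep(model=Models.GPEI, num_trials=-1, should_deduplicate=True)
```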