facebook / Ax

Adaptive Experimentation Platform
https://ax.dev

ALEBO: "RuntimeError: cholesky_cpu: U(1,1) is zero, singular U." #384

Closed: pehzet closed this issue 4 years ago

pehzet commented 4 years ago

I started testing the ALEBO strategy with 5 real parameters and 20 dummy parameters (see code below), based on this quickstart. After a few iterations I get the following error. Does anyone know how to fix it? I have seen that "RuntimeError: cholesky_cpu: U(1,1) is zero, singular U." appears in other issues as well, but I think this might be a different problem.

Traceback (most recent call last):
  File "C:\code\00_Testbereich\BO\testBo.py", line 101, in <module>
    optMRP()
  File "C:\code\00_Testbereich\BO\testBo.py", line 78, in optMRP
    generation_strategy=alebo_strategy
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 224, in optimize
    loop.full_run()
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 166, in full_run
    self.run_trial()
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 145, in run_trial
    experiment=self.experiment
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\generation_strategy.py", line 376, in gen
    keywords=get_function_argument_names(model.gen),
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\base.py", line 626, in gen
    model_gen_options=model_gen_options,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\array.py", line 225, in _gen
    target_fidelities=target_fidelities,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\torch.py", line 224, in _model_gen
    target_fidelities=target_fidelities,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 657, in gen
    model_gen_options=model_gen_options,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\botorch.py", line 367, in gen
    acf_options,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 467, in ei_or_nei
    X_pending=X_pending,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\botorch_defaults.py", line 171, in get_NEI
    inf_cost = get_infeasible_cost(X=X_observed, model=model, objective=obj_tf)
  File "C:\code\00_Testbereich\env\lib\site-packages\botorch\acquisition\utils.py", line 173, in get_infeasible_cost
    posterior = model.posterior(X)
  File "C:\code\00_Testbereich\env\lib\site-packages\botorch\models\gpytorch.py", line 500, in posterior
    mvns = self(*[X for _ in range(self.num_outputs)])
  File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\models\model_list.py", line 83, in __call__
    return [model.__call__(*args_, **kwargs) for model, args_ in zip(self.models, _get_tensor_args(*args))]
  File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\models\model_list.py", line 83, in <listcomp>
    return [model.__call__(*args_, **kwargs) for model, args_ in zip(self.models, _get_tensor_args(*args))]
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 170, in __call__
    mvn = MultivariateNormal(mu, C)
  File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\distributions\multivariate_normal.py", line 49, in __init__
    super().__init__(loc=mean, covariance_matrix=covariance_matrix, validate_args=validate_args)
  File "C:\code\00_Testbereich\env\lib\site-packages\torch\distributions\multivariate_normal.py", line 149, in __init__
    self._unbroadcasted_scale_tril = torch.cholesky(covariance_matrix)
RuntimeError: cholesky_cpu: U(1,1) is zero, singular U.

Before the error occurs, it shows the following warnings:

C:\code\00_Testbereich\env\lib\site-packages\torch\nn\modules\module.py:385: UserWarning:

The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.

C:\code\00_Testbereich\env\lib\site-packages\gpytorch\distributions\multivariate_normal.py:229: NumericalWarning:

Negative variance values detected. This is likely due to numerical instabilities. Rounding negative variances up to 1e-10.

This is the code, excluding the evaluation function (which is an HTTP request that only uses the real parameters):

from ax.modelbridge.strategies.alebo import ALEBOStrategy
from ax.service.managed_loop import optimize


def optMRP():
    params = [
        {
            "name": "real_param1",
            "type": "range",
            "bounds": [0.0, 20000.0],
            "value_type": "float",
        },
        {
            "name": "real_param2",
            "type": "range",
            "bounds": [0.0, 100.0],
            "value_type": "float",
        },
        {
            "name": "real_param3",
            "type": "range",
            "bounds": [0.0, 20000.0],
            "value_type": "float",
        },
        {
            "name": "real_param4",
            "type": "range",
            "bounds": [0.0, 20000.0],
            "value_type": "float",
        },
        {
            "name": "real_param5",
            "type": "range",
            "bounds": [1.0, 5.0],
            "value_type": "float",
        },
    ]
    # 20 dummy parameters to pad the search space up to D=25 dimensions.
    for i in range(20):
        params.append(
            {"name": f"x{i}", "type": "range", "bounds": [1.0, 100.0], "value_type": "float"}
        )

    num_trials = 30
    D = 25
    d = 7
    init_size = 5
    alebo_strategy = ALEBOStrategy(D=D, d=d, init_size=init_size)

    best_parameters, values, experiment, model = optimize(
        parameters=params,
        experiment_name="test",
        objective_name="response1",
        minimize=True,
        outcome_constraints=["response2 >= 0.99"],
        total_trials=num_trials,
        evaluation_function=evaluation,
        generation_strategy=alebo_strategy,
    )
adamobeng commented 4 years ago

Hey, thanks for sharing this issue! This looks like a numerical stability problem. Would you be able to share the other error messages you've been seeing?

pehzet commented 4 years ago

Hey Adam, thanks for your time! There is also this warning message:

C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\utils.py:267: UserWarning:

This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple)
(Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:766.)

bletham commented 4 years ago

@Pzmijewski everything looks good with how you're setting up the optimization. The error is a numerical issue coming from GPyTorch, where it is failing to do a Cholesky decomposition, probably because the covariance matrix is nearly singular (the sketch below illustrates the failure mode). A few things to try...
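As a minimal sketch (toy matrices, not your actual model), this is how a numerically singular covariance matrix trips torch.cholesky, and how a small diagonal "jitter" restores decomposability:

import torch

# A singular covariance matrix: two identical rows/columns, so one
# eigenvalue is exactly zero.
C = torch.ones(2, 2, dtype=torch.float64)

try:
    torch.cholesky(C)
except RuntimeError as e:
    print(e)  # e.g. "cholesky_cpu: U(2,2) is zero, singular U."

# Adding a tiny jitter to the diagonal makes C strictly positive definite,
# and the factorization succeeds.
jitter = 1e-8
L = torch.cholesky(C + jitter * torch.eye(2, dtype=torch.float64))
print(L)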

pehzet commented 4 years ago

Hey @bletham, here is the evaluation function:

import random
import statistics


def evaluation(parameters):
    # We only evaluate the "real" parameters. They get rounded because we
    # normally use integers, but when we use ALEBO with int parameters we
    # get an AssertionError directly.
    r1 = round(parameters["real_param1"])
    r2 = round(parameters["real_param2"])
    r3 = round(parameters["real_param3"])
    r4 = round(parameters["real_param4"])
    r5 = round(parameters["real_param5"])

    resp1List = []
    resp2List = []

    # I built a loop to get enough data to fake a stochastic response.
    for i in range(15):
        resp1Det = ((30 * r1) + (15 * r2) + (20 * r3) + (3 * r4)) * r5
        resp1Stoch = random.uniform(resp1Det - 250, resp1Det + 250)
        resp1List.append(resp1Stoch)

        resp2Det = (r1 / 10000) + (r2 / 400)
        resp2Stoch = random.uniform(resp2Det - 0.02, resp2Det + 0.02)

        # resp2 can't be bigger than 1 in the real simulation, but it is hard
        # to build a suitable function that fast, so I added this workaround.
        if resp2Stoch >= 1:
            resp2Stoch = 1

        resp2List.append(resp2Stoch)

    resp1Mean = sum(resp1List) / len(resp1List)
    resp1SEM = statistics.stdev(resp1List)

    resp2Mean = sum(resp2List) / len(resp2List)
    resp2SEM = statistics.stdev(resp2List)

    return {"response1": (resp1Mean, resp1SEM), "response2": (resp2Mean, resp2SEM)}
bletham commented 4 years ago

Thanks for checking those things. If it's still erroring out with laplace_nsamp=1 then the issue isn't being caused by the Laplace approximation or posterior sampling, which I thought was the most likely way to produce a bad covariance matrix. It's something just in the core GP. Could you check into a few more things?

pehzet commented 4 years ago

Hey @bletham, I tried modifying the noise level of the evaluation and the error did not occur anymore. It works when I inflate the SEM by a small value (resp2: +0.03). But it also works when the SEM of resp2 is constantly 0.0. Resp1 has no influence on the error, I think, because it constantly has a large SEM. In reality, the SEM of resp2 is 0.0 in most of the trials. Maybe this produces a bad covariance matrix.

Regarding 'Repeated point': there are no repeated observations at the same point.

Regarding 'Bad kernel estimation': I printed out the hyperparameters (see ALEBO_kernel_hyperparams_error.txt). I hope this is what you meant.

bletham commented 4 years ago

Great, it looks like we're making some progress then.

There are two kernel hyperparameters for this model: an output scale, and a Mahalanobis distance matrix. The output scale in the hyperparameter dump you provided looks good. Taking the "Uvec" hyperparameter, we can compute the Mahalanobis distance matrix like so:

import torch

# Uvec is the flattened upper-triangular parameterization from the
# hyperparameter dump; d is the embedding dimension.
d = 7
shapeU = Uvec.shape[:-1] + torch.Size([d, d])
triu_indx = torch.triu_indices(d, d, device=Uvec.device)
U_t = torch.zeros(shapeU, dtype=Uvec.dtype, device=Uvec.device)
U_t[..., triu_indx[1], triu_indx[0]] = Uvec
M = U_t.transpose(1, 2) @ U_t

and then look at its smallest eigenvalue:

for i in range(25):
    print(torch.eig(M[i, :, :])[0][:, 0].min().item())

This prints the smallest eigenvalue for each of the 25 posterior-sampled sets of hyperparameters as:

1.7575287471910963e-10
7.02167280627783e-10
8.421177058077722e-09
7.549683742133882e-07
5.180293883781498e-09
3.2218512837638487e-09
5.425286168003913e-10
5.012016770789702e-11
6.765819393375633e-08
4.685423905192713e-08
4.5551528211898307e-08
1.6915862733468524e-12
1.0319863044935401e-09
8.04074731025304e-08
6.461751881046865e-10
6.133606015970187e-10
1.7642966790309564e-13
5.692042377774931e-11
1.789688227452934e-11
7.770126041914509e-14
1.7877804084621364e-08
5.2172017581691255e-08
2.1131224924273813e-10
4.2630483362064594e-08
8.510738050865059e-10

These are all positive, so the matrix is positive definite (PD) as it should be, but some of them are really close to 0. That may be leading to numerical issues downstream.

But it sounds like things are working smoothly with changes to the noise levels. Having a bit of noise is really helpful for numerical stability so we actually set a minimum noise level of 1e-7 even if the function is noiseless; that happens here: https://github.com/facebook/Ax/blob/a89fe684b0003edb17f1469d4e97055669d87629/ax/models/torch/alebo.py#L752
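(In code, that floor is just a clamp on the observed noise variances before the model is fit; paraphrasing the linked line, it is roughly Yvar = Yvar.clamp_min_(1e-7).)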

It sounds from your description like you do have an actual estimate of the noise level, but that inflating it a bit resolved the error here. The cost to the model performance from adding a little extra noise is pretty low, and well worth the improvement in model stability. A few more comments on the noise level -

Is the noise level being estimated from n repeated trials? If so, then just to be clear, the noise level the model expects is the standard error of the mean, i.e. stdev(y_1, ..., y_n) / sqrt(n) (in the example evaluation function above you computed just the standard deviation). For small n, this may at some points underestimate the true variance (by using the sample standard deviation instead of the true, population standard deviation), which could be why inflating it is necessary for model stability. What is n for this function?
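Concretely, the change to the evaluation function above would be something like this (a sketch reusing the resp1List/resp2List names from your code):

import statistics
from math import sqrt

def sem(ys):
    # Standard error of the mean: sample standard deviation divided by sqrt(n).
    return statistics.stdev(ys) / sqrt(len(ys))

resp1SEM = sem(resp1List)
resp2SEM = sem(resp2List)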

pehzet commented 4 years ago

Thanks for your help! I changed the standard deviation to standard error. In our original eval function we have n=30.

I will close the issue if that's fine with you. The optimization is now running stably. Thank you very much for explaining these things. It was very helpful :)

bletham commented 4 years ago

Awesome, glad to hear it's working now.