facebook / Ax

Adaptive Experimentation Platform
https://ax.dev

ALEBO: "RuntimeError: cholesky_cpu: U(1,1) is zero, singular U." #384

Closed: pehzet closed this issue 4 years ago

pehzet commented 4 years ago

I started testing the ALEBO strategy with 5 real parameters and 20 dummy parameters (see code below), based on this quickstart. After a few iterations I get the following error. Does anyone know how to fix it? I have seen that "RuntimeError: cholesky_cpu: U(1,1) is zero, singular U." appears in other issues as well, but I think this might be a different problem.

Traceback (most recent call last):
  File "C:\code\00_Testbereich\BO\testBo.py", line 101, in <module>
    optMRP()
  File "C:\code\00_Testbereich\BO\testBo.py", line 78, in optMRP
    generation_strategy=alebo_strategy
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 224, in optimize
    loop.full_run()
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 166, in full_run
    self.run_trial()
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 145, in run_trial
    experiment=self.experiment
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\generation_strategy.py", line 376, in gen
    keywords=get_function_argument_names(model.gen),
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\base.py", line 626, in gen
    model_gen_options=model_gen_options,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\array.py", line 225, in _gen
    target_fidelities=target_fidelities,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\torch.py", line 224, in _model_gen
    target_fidelities=target_fidelities,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 657, in gen
    model_gen_options=model_gen_options,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\botorch.py", line 367, in gen
    acf_options,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 467, in ei_or_nei
    X_pending=X_pending,
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\botorch_defaults.py", line 171, in get_NEI
    inf_cost = get_infeasible_cost(X=X_observed, model=model, objective=obj_tf)
  File "C:\code\00_Testbereich\env\lib\site-packages\botorch\acquisition\utils.py", line 173, in get_infeasible_cost
    posterior = model.posterior(X)
  File "C:\code\00_Testbereich\env\lib\site-packages\botorch\models\gpytorch.py", line 500, in posterior
    mvns = self(*[X for _ in range(self.num_outputs)])
  File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\models\model_list.py", line 83, in __call__
    return [model.__call__(*args_, **kwargs) for model, args_ in zip(self.models, _get_tensor_args(*args))]
  File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\models\model_list.py", line 83, in <listcomp>
    return [model.__call__(*args_, **kwargs) for model, args_ in zip(self.models, _get_tensor_args(*args))]
  File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 170, in __call__
    mvn = MultivariateNormal(mu, C)
  File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\distributions\multivariate_normal.py", line 49, in __init__
    super().__init__(loc=mean, covariance_matrix=covariance_matrix, validate_args=validate_args)
  File "C:\code\00_Testbereich\env\lib\site-packages\torch\distributions\multivariate_normal.py", line 149, in __init__
    self._unbroadcasted_scale_tril = torch.cholesky(covariance_matrix)
RuntimeError: cholesky_cpu: U(1,1) is zero, singular U.

Before the error occurs, it shows the following warnings:

C:\code\00_Testbereich\env\lib\site-packages\torch\nn\modules\module.py:385: UserWarning:

The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.

C:\code\00_Testbereich\env\lib\site-packages\gpytorch\distributions\multivariate_normal.py:229: NumericalWarning:

Negative variance values detected. This is likely due to numerical instabilities. Rounding negative variances up to 1e-10.

This is the code, excluding the evaluation function (which is an HTTP request that only uses the real parameters):

from ax.modelbridge.strategies.alebo import ALEBOStrategy
from ax.service.managed_loop import optimize


def optMRP():
    params = [
        {
            "name": "real_param1",
            "type": "range",
            "bounds": [0.0, 20000.0],
            "value_type": "float",
        },
        {
            "name": "real_param2",
            "type": "range",
            "bounds": [0.0, 100.0],
            "value_type": "float",
        },
        {
            "name": "real_param3",
            "type": "range",
            "bounds": [0.0, 20000.0],
            "value_type": "float",
        },
        {
            "name": "real_param4",
            "type": "range",
            "bounds": [0.0, 20000.0],
            "value_type": "float",
        },
        {
            "name": "real_param5",
            "type": "range",
            "bounds": [1.0, 5.0],
            "value_type": "float",
        },
    ]
    # 20 dummy parameters to pad the search space up to D=25 dimensions.
    for i in range(20):
        params.append(
            {"name": f"x{i}", "type": "range", "bounds": [1.0, 100.0], "value_type": "float"}
        )

    num_trials = 30
    D = 25
    d = 7
    init_size = 5
    alebo_strategy = ALEBOStrategy(D=D, d=d, init_size=init_size)

    best_parameters, values, experiment, model = optimize(
        parameters=params,
        experiment_name="test",
        objective_name="response1",
        minimize=True,
        outcome_constraints=["response2 >= 0.99"],
        total_trials=num_trials,
        evaluation_function=evaluation,
        generation_strategy=alebo_strategy,
    )
adamobeng commented 4 years ago

Hey, thanks for sharing this issue! This looks like a numerical stability problem. Would you be able to share the other error messages you've been seeing?

pehzet commented 4 years ago

Hey Adam, thanks for your time! There is also this warning message:

C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\utils.py:267: UserWarning:

This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple)
(Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:766.)

bletham commented 4 years ago

@Pzmijewski everything looks good with how you're setting up the optimization. The error is a numerical issue coming from GPyTorch, where it is failing to do a Cholesky decomposition, probably because the covariance matrix is nearly singular (the sketch below illustrates the failure mode). A few things to try...
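As a minimal sketch (toy matrices, not your actual model), this is how a numerically singular covariance matrix trips torch.cholesky, and how a small diagonal "jitter" restores decomposability:

import torch

# A singular covariance matrix: two identical rows/columns, so one
# eigenvalue is exactly zero.
C = torch.ones(2, 2, dtype=torch.float64)

try:
    torch.cholesky(C)
except RuntimeError as e:
    print(e)  # e.g. "cholesky_cpu: U(2,2) is zero, singular U."

# Adding a tiny jitter to the diagonal makes C strictly positive definite,
# and the factorization succeeds.
jitter = 1e-8
L = torch.cholesky(C + jitter * torch.eye(2, dtype=torch.float64))
print(L)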

pehzet commented 4 years ago

Hey @bletham, here is the evaluation function:

import random
import statistics


def evaluation(parameters):
    # We only evaluate the "real" parameters. They get rounded because we
    # normally use integers, but when we use ALEBO with int parameters we
    # get an AssertionError directly.
    r1 = round(parameters["real_param1"])
    r2 = round(parameters["real_param2"])
    r3 = round(parameters["real_param3"])
    r4 = round(parameters["real_param4"])
    r5 = round(parameters["real_param5"])

    resp1List = []
    resp2List = []

    # I built a loop to get enough data to fake a stochastic response.
    for i in range(15):
        resp1Det = ((30 * r1) + (15 * r2) + (20 * r3) + (3 * r4)) * r5
        resp1Stoch = random.uniform(resp1Det - 250, resp1Det + 250)
        resp1List.append(resp1Stoch)

        resp2Det = (r1 / 10000) + (r2 / 400)
        resp2Stoch = random.uniform(resp2Det - 0.02, resp2Det + 0.02)

        # resp2 can't be bigger than 1 in the real simulation, but it is hard
        # to build a suitable function that fast, so I added this workaround.
        if resp2Stoch >= 1:
            resp2Stoch = 1

        resp2List.append(resp2Stoch)

    resp1Mean = sum(resp1List) / len(resp1List)
    resp1SEM = statistics.stdev(resp1List)

    resp2Mean = sum(resp2List) / len(resp2List)
    resp2SEM = statistics.stdev(resp2List)

    return {"response1": (resp1Mean, resp1SEM), "response2": (resp2Mean, resp2SEM)}
bletham commented 4 years ago

Thanks for checking those things. If it's still erroring out with laplace_nsamp=1 then the issue isn't being caused by the Laplace approximation or posterior sampling, which I thought was the most likely way to produce a bad covariance matrix. It's something just in the core GP. Could you check into a few more things?

pehzet commented 4 years ago

Hey @bletham, I tried modifying the noise level of the evaluation and the error did not occur anymore. It works when I inflate the SEM by a small value (resp2: +0.03). But it also works when the SEM of resp2 is constantly 0.0. Resp1 has no influence on the error, I think, because it constantly has a large SEM. In reality, the SEM of resp2 is 0.0 in most of the trials. Maybe this produces a bad covariance matrix.

Regarding 'Repeated point': there are no repeated observations at the same point.

Regarding 'Bad kernel estimation': I printed out the hyperparameters (see ALEBO_kernel_hyperparams_error.txt). I hope this is what you meant.

bletham commented 4 years ago

Great, it looks like we're making some progress then.

There are two kernel hyperparameters for this model: an output scale, and a Mahalanobis distance matrix. The output scale in the hyperparameter dump you provided looks good. Taking the "Uvec" hyperparameter, we can compute the Mahalanobis distance matrix like so:

import torch

# Uvec is the flattened upper-triangular parameterization from the
# hyperparameter dump; d is the embedding dimension.
d = 7
shapeU = Uvec.shape[:-1] + torch.Size([d, d])
triu_indx = torch.triu_indices(d, d, device=Uvec.device)
U_t = torch.zeros(shapeU, dtype=Uvec.dtype, device=Uvec.device)
U_t[..., triu_indx[1], triu_indx[0]] = Uvec
M = U_t.transpose(1, 2) @ U_t

and then look at its smallest eigenvalue:

for i in range(25):
    print(torch.eig(M[i, :, :])[0][:, 0].min().item())

This prints the smallest eigenvalue for each of the 25 posterior-sampled sets of hyperparameters as:

1.7575287471910963e-10
7.02167280627783e-10
8.421177058077722e-09
7.549683742133882e-07
5.180293883781498e-09
3.2218512837638487e-09
5.425286168003913e-10
5.012016770789702e-11
6.765819393375633e-08
4.685423905192713e-08
4.5551528211898307e-08
1.6915862733468524e-12
1.0319863044935401e-09
8.04074731025304e-08
6.461751881046865e-10
6.133606015970187e-10
1.7642966790309564e-13
5.692042377774931e-11
1.789688227452934e-11
7.770126041914509e-14
1.7877804084621364e-08
5.2172017581691255e-08
2.1131224924273813e-10
4.2630483362064594e-08
8.510738050865059e-10

These are all positive, so the matrix is positive definite (PD) as it should be, but some of them are really close to 0. That may be leading to numerical issues downstream.

But it sounds like things are working smoothly with changes to the noise levels. Having a bit of noise is really helpful for numerical stability so we actually set a minimum noise level of 1e-7 even if the function is noiseless; that happens here: https://github.com/facebook/Ax/blob/a89fe684b0003edb17f1469d4e97055669d87629/ax/models/torch/alebo.py#L752
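(In code, that floor is just a clamp on the observed noise variances before the model is fit; paraphrasing the linked line, it is roughly Yvar = Yvar.clamp_min_(1e-7).)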

It sounds from your description like you do have an actual estimate of the noise level, but that inflating it a bit resolved the error here. The cost to the model performance from adding a little extra noise is pretty low, and well worth the improvement in model stability. A few more comments on the noise level -

Is the noise level being estimated from n repeated trials? If so, then just to be clear, the noise level the model expects is the standard error of the mean, i.e. stdev(y_1, ..., y_n) / sqrt(n) (in the example evaluation function above you computed just the standard deviation). For small n, this may at some points underestimate the true variance (by using the sample standard deviation instead of the true, population standard deviation), which could be why inflating it is necessary for model stability. What is n for this function?
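Concretely, the change to the evaluation function above would be something like this (a sketch reusing the resp1List/resp2List names from your code):

import statistics
from math import sqrt

def sem(ys):
    # Standard error of the mean: sample standard deviation divided by sqrt(n).
    return statistics.stdev(ys) / sqrt(len(ys))

resp1SEM = sem(resp1List)
resp2SEM = sem(resp2List)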

pehzet commented 4 years ago

Thanks for your help! I changed the standard deviation to standard error. In our original eval function we have n=30.

I will close the issue if that's fine with you. The optimization is now running stably. Thank you very much for explaining these things. It was very helpful :)

bletham commented 4 years ago

Awesome, glad to hear it's working now.