Closed pehzet closed 4 years ago
Hey, thanks for sharing this issue! This looks like a numeric stability problem. Would you be able to share the other error messages you've been seeing?
hey Adam, thanks for your time! There is also this warning message: C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\utils.py:267: UserWarning:
This overload of nonzero is deprecated: nonzero() Consider using one of the following signatures instead: nonzero(*, bool as_tuple) (Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:766.)
@Pzmijewski everything looks good with how you're setting up the optimization. The error is a numerical issue coming from gpytorch, where it is failing to do a Cholesky decomposition, probably because the covariance matrix is nearly singular. A few things to try...
alebo_strategy = ALEBOStrategy(D=D, d=d, init_size=init_size, gp_kwargs={'laplace_nsamp': 1})
hey @bletham,
`import random
import statistics

def evaluation(parameters):
    # We only evaluate the "real" parameters. They get rounded because we normally use
    # integers, but when we use ALEBO with int we get an AssertionError directly.
    r1 = round(parameters["real_param1"])
    r2 = round(parameters["real_param2"])
    r3 = round(parameters["real_param3"])
    r4 = round(parameters["real_param4"])
    r5 = round(parameters["real_param5"])
    resp1List = []
    resp2List = []
    # I built a loop to get enough data to fake a stochastic response.
    for i in range(15):
        resp1Det = ((30 * r1) + (15 * r2) + (20 * r3) + (3 * r4)) * (r5)
        resp1Stoch = random.uniform((resp1Det - 250), (resp1Det + 250))
        resp1List.append(resp1Stoch)
        resp2Det = (r1 / 10000) + (r2 / 400)
        resp2Stoch = random.uniform((resp2Det - 0.02), (resp2Det + 0.02))
        # resp2 can't be bigger than 1 in the real simulation, but it is hard to build
        # a suitable function that fast, so I added this workaround.
        if resp2Stoch >= 1:
            resp2Stoch = 1
        resp2List.append(resp2Stoch)
    resp1Mean = sum(resp1List) / len(resp1List)
    # Sample standard deviation over the 15 repetitions, passed on as the noise level.
    resp1SEM = statistics.stdev(resp1List)
    resp2Mean = sum(resp2List) / len(resp2List)
    resp2SEM = statistics.stdev(resp2List)
    return {"response1": (resp1Mean, resp1SEM), "response2": (resp2Mean, resp2SEM)}`
Thanks for checking those things. If it's still erroring out with laplace_nsamp=1 then the issue isn't being caused by the Laplace approximation or posterior sampling, which I thought was the most likely way to produce a bad covariance matrix. It's something just in the core GP. Could you check into a few more things?
Noise level: as you seem to know from the code above, the current ALEBO implementation does not support estimating the noise level and it expects an accurate noise level to be passed in the evaluation function, as is being done in your example code. If the noise level were badly underestimated, it's possible that could produce a bad covariance matrix. Are you pretty confident in the noise level you're using in the real evaluation function? If you inflate it (say maybe 2x) does that improve stability at all?
Repeated points: this seems really unlikely given the parameters are all continuous and the nature of the embedding, but I've seen in the past that gpytorch can produce cholesky errors when there are multiple observations at the same point. If you look at the set of evaluations when the error is raised, is that happening here? Maybe because the model is querying some corner of the space multiple times?
Bad kernel estimation: the model doesn't currently have any regularization or priors on the kernel hyperparameters. I'm wondering if some of them are being badly fit and producing the error, and if maybe there is some regularization that would be appropriate for avoiding a bad part of the parameter space. Would it be possible to print out list(self.named_parameters()) at this level of the stack trace:
File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 170, in __call__
mvn = MultivariateNormal(mu, C)
to get the ALEBOGP kernel hyperparameters at the time of the crash? I hope it's clear what I mean. This would be best with laplace_nsamp=1 (since it will have fewer parameters).
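For the "Repeated points" check, a minimal sketch of how one might scan the evaluated points for duplicates. It assumes the parameterizations have already been collected into a plain list of dicts; the function name, the `trials` data, and the rounding tolerance are hypothetical and not part of Ax.

```python
from collections import defaultdict


def find_repeated_points(parameterizations, ndigits=6):
    """Group trial indices whose parameter values coincide after rounding.

    `parameterizations` is a hypothetical list of dicts (one per trial)
    mapping parameter name -> numeric value; it is not an Ax object.
    """
    groups = defaultdict(list)
    for index, params in enumerate(parameterizations):
        # Sort by name so the key does not depend on dict ordering.
        key = tuple(
            (name, round(value, ndigits)) for name, value in sorted(params.items())
        )
        groups[key].append(index)
    # Keep only the points that were observed more than once.
    return [indices for indices in groups.values() if len(indices) > 1]


# Made-up data for illustration: trials 0 and 2 are the same point.
trials = [
    {"real_param1": 0.25, "real_param2": 1.0},
    {"real_param1": 0.50, "real_param2": 1.0},
    {"real_param1": 0.25, "real_param2": 1.0},
]
print(find_repeated_points(trials))  # [[0, 2]]
```

An empty list means no two trials share a point at the chosen rounding precision.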
Hey @bletham, I tried modifying the noise level of the evaluation and the error no longer occurs. It works when I inflate the SEM by a small value (resp2: +0.03). But it also works when the SEM of resp2 is constantly 0.0. I think resp1 has no influence on the error, because its SEM is always large. In the real simulation, the SEM of resp2 is 0.0 in most of the trials. Maybe this produces a bad covariance matrix.
Regarding 'Repeated point': there are no multiple observations at the same point
Regarding 'Bad kernel estimation': i printed out the hyperparams (see ALEBO_kernel_hyperparams_error.txt). I hope this is what you meant.
Great, it looks like we're making some progress then.
There are two kernel hyperparameters for this model: an output scale, and a Mahalanobis distance matrix. The output scale in the hyperparameter dump you provided looks good. Taking the "Uvec" hyperparameter, we can compute the Mahalanobis distance matrix like so:
d = 7
shapeU = Uvec.shape[:-1] + torch.Size([d, d])
triu_indx = torch.triu_indices(d, d, device=Uvec.device)
U_t = torch.zeros(shapeU, dtype=Uvec.dtype, device=Uvec.device)
U_t[..., triu_indx[1], triu_indx[0]] = Uvec
M = (U_t.transpose(1, 2) @ U_t)
and then look at its smallest eigenvalue:
for i in range(25):
print(torch.eig(M[i, :, :])[0][:, 0].min().item())
This prints the smallest eigenvalue for each of the 25 posterior-sampled sets of hyperparameters as:
1.7575287471910963e-10
7.02167280627783e-10
8.421177058077722e-09
7.549683742133882e-07
5.180293883781498e-09
3.2218512837638487e-09
5.425286168003913e-10
5.012016770789702e-11
6.765819393375633e-08
4.685423905192713e-08
4.5551528211898307e-08
1.6915862733468524e-12
1.0319863044935401e-09
8.04074731025304e-08
6.461751881046865e-10
6.133606015970187e-10
1.7642966790309564e-13
5.692042377774931e-11
1.789688227452934e-11
7.770126041914509e-14
1.7877804084621364e-08
5.2172017581691255e-08
2.1131224924273813e-10
4.2630483362064594e-08
8.510738050865059e-10
These are all positive so the matrix is PD as it should be, but some of them are really close to 0. That may be leading to numerical issues downstream.
But it sounds like things are working smoothly with changes to the noise levels. Having a bit of noise is really helpful for numerical stability so we actually set a minimum noise level of 1e-7 even if the function is noiseless; that happens here: https://github.com/facebook/Ax/blob/a89fe684b0003edb17f1469d4e97055669d87629/ax/models/torch/alebo.py#L752
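To illustrate why a small noise floor helps, here is a toy sketch in plain Python (not the Ax or gpytorch implementation). It assumes the floor is applied as a variance added to the diagonal of the covariance matrix; the helper names `cholesky` and `add_noise` are hypothetical.

```python
import math


def cholesky(matrix):
    """Plain Cholesky factorization; raises ValueError on a non-positive pivot."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError(f"pivot {i} is not positive: {pivot}")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def add_noise(matrix, noise_variance):
    """Return a copy of `matrix` with `noise_variance` added to the diagonal."""
    return [
        [value + (noise_variance if i == j else 0.0) for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Two perfectly correlated observations give a singular covariance matrix.
singular = [[1.0, 1.0], [1.0, 1.0]]

try:
    cholesky(singular)
    factorized_without_noise = True
except ValueError:
    factorized_without_noise = False

# With a 1e-7 noise term on the diagonal the factorization goes through.
factor = cholesky(add_noise(singular, 1e-7))
print(factorized_without_noise)  # False
print(factor[1][1] > 0.0)  # True
```

The second pivot is exactly zero without noise, which is the same kind of failure as the "singular U" error, and becomes roughly 2e-7 once the noise term is added.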
It sounds from your description like you do have an actual estimate of the noise level, but that inflating it a bit resolved the error here. The cost to the model performance from adding a little extra noise is pretty low, and well worth the improvement in model stability. A few more comments on the noise level -
Is the noise level being estimated from n repeated trials? If so, then just to be clear, the noise level the model expects is the standard error of the mean, so stdev(y_1, ..., y_n) / sqrt(n) (in the example evaluation function above you computed just the standard deviation). For small n, this may for some points underestimate the true variance (by using the sample standard deviation instead of the true, population standard deviation) and so could be why inflating is necessary for model stability. What is n for this function?
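As a rough sketch of the standard error of the mean described here, using only the Python standard library. The helper name `mean_and_sem` and its `inflation` argument are hypothetical; the inflation factor only mirrors the "say maybe 2x" suggestion earlier in the thread and is not an Ax setting.

```python
import math
import statistics


def mean_and_sem(samples, inflation=1.0):
    """Return (mean, standard error of the mean) for repeated observations.

    `inflation` is a hypothetical safety factor applied to the SEM.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate the SEM")
    mean = statistics.fmean(samples)
    # Sample standard deviation divided by sqrt(n), not the standard deviation alone.
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, inflation * sem


# Made-up observations for illustration.
samples = [9.0, 10.0, 11.0, 10.0]
mean, sem = mean_and_sem(samples)
_, inflated_sem = mean_and_sem(samples, inflation=2.0)
print(mean, sem, inflated_sem)
```

In an evaluation function like the one above, the returned tuple would take the place of (resp1Mean, resp1SEM).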
Thanks for your help! I changed the standard deviation to the standard error. In our original eval function we have n=30.
I will close the issue if that's fine with you. The optimization is now running stably. Thank you very much for explaining these things. It was very helpful :)
Awesome, glad to hear it's working now.
I started testing the ALEBO Strategy with 5 real parameters and 20 dummy parameters (see code below), based on this quickstart. After a few iterations I get the following error. Does anyone know how to fix it? I have seen that "RuntimeError: cholesky_cpu: U(1,1) is zero, singular U." appears in several other issues, but I think this might be a different problem.
Traceback (most recent call last):
File "C:\code\00_Testbereich\BO\testBo.py", line 101, in <module>
optMRP()
File "C:\code\00_Testbereich\BO\testBo.py", line 78, in optMRP
generation_strategy=alebo_strategy
File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 224, in optimize
loop.full_run()
File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 166, in full_run
self.run_trial()
File "C:\code\00_Testbereich\env\lib\site-packages\ax\service\managed_loop.py", line 145, in run_trial
experiment=self.experiment
File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\generation_strategy.py", line 376, in gen
keywords=get_function_argument_names(model.gen),
File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\base.py", line 626, in gen
model_gen_options=model_gen_options,
File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\array.py", line 225, in _gen
target_fidelities=target_fidelities,
File "C:\code\00_Testbereich\env\lib\site-packages\ax\modelbridge\torch.py", line 224, in _model_gen
target_fidelities=target_fidelities,
File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 657, in gen
model_gen_options=model_gen_options,
File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\botorch.py", line 367, in gen
acf_options,
File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 467, in ei_or_nei
X_pending=X_pending,
File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\botorch_defaults.py", line 171, in get_NEI
inf_cost = get_infeasible_cost(X=X_observed, model=model, objective=obj_tf)
File "C:\code\00_Testbereich\env\lib\site-packages\botorch\acquisition\utils.py", line 173, in get_infeasible_cost
posterior = model.posterior(X)
File "C:\code\00_Testbereich\env\lib\site-packages\botorch\models\gpytorch.py", line 500, in posterior
mvns = self(*[X for _ in range(self.num_outputs)])
File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\models\model_list.py", line 83, in __call__
return [model.__call__(*args_, **kwargs) for model, args_ in zip(self.models, _get_tensor_args(*args))]
File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\models\model_list.py", line 83, in <listcomp>
return [model.__call__(*args_, **kwargs) for model, args_ in zip(self.models, _get_tensor_args(*args))]
File "C:\code\00_Testbereich\env\lib\site-packages\ax\models\torch\alebo.py", line 170, in __call__
mvn = MultivariateNormal(mu, C)
File "C:\code\00_Testbereich\env\lib\site-packages\gpytorch\distributions\multivariate_normal.py", line 49, in __init__
super().__init__(loc=mean, covariance_matrix=covariance_matrix, validate_args=validate_args)
File "C:\code\00_Testbereich\env\lib\site-packages\torch\distributions\multivariate_normal.py", line 149, in __init__
self._unbroadcasted_scale_tril = torch.cholesky(covariance_matrix)
RuntimeError: cholesky_cpu: U(1,1) is zero, singular U.
Before the error occurs, it shows the following warnings:
C:\code\00_Testbereich\env\lib\site-packages\torch\nn\modules\module.py:385: UserWarning:
The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.
C:\code\00_Testbereich\env\lib\site-packages\gpytorch\distributions\multivariate_normal.py:229: NumericalWarning:
Negative variance values detected. This is likely due to numerical instabilities. Rounding negative variances up to 1e-10.
This is the code, excluding the evaluation function (which is an HTTP request that only uses the real params):