GPflow / GPflowOpt

Bayesian Optimization using GPflow
Apache License 2.0

Cholesky decomposition fails #38

Open mccajm opened 7 years ago

mccajm commented 7 years ago

I receive the following error when optimising over 2 dimensions, using GPR with an RBF ARD kernel and a Latin hypercube design of size 10. I assume this is because the matrix can't be decomposed? Is this fixable by changing the design or adding priors?

Thanks

2017-07-20 01:50:18.494935: W tensorflow/core/framework/op_kernel.cc:1158] Internal: cuSolverDN call failed with status =7
Traceback (most recent call last):
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/home/adathy/miniconda3/lib/python3.6/contextlib.py", line 89, in __exit__
    next(self.gen)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: cuSolverDN call failed with status =7
    [[Node: Cholesky_1 = CholeskyT=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/gpu:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "t1-hyperparam.py", line 103, in <module>
    optimizer.optimize(run_model, n_iter=10)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/bo.py", line 131, in optimize
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/optim.py", line 79, in optimize
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/bo.py", line 147, in _optimize
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/bo.py", line 67, in _update_model_data
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/acquisition.py", line 122, in set_data
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/acquisition.py", line 254, in setup
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflow-0.3.8-py3.6.egg/GPflow/param.py", line 569, in runnable
    return storage['session'].run(storage['tf_result'], feed_dict=feed_dict)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cuSolverDN call failed with status =7
    [[Node: Cholesky_1 = CholeskyT=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Caused by op 'Cholesky_1', defined at:
  File "t1-hyperparam.py", line 101, in <module>
    acquisition = GPflowOpt.acquisition.ExpectedImprovement(model)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/acquisition.py", line 248, in __init__
    self.setup()
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/acquisition.py", line 254, in setup
    samplesmean, = self.models[0].predict_f(feasible_samples)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflow-0.3.8-py3.6.egg/GPflow/param.py", line 561, in runnable
    storage['tf_result'] = tf_method(instance, *storage['tf_args'])
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflow-0.3.8-py3.6.egg/GPflow/model.py", line 373, in predict_f
    return self.build_predict(Xnew)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/scaling.py", line 210, in build_predict
    return self.output_transform.build_backward(f), self.output_transform.build_backward_variance(var)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/transforms.py", line 112, in build_backward
    L = tf.cholesky(tf.transpose(self.A))
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_linalg_ops.py", line 227, in cholesky
    result = _op_def_lib.apply_op("Cholesky", input=input, name=name)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): cuSolverDN call failed with status =7 [[Node: Cholesky_1 = CholeskyT=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/gpu:0"]]

javdrher commented 7 years ago

This issue is indeed caused by a Cholesky decomposition failure, which can have several causes. Does it happen immediately after the initial 10 points, or after some iterations of BayesianOptimizer? In the former case, first try to fit the points with the GPflow model itself, then tune the initial hyperparameters or add a prior. In the latter case, check the data just before the crash. Do you have duplicate points? If not, try to model the data again and tune the initial hyperparameters/priors.
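To illustrate the duplicate-point check, here is a rough standalone sketch in plain Python. It does not use GPflow or GPflowOpt; the helper names (`rbf_kernel`, `cholesky`, `find_duplicates`, `add_jitter`), the tolerance, the jitter value and the sample data are all made up for the example. It shows that two identical rows in the design make a noise-free RBF kernel matrix singular, so a plain Cholesky factorisation fails, and that a small diagonal term (which plays a similar role to the likelihood noise variance in GP regression) makes it factorisable again.

```python
import math


def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel matrix for a list of points."""
    scale = -0.5 / lengthscale ** 2
    return [
        [variance * math.exp(scale * sum((a - b) ** 2 for a, b in zip(p, q)))
         for q in X]
        for p in X
    ]


def cholesky(K):
    """Textbook Cholesky factorisation; raises ValueError on a non-positive pivot."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def find_duplicates(X, tol=1e-9):
    """Return index pairs (i, j), i < j, whose points coincide within tol."""
    pairs = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if all(abs(a - b) <= tol for a, b in zip(X[i], X[j])):
                pairs.append((i, j))
    return pairs


def add_jitter(K, jitter=1e-6):
    """Return a copy of K with a small constant added to the diagonal."""
    return [[v + (jitter if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(K)]


# Made-up 2-D design in which the first and last points coincide.
X = [[0.1, 0.2], [0.5, 0.5], [0.1, 0.2]]

print("duplicate pairs:", find_duplicates(X))

try:
    cholesky(rbf_kernel(X))
except ValueError as err:
    print("Cholesky failed:", err)

# With jitter on the diagonal the same matrix can be factorised.
L = cholesky(add_jitter(rbf_kernel(X)))
print("Cholesky with jitter succeeded, diagonal:", [L[i][i] for i in range(len(L))])
```

If a check like this reports duplicates in the data collected so far, removing them (or allowing a larger noise variance) is a reasonable first thing to try before changing priors.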

I have also opened a PR (#40) which will make saving data in case of a crash easier. Just resolving some compatibility issues now.