MCZhi / DIPP

[TNNLS] Differentiable Integrated Prediction and Planning Framework for Urban Autonomous Driving
https://mczhi.github.io/DIPP/

RuntimeError when training to 13 epochs #1

Closed ggosjw closed 2 years ago

ggosjw commented 2 years ago

First, thank you for your amazing work!

When training reached epoch 13, I encountered this error: RuntimeError: There was an error while running the linear optimizer. Original error message: torch.linalg_cholesky: (Batch element 8): The factorization could not be completed because the input is not positive-definite (the leading minor of order 18 is not positive-definite).. Backward pass will not work. To obtain the best solution seen before the error, run with torch.no_grad()

Could you kindly help me to find the reason? The full log is shown below:

Epoch 13/20 Train Progress: [ 29984/ 36111] Loss: 5.0243 0.2171s/sample
Traceback (most recent call last):
  File "/home/moovita/theseus/theseus/optimizer/nonlinear/nonlinear_optimizer.py", line 274, in _optimize_loop
    delta = self.compute_delta(**kwargs)
  File "/home/moovita/theseus/theseus/optimizer/nonlinear/gauss_newton.py", line 47, in compute_delta
    return self.linear_solver.solve()
  File "/home/moovita/theseus/theseus/optimizer/linear/dense_solver.py", line 113, in solve
    return self._apply_damping_and_solve(
  File "/home/moovita/theseus/theseus/optimizer/linear/dense_solver.py", line 75, in _apply_damping_and_solve
    return self._solve_sytem(Atb, AtA)
  File "/home/moovita/theseus/theseus/optimizer/linear/dense_solver.py", line 157, in _solve_sytem
    lower = torch.linalg.cholesky(AtA)
torch._C._LinAlgError: torch.linalg_cholesky: (Batch element 8): The factorization could not be completed because the input is not positive-definite (the leading minor of order 18 is not positive-definite).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 251, in <module>
    model_training()
  File "train.py", line 207, in model_training
    train_loss, train_metrics = train_epoch(train_loader, predictor, planner, optimizer, args.use_planning)
  File "train.py", line 53, in train_epoch
    final_values, info = planner.layer.forward(planner_inputs)
  File "/home/moovita/theseus/theseus/theseus_layer.py", line 88, in forward
    vars, info = _forward(
  File "/home/moovita/theseus/theseus/theseus_layer.py", line 148, in _forward
    info = optimizer.optimize(optimizer_kwargs)
  File "/home/moovita/theseus/theseus/optimizer/optimizer.py", line 43, in optimize
    return self._optimize_impl(kwargs)
  File "/home/moovita/theseus/theseus/optimizer/nonlinear/nonlinear_optimizer.py", line 357, in _optimize_impl
    self._optimize_loop(
  File "/home/moovita/theseus/theseus/optimizer/nonlinear/nonlinear_optimizer.py", line 281, in _optimize_loop
    raise RuntimeError(
RuntimeError: There was an error while running the linear optimizer. Original error message: torch.linalg_cholesky: (Batch element 8): The factorization could not be completed because the input is not positive-definite (the leading minor of order 18 is not positive-definite).. Backward pass will not work. To obtain the best solution seen before the error, run with torch.no_grad()

MCZhi commented 2 years ago

Thank you for your interest in our work. This problem comes from the Theseus solver: torch.linalg_cholesky cannot factorize the system matrix in this case because it is not positive-definite. I have no fix for the root cause, but a workaround is to change linear_solver_cls from th.CholeskyDenseSolver to th.CholmodSparseSolver in the MotionPlanner class in planner.py, which is more stable.
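In case it helps, here is a minimal, self-contained sketch of what that change looks like. The dummy objective, the variable names, and the max_iterations/step_size values below are placeholders for illustration, not DIPP's actual ones; only the linear_solver_cls swap is the relevant part.

```python
import torch
import theseus as th

# Tiny stand-in objective so the snippet runs on its own; DIPP's MotionPlanner
# builds its real objective from the planner's trajectory cost functions.
ctrl = th.Vector(dof=2, name="control_variables")
target = th.Variable(torch.zeros(1, 2), name="target")

def error_fn(optim_vars, aux_vars):
    # Placeholder residual: drive the optimization variable toward the target.
    return optim_vars[0].tensor - aux_vars[0].tensor

objective = th.Objective()
objective.add(
    th.AutoDiffCostFunction(
        [ctrl], error_fn, 2,
        cost_weight=th.ScaleCostWeight(1.0),
        aux_vars=[target],
        name="dummy_cost",
    )
)

# The workaround: solve the Gauss-Newton linear system with CHOLMOD's sparse
# Cholesky instead of the dense torch.linalg.cholesky-based solver.
optimizer = th.GaussNewton(
    objective,
    linear_solver_cls=th.CholmodSparseSolver,  # was: th.CholeskyDenseSolver
    max_iterations=50,   # placeholder value
    step_size=0.4,       # placeholder value
)
layer = th.TheseusLayer(optimizer)  # corresponds to planner.layer used in train.py

values, info = layer.forward(
    {"control_variables": torch.ones(1, 2), "target": torch.zeros(1, 2)}
)
```

Note that th.CholmodSparseSolver is backed by CHOLMOD through the scikit-sparse (sksparse) package, so that dependency needs to be installed; since it does not go through torch.linalg.cholesky, it tends to be more tolerant in practice when the system matrix is only marginally positive-definite.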

ggosjw commented 2 years ago

It really works for me! Thank you for your kind help and your great work.