smartalecH opened this issue 3 years ago
My concern is that, in solving the dual problem for the preconditioned CCSA algorithm, you need to evaluate the action of the Hessian many times, so the cost is greater than you think.
It might be worth it, then, to cache one (or more) gradients and approximate the Hessian-vector product using a finite difference. In other words, an L-BFGS-like variant of the CCSA algorithm.
This would still require a gradient calculation at every inner iteration, but that's only one more simulation, versus the three extra simulations required by an explicit Hessian-vector product (HVP).
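Something like the following sketch, where the hypothetical `grad_f` callable stands in for the solver's adjoint gradient routine (none of these names are from Meep or NLopt):

```python
import numpy as np

def approx_hvp(grad_f, x, v, grad_x=None, eps=1e-6):
    """Forward-difference HVP: H @ v ~= (grad(x + h*v) - grad(x)) / h.

    grad_x is the cached gradient at x (reused across inner iterations),
    so each HVP costs only one extra gradient evaluation instead of the
    extra simulations an exact adjoint HVP would require.
    """
    if grad_x is None:
        grad_x = grad_f(x)
    # Scale the step to the magnitudes of x and v to keep the difference stable.
    h = eps * max(1.0, np.linalg.norm(x)) / max(np.linalg.norm(v), 1e-30)
    return (grad_f(x + h * v) - grad_x) / h
```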
That could work. IIRC, the only requirement for CCSA to converge is that your preconditioner (Hessian estimate) must be positive-semidefinite (which keeps the inner approximation conservative and convex), which L-BFGS mostly guarantees. However, you will need to check that the second Wolfe condition holds before updating your BFGS Hessian, as described in section 3.1 of my notes: https://github.com/mitmath/18335/blob/master/notes/BFGS.pdf
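Concretely, that guard might look like this sketch (hypothetical names: `Bk` is the current BFGS Hessian estimate, `s` the step, `yk` the gradient change). The curvature condition s·y > 0, which the second Wolfe condition guarantees, is checked before the standard BFGS update, and the update is skipped when it fails so the estimate stays positive-definite, as CCSA requires of its preconditioner:

```python
import numpy as np

def bfgs_update(Bk, s, yk, tol=1e-10):
    """One BFGS update of the Hessian estimate Bk, skipped if s @ yk <= 0."""
    sy = s @ yk
    if sy <= tol * np.linalg.norm(s) * np.linalg.norm(yk):
        # Curvature condition fails (second Wolfe condition was not enforced):
        # skip the update so Bk remains positive-definite.
        return Bk
    Bs = Bk @ s
    return Bk - np.outer(Bs, Bs) / (s @ Bs) + np.outer(yk, yk) / sy
```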
Nonlinear optimization algorithms that use only gradient information (i.e. "first-order methods") can have trouble traversing the cost landscape when the Hessian becomes ill-conditioned. For example, "ravines" and "plateaus" can force the optimizer to oscillate rapidly, especially when nonlinear constraints "squeeze" the feasible region.
Many algorithms can use Hessian information to "precondition" the solver so that it traverses these difficult regions quickly. Solvers that leverage the Hessian directly (e.g. Newton methods) can even achieve quadratic convergence.
Unfortunately, computing the Hessian directly in the context of photonic topology optimization isn't practical. Luckily, we can use adjoint methods to efficiently compute the action of the Hessian in a particular direction. Specifically, if the forward-run results (y) and adjoint-run results (λ) are still stored from a gradient computation, then the Hessian's action on a vector w can be calculated with just two more linear solves (a new forward run m and a new adjoint run μ). As long as the vector w can be provided ahead of time (which is often the case), the next optimization step can be computed with four linear solves.
While this may seem twice as expensive, in practice it should deliver (roughly) quadratic convergence and help the solver make progress where a purely first-order method gets stuck.
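To make the solve-count bookkeeping concrete, here is a minimal dense-matrix sketch of that structure, assuming a hypothetical linear parameterization A(p) = A0 + Σᵢ pᵢ Bᵢ and an objective f(p) = g(y) with A(p) y = b (these names are illustrative, not Meep's API). The gradient costs one forward solve (y) and one adjoint solve (lam); reusing those, the Hessian action on w costs just two more solves (m and mu):

```python
import numpy as np

rng = np.random.default_rng(0)
n, npar = 6, 4
A0 = np.eye(n) * 10.0
B = [rng.standard_normal((n, n)) * 0.1 for _ in range(npar)]
b = rng.standard_normal(n)

def g(y):      return 0.5 * y @ y  # toy objective g(y)
def dg(y):     return y            # g'(y)
def d2g(y, v): return v            # g''(y) @ v (identity for this toy g)

def A(p):
    return A0 + sum(pi * Bi for pi, Bi in zip(p, B))

def gradient(p):
    """Gradient via one forward solve (y) and one adjoint solve (lam)."""
    Ap = A(p)
    y = np.linalg.solve(Ap, b)          # forward run
    lam = np.linalg.solve(Ap.T, dg(y))  # adjoint run
    return np.array([-(lam @ (Bi @ y)) for Bi in B]), y, lam

def hvp(p, w, y, lam):
    """Hessian action H @ w, reusing cached y and lam: two extra solves."""
    Ap = A(p)
    Bw = sum(wi * Bi for wi, Bi in zip(w, B))
    m = np.linalg.solve(Ap, -(Bw @ y))                   # new "forward" run
    mu = np.linalg.solve(Ap.T, d2g(y, m) - Bw.T @ lam)   # new "adjoint" run
    return np.array([-(mu @ (Bi @ y)) - (lam @ (Bi @ m)) for Bi in B])

# Sanity check against a finite difference of the gradient.
p, w = rng.standard_normal(npar), rng.standard_normal(npar)
grad, y, lam = gradient(p)
eps = 1e-6
fd = (gradient(p + eps * w)[0] - grad) / eps
print(np.allclose(hvp(p, w, y, lam), fd, atol=1e-4))  # -> True
```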
The recombination step is a little tricky, especially as we integrate all the subpixel smoothing into the material grid, but it certainly seems worth it.