JuliaImageRecon / RegularizedLeastSquares.jl

MIT License
20 stars 9 forks source link

implement OptISTA/POGM/FISTA with adaptive restart #52

Closed andrewwmao closed 1 year ago

andrewwmao commented 1 year ago

Hi @tknopp @migrosser @JakobAsslaender @JeffFessler, I implemented the POGM method originally described here and the adaptive restart with decaying momentum version described here. I put these references in the comments.

  1. POGM should be expected to have a 2x better worst-case convergence than FISTA, which is confirmed by my own personal experience in both 2D and 3D radial data. It is also passing the solver tests. Decaying momentum does not seem to help in my own case and is off by default for POGM.
  2. I only implemented the gradient-based restarting scheme for now, as it is the cheapest to compute, and also added this option to the FISTA algorithm. It is off by default for both algs.
  3. Unfortunately, I had to introduce 3 auxiliary variables the same size as the image x for POGM which does inflate the memory requirements relative to FISTA. I am wondering especially if @JeffFessler can take a look as he is one of the authors of said algorithm ;)
JakobAsslaender commented 1 year ago

I made those changes to remove allocations. From my end, this should be good to go. Thanks, Andrew!

JeffFessler commented 1 year ago

It's great to see this implemented with "in place" operations. We have a version here that does not have those memory savings: https://github.com/JeffFessler/MIRT.jl/blob/main/src/algorithm/general/pogm_restart.jl (It might be wise to check that the two versions give the same results.) A former student @cecroc started an in-place version but I think it is still a WIP.

In terms of memory, there is an alternative to POGM that just appeared earlier this week (!) that has the same rate: http://arxiv.org/abs/2305.15704 The math looks slightly simpler, but I have not studied it carefully enough to see if it reduces the number of auxiliary variables that one must store.

andrewwmao commented 1 year ago

Very interesting reference! At a quick glance it looks like it would require the same # of intermediate variables. But is it correct to say that this guarantees a 2x better worst-case rate than FISTA, whereas POGM does not? I may have a look to see if I can implement this as well.

JeffFessler commented 1 year ago

2x better worst-case rate than FISTA, whereas POGM does not?

POGM has the same 2x factor by a "computer assisted" numerical proof. OptISTA has an analytical proof. In both cases the 2x is "worst case" bound so it can be interesting to compare them for any given practical problem.

andrewwmao commented 1 year ago

I implemented OptISTA as well and in my experience so far in our 2D/3D radial data, the performance is virtually identical to that of POGM! I did manage to reduce 1 additional auxiliary variable, and the code is indeed simpler compared to POGM.

JeffFessler commented 1 year ago

Terrific!

I did manage to reduce 1 additional auxiliary variable Did that reduction apply to both POGM and OptISTA or just one of them?

andrewwmao commented 1 year ago

That is with respect to POGM, so still 2 more compared to FISTA. The extra variable in my POGM implementation comes from the gradient restart part, which isn't used in OptFISTA, so they are actually comparable.

JakobAsslaender commented 1 year ago

One thing worth considering is if we want to create two separate struct for FISTA with and w/o gradient restart (same for POGM). This would save some memory in the latter case, but make the code a bit more ugly.

tknopp commented 1 year ago

This all looks very good. Thank you for the contribution @andrewwmao!

I will merge this now and more optimization can be made in new PR. @andrewwmao: If you don't have commit rights I will grant you that.

One thing worth considering is if we want to create two separate struct for FISTA with and w/o gradient restart (same for POGM). This would save some memory in the latter case, but make the code a bit more ugly.

If possible it is better to keep one struct. It is no problem to initialize unused vectors with length 0 which will then have 0 overhead. An alternative is to use a parametric type and make tmp::Union{Nothing, Vector{T}}.

andrewwmao commented 1 year ago

I don't believe I currently have commit rights -- thanks! I will look into these further optimizations when I have a chance.