Closed: davidweichiang closed this issue 2 years ago.
N&S refers to: https://archive.siam.org/books/textbooks/fr16_book.pdf, pages 126-127.
In the pull request for Broyden's method, I made the following comment about optimization:
Also, I still need to optimize Broyden's method, including the addition of Jacobian "inversion" through LU decomposition (or QR decomposition) and benchmarking the "bad" Broyden's method and the Sherman-Morrison formula.
I spent today working on Newton's method, but I shifted toward optimizing Broyden's method when I noticed the issue you created about the Jacobian on GitHub. Indeed, the Jacobian inverse should be updated with each iteration of Broyden's method, but that requires the Sherman-Morrison formula mentioned in the pull request. I quickly implemented that formula for the Jacobian inverse and verified convergence of the algorithm. Note that the performance remained almost identical when using the Sherman-Morrison formula compared to implicitly inverting the Jacobian through torch.solve. I would imagine that the small size of the Jacobian matrices used throughout unit testing makes the performance difference insignificant.
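For concreteness, here's a minimal NumPy sketch of what I mean by the "good" Broyden update applied directly to the inverse Jacobian via Sherman-Morrison. This is illustrative code of my own (the name `broyden_good` and all parameters are made up), not the committed PyTorch implementation:

```python
import numpy as np

def broyden_good(F, x0, J_inv=None, tol=1e-10, max_iter=100):
    """'Good' Broyden's method, maintaining an approximation of the
    INVERSE Jacobian via the Sherman-Morrison rank-one update, so no
    linear system is solved inside the loop."""
    x = np.asarray(x0, dtype=float)
    # Crude initial guess for J(x0)^{-1}; a finite-difference Jacobian
    # would give a better starting point.
    J_inv = np.eye(x.size) if J_inv is None else J_inv
    f = F(x)
    for _ in range(max_iter):
        dx = -J_inv @ f              # quasi-Newton step
        x = x + dx
        f_new = F(x)
        if np.linalg.norm(f_new) < tol:
            break
        df, f = f_new - f, f_new
        # Sherman-Morrison update of the inverse:
        # J_inv <- J_inv + (dx - J_inv df) (dx^T J_inv) / (dx^T J_inv df)
        u = J_inv @ df
        denom = dx @ u
        if abs(denom) < 1e-30:       # update undefined; bail out in this sketch
            break
        J_inv = J_inv + np.outer(dx - u, dx @ J_inv) / denom
    return x
```

The point of the update is that each iteration costs O(n^2) instead of the O(n^3) of a fresh solve, which is why the difference doesn't show up on the tiny Jacobians in the unit tests.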
In addition to the Sherman-Morrison formula, I also implemented the "bad" Broyden's method. That method should be roughly twice as fast per iteration as the "good" Broyden's method with the Sherman-Morrison formula (2n^2 vs. 4n^2 operations). Note that the "good" method minimizes the change between successive Jacobian approximations, whereas the "bad" method minimizes the change between successive inverse Jacobian approximations. For that reason, some people (including Broyden himself) argue that you should only use the "good" method (see http://paulklein.se/newsite/teaching/broyden.pdf), while others argue for the "bad" method (see "A faster Broyden method"). In my testing, however, the "bad" method took more iterations to converge and required almost twice the runtime, contradicting the claim of double the performance. I don't know yet what caused that discrepancy, but I'll continue investigating. In the meantime, I'll commit the updated "good" Broyden's method using the Sherman-Morrison formula.
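For comparison, here's the corresponding "bad" update, again as my own sketch rather than the committed code. The update touches `J_inv` with a single matrix-vector product per iteration, which is where the 2n^2 vs. 4n^2 count comes from:

```python
import numpy as np

def broyden_bad(F, x0, J_inv=None, tol=1e-10, max_iter=200):
    """'Bad' Broyden's method: the inverse Jacobian approximation gets
    the rank-one update
        J_inv <- J_inv + (dx - J_inv df) df^T / (df^T df),
    which needs one matrix-vector product per iteration instead of two."""
    x = np.asarray(x0, dtype=float)
    J_inv = np.eye(x.size) if J_inv is None else J_inv
    f = F(x)
    for _ in range(max_iter):
        dx = -J_inv @ f
        x = x + dx
        f_new = F(x)
        if np.linalg.norm(f_new) < tol:
            break
        df, f = f_new - f, f_new
        # df @ df vanishing would mean f stopped changing entirely;
        # a production version should guard against that explicitly.
        J_inv = J_inv + np.outer(dx - J_inv @ df, df) / (df @ df)
    return x
```

In one dimension both variants collapse to the secant method, so any iteration-count difference only shows up on genuinely multivariate problems.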
I think the optimization in the references I linked above is more than just Sherman-Morrison, right? It doesn't just store and update J^{-1}; it stores and updates J^{-1} F, where F is the vector-valued function being optimized and J is its Jacobian matrix (sorry if I chose bad notation). So this is a vector and therefore much smaller.
At this point, benchmarking isn't going to tell us very much, because the examples are so small. Would it be helpful to make bigger ones?
I think they are using the notation J^{-1} F to denote the inverse Jacobian of the vector-valued function F, not a matrix multiplication between J^{-1} and F.
Yes, larger examples would be very helpful for benchmarking, especially since that's where Broyden's method dominates.
Sorry, it's my bad notation. In their notation, s is an approximation of -JF(x)^{-1} \cdot F(x), where x is the current approximate solution, F(x) is the function we're trying to find the zero of, and JF(x) is the Jacobian of F at x. So s is a vector. In the pseudocode in Section 5, all the quantities are scalars or vectors.
Ah, I understand what you mean now. I am still going to do that optimization; if you look at the paragraph above equation 2.6 in my second link ("A faster Broyden method"), you'll see an explanation of what they did. I am only pushing one optimization at a time, and I still need to hold the Jacobian constant for k iterations.
I'll push each optimization today (probably before noon) and link them all below.
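For the record, here's roughly what holding the Jacobian constant for k iterations looks like. This is a sketch with a hypothetical finite-difference helper (`fd_jacobian`), not the code I'll push; the committed version works on torch tensors:

```python
import numpy as np

def fd_jacobian(F, x, eps=1e-7):
    """Hypothetical helper: dense finite-difference Jacobian of F at x."""
    fx = F(x)
    J = np.empty((x.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (F(x + e) - fx) / eps
    return J

def newton_frozen_jacobian(F, x0, k=5, tol=1e-10, max_iter=100):
    """Refresh the Jacobian only every k iterations; in between, reuse
    the same inverse, trading accuracy per step for a cheaper step."""
    x = np.asarray(x0, dtype=float)
    J_inv = np.linalg.inv(fd_jacobian(F, x))
    for i in range(max_iter):
        fx = F(x)
        if np.linalg.norm(fx) < tol:
            break
        if i > 0 and i % k == 0:
            J_inv = np.linalg.inv(fd_jacobian(F, x))  # periodic refresh
        x = x - J_inv @ fx
    return x
```

Between refreshes convergence is only linear, but each step avoids both a Jacobian evaluation and a solve.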
I added a bigger example in #49. You can see in this example how sparse the tensors can sometimes be -- something to think about in the future. The current sum_product.py seems to handle this example pretty quickly (it's a very small grammar taken from an NLP homework), so we can make much bigger examples if desired.
In the Nederhof and Satta paper, section 5, it seems they do not actually compute J (the Jacobian of F), but directly approximate the vector J^{-1} F, which would save a ton of memory if there are a lot of unknowns. By contrast, it seems that the current code does compute J.
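To make the contrast concrete, here's a toy sketch of the matrix-free idea: approximate the Newton step s ≈ -J^{-1} F(x) using only Jacobian-vector products, never materializing J. This is not the algorithm from the paper, only a generic illustration of the memory savings; the inner solve is a plain Richardson iteration, which assumes J is close to the identity:

```python
import numpy as np

def matrix_free_newton(F, x0, tol=1e-8, outer_iters=50, inner_iters=50, eps=1e-7):
    """Newton-type solver that never forms the Jacobian J. The step
    s ~ -J^{-1} F(x) is found using only Jacobian-vector products
    J v ~ (F(x + h v) - F(x)) / h, so storage stays O(n).
    The inner solve is Richardson iteration, which converges only if
    ||I - J|| < 1 -- a strong assumption made for brevity; a real
    implementation would use a Krylov method such as GMRES."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer_iters):
        fx = F(x)
        if np.linalg.norm(fx) < tol:
            break
        def Jv(v):
            nv = np.linalg.norm(v)
            if nv == 0.0:
                return np.zeros_like(v)
            h = eps / nv                 # keep the perturbation size fixed
            return (F(x + h * v) - fx) / h
        s = -fx                          # initial guess for the step
        for _ in range(inner_iters):
            s = s - (Jv(s) + fx)         # Richardson: drive J s + F(x) -> 0
        x = x + s
    return x
```

For n unknowns this stores only a few length-n vectors, versus the n-by-n matrix the current code builds.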