Closed: davidweichiang closed this issue 2 years ago.
N&S refers to: https://archive.siam.org/books/textbooks/fr16_book.pdf, pages 126-127.
In the pull request for Broyden's method, I made the following comment about optimization:
Also, I still need to optimize Broyden's method, including the addition of Jacobian "inversion" through LU decomposition (or QR decomposition) and benchmarking the "bad" Broyden's method and the Sherman-Morrison formula.
I spent today working on Newton's method, but I shifted toward optimizing Broyden's method when I noticed the issue you created about the Jacobian on GitHub. Indeed, the Jacobian inverse should be updated with each iteration of Broyden's method, but that requires the Sherman-Morrison formula mentioned in the pull request. I quickly implemented that formula for the Jacobian inverse and verified convergence of the algorithm. Note that the performance remained almost identical when using the Sherman-Morrison formula compared to implicitly inverting the Jacobian through torch.solve. I would imagine that the small size of the Jacobian matrices used throughout unit testing makes the performance difference insignificant.
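For concreteness, here's a minimal NumPy sketch of what I mean by the "good" Broyden update applied directly to the inverse Jacobian via Sherman-Morrison. This is illustrative code of my own (the name `broyden_good` and all parameters are made up), not the committed PyTorch implementation:

```python
import numpy as np

def broyden_good(F, x0, J_inv=None, tol=1e-10, max_iter=100):
    """'Good' Broyden's method, maintaining an approximation of the
    INVERSE Jacobian via the Sherman-Morrison rank-one update, so no
    linear system is solved inside the loop."""
    x = np.asarray(x0, dtype=float)
    # Crude initial guess for J(x0)^{-1}; a finite-difference Jacobian
    # would give a better starting point.
    J_inv = np.eye(x.size) if J_inv is None else J_inv
    f = F(x)
    for _ in range(max_iter):
        dx = -J_inv @ f              # quasi-Newton step
        x = x + dx
        f_new = F(x)
        if np.linalg.norm(f_new) < tol:
            break
        df, f = f_new - f, f_new
        # Sherman-Morrison update of the inverse:
        # J_inv <- J_inv + (dx - J_inv df) (dx^T J_inv) / (dx^T J_inv df)
        u = J_inv @ df
        denom = dx @ u
        if abs(denom) < 1e-30:       # update undefined; bail out in this sketch
            break
        J_inv = J_inv + np.outer(dx - u, dx @ J_inv) / denom
    return x
```

The point of the update is that each iteration costs O(n^2) instead of the O(n^3) of a fresh solve, which is why the difference doesn't show up on the tiny Jacobians in the unit tests.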
In addition to the Sherman-Morrison formula, I also implemented the "bad" Broyden's method. That method should be roughly twice as fast per iteration as the "good" Broyden's method with the Sherman-Morrison formula (2n^2 vs. 4n^2 operations). Note that the "good" method minimizes the change between successive Jacobian approximations, whereas the "bad" method minimizes the change between successive inverse Jacobian approximations. For that reason, some people (including Broyden himself) argue that you should only use the "good" method (see http://paulklein.se/newsite/teaching/broyden.pdf), while others argue for the "bad" method (see "A faster Broyden method"). In my testing, however, the "bad" method took more iterations to converge and required almost twice the runtime, contradicting the claim of double the performance. I don't know yet what caused that discrepancy, but I'll continue investigating. In the meantime, I'll commit the updated "good" Broyden's method using the Sherman-Morrison formula.
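For comparison, here's the corresponding "bad" update, again as my own sketch rather than the committed code. The update touches `J_inv` with a single matrix-vector product per iteration, which is where the 2n^2 vs. 4n^2 count comes from:

```python
import numpy as np

def broyden_bad(F, x0, J_inv=None, tol=1e-10, max_iter=200):
    """'Bad' Broyden's method: the inverse Jacobian approximation gets
    the rank-one update
        J_inv <- J_inv + (dx - J_inv df) df^T / (df^T df),
    which needs one matrix-vector product per iteration instead of two."""
    x = np.asarray(x0, dtype=float)
    J_inv = np.eye(x.size) if J_inv is None else J_inv
    f = F(x)
    for _ in range(max_iter):
        dx = -J_inv @ f
        x = x + dx
        f_new = F(x)
        if np.linalg.norm(f_new) < tol:
            break
        df, f = f_new - f, f_new
        # df @ df vanishing would mean f stopped changing entirely;
        # a production version should guard against that explicitly.
        J_inv = J_inv + np.outer(dx - J_inv @ df, df) / (df @ df)
    return x
```

In one dimension both variants collapse to the secant method, so any iteration-count difference only shows up on genuinely multivariate problems.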
I think the optimization in the references I linked above is more than just Sherman-Morrison, right? It doesn't just store and update J^{-1}; it stores and updates J^{-1} F, where F is the vector-valued function being optimized and J is its Jacobian matrix (sorry if I chose bad notation). So this is a vector and therefore much smaller.
At this point, benchmarking isn't going to tell us very much, because the examples are so small. Would it be helpful to make bigger ones?
I think they are using the notation J^{-1} F to denote the inverse Jacobian of the vector-valued function F, not a matrix multiplication between J^{-1} and F.
Yes, larger examples would be very helpful for benchmarking, especially since that's where Broyden's method dominates.
Sorry, it's my bad notation. In their notation, s is an approximation of -JF(x)^{-1} \cdot F(x), where x is the current approximate solution, F(x) is the function we're trying to find the zero of, and JF(x) is the Jacobian of F at x. So s is a vector. In the pseudocode in Section 5, all the quantities are scalars or vectors.
Ah, I understand what you mean now. I am still going to do that optimization; if you look at the paragraph above equation 2.6 in my second link ("A faster Broyden method"), you'll see an explanation of what they did. I am only pushing one optimization at a time, and I still need to hold the Jacobian constant for k iterations.
I'll push each optimization today (probably before noon) and link them all below.
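For the record, here's roughly what holding the Jacobian constant for k iterations looks like. This is a sketch with a hypothetical finite-difference helper (`fd_jacobian`), not the code I'll push; the committed version works on torch tensors:

```python
import numpy as np

def fd_jacobian(F, x, eps=1e-7):
    """Hypothetical helper: dense finite-difference Jacobian of F at x."""
    fx = F(x)
    J = np.empty((x.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (F(x + e) - fx) / eps
    return J

def newton_frozen_jacobian(F, x0, k=5, tol=1e-10, max_iter=100):
    """Refresh the Jacobian only every k iterations; in between, reuse
    the same inverse, trading accuracy per step for a cheaper step."""
    x = np.asarray(x0, dtype=float)
    J_inv = np.linalg.inv(fd_jacobian(F, x))
    for i in range(max_iter):
        fx = F(x)
        if np.linalg.norm(fx) < tol:
            break
        if i > 0 and i % k == 0:
            J_inv = np.linalg.inv(fd_jacobian(F, x))  # periodic refresh
        x = x - J_inv @ fx
    return x
```

Between refreshes convergence is only linear, but each step avoids both a Jacobian evaluation and a solve.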
I added a bigger example in #49. You can see in this example how sparse the tensors can sometimes be -- something to think about in the future. The current sum_product.py seems to handle this example pretty quickly (it's a very small grammar taken from an NLP homework), so we can make much bigger examples if desired.
In the Nederhof and Satta paper, section 5, it seems they do not actually compute J (the Jacobian of F), but directly approximate the vector J^{-1} F, which would save a ton of memory if there are a lot of unknowns. By contrast, it seems that the current code does compute J.
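To make the contrast concrete, here's a toy sketch of the matrix-free idea: approximate the Newton step s ≈ -J^{-1} F(x) using only Jacobian-vector products, never materializing J. This is not the algorithm from the paper, only a generic illustration of the memory savings; the inner solve is a plain Richardson iteration, which assumes J is close to the identity:

```python
import numpy as np

def matrix_free_newton(F, x0, tol=1e-8, outer_iters=50, inner_iters=50, eps=1e-7):
    """Newton-type solver that never forms the Jacobian J. The step
    s ~ -J^{-1} F(x) is found using only Jacobian-vector products
    J v ~ (F(x + h v) - F(x)) / h, so storage stays O(n).
    The inner solve is Richardson iteration, which converges only if
    ||I - J|| < 1 -- a strong assumption made for brevity; a real
    implementation would use a Krylov method such as GMRES."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer_iters):
        fx = F(x)
        if np.linalg.norm(fx) < tol:
            break
        def Jv(v):
            nv = np.linalg.norm(v)
            if nv == 0.0:
                return np.zeros_like(v)
            h = eps / nv                 # keep the perturbation size fixed
            return (F(x + h * v) - fx) / h
        s = -fx                          # initial guess for the step
        for _ in range(inner_iters):
            s = s - (Jv(s) + fx)         # Richardson: drive J s + F(x) -> 0
        x = x + s
    return x
```

For n unknowns this stores only a few length-n vectors, versus the n-by-n matrix the current code builds.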