PSLmodels / tax-microdata-benchmarking

A project to develop a benchmarked general-purpose dataset for tax reform impact analysis.
https://pslmodels.github.io/tax-microdata-benchmarking/

Proposed approach to jointly minimizing target differences and weight changes #192

Closed by donboyd5 2 months ago

donboyd5 commented 2 months ago

@martinholmer Seeking your review. Not appropriate for merging but rather for discussion. Please see create_area_weights_lbfgsb_jvp.py -- other files in the PR are extraneous. You can run create_area_weights_lbfgsb_jvp.py on aa, bb, zz in the usual way.

This approach solves for the ratios of new weights to original weights (x) by minimizing an objective function with two terms: (1) sum of squared proportionate differences of calculated values (Ax) from targets (b) (in the Ax = b nomenclature), and (2) sum of squared differences of x values from 1. The importance placed on minimizing weight differences as opposed to minimizing target differences is controlled by a regularization parameter, lambda (REGULARIZATION_LAMBDA in the code); lambda should be nonnegative. It is the approach I had hoped we would get to in Phase 3.
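For concreteness, here is a minimal sketch of that objective in JAX. The names (A, b, x, REGULARIZATION_LAMBDA) follow the description above; the actual code in create_area_weights_lbfgsb_jvp.py may be organized differently.

```python
import jax.numpy as jnp

REGULARIZATION_LAMBDA = 5e-7  # lambda value discussed later in this comment

def objective(x, A, b, lam=REGULARIZATION_LAMBDA):
    # term 1: sum of squared proportionate differences of Ax from the targets b
    target_diff = (A @ x - b) / b
    # term 2: sum of squared differences of the weight ratios x from 1
    weight_diff = x - 1.0
    return jnp.sum(target_diff**2) + lam * jnp.sum(weight_diff**2)
```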

It uses L-BFGS-B for speed and memory efficiency, JAX autodiff for speed, sparse arrays for memory efficiency and speed, and your matrix scaling for numerical stability. It is quite fast (although nothing will be as fast as lsq_linear without regularization); it is certainly far faster than stochastic gradient descent, which in any case did not minimize weight differences. Adding weight-difference minimization adds a lot of time because specialized solvers (e.g., lsq_linear) either are no longer applicable or get bogged down by the vast increase in problem size. JAX can use GPUs, but I have not been taking advantage of that; my guess is that, given the current CPU speed, GPU overhead might wipe out any gains unless our problems get a lot harder. There are a few other things we might do to speed it up, but I think we're close to as fast as we can get. (Some early tests suggest that @jax.jit can reduce run time by 10-20% and might save more on especially challenging problems.)
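A hedged sketch of how those pieces can fit together: SciPy's L-BFGS-B driven by a JAX value-and-gradient function, with optional jax.jit. For brevity this uses dense arrays and omits the matrix scaling; the bounds, starting point, and option values are illustrative, not necessarily those used in the PR.

```python
import numpy as np
import jax
import jax.numpy as jnp
from scipy.optimize import minimize

def solve_weight_ratios(A, b, lam=5e-7, ftol=1e-7):
    A = jnp.asarray(A)
    b = jnp.asarray(b)

    def objective(x):
        target_diff = (A @ x - b) / b
        return jnp.sum(target_diff**2) + lam * jnp.sum((x - 1.0) ** 2)

    # JAX autodiff supplies an exact gradient; jit is optional but can
    # shave some run time, as noted above.
    obj_and_grad = jax.jit(jax.value_and_grad(objective))

    def fun(x):
        val, grad = obj_and_grad(jnp.asarray(x))
        return float(val), np.asarray(grad, dtype=np.float64)

    x0 = np.ones(A.shape[1])  # start at ratios of 1, i.e., no weight change
    return minimize(
        fun,
        x0,
        jac=True,  # fun returns (value, gradient)
        method="L-BFGS-B",
        bounds=[(0.0, None)] * A.shape[1],  # illustrative: keep ratios nonnegative
        options={"ftol": ftol, "maxiter": 5000},
    )
```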

The coding is not up to your standards; I'm sorry about that. I was trying to move quickly and took some shortcuts, and I may have created an object or two that weren't necessary, although I did try to make it efficient in other ways. It's going to be a while before our needs settle down. I think this could be our main approach, assuming we want to jointly minimize the two differences noted above. Nonetheless, I think we should keep the other methods available as well. If needed, there are other possible approaches mentioned in the Google doc, but I am optimistic that this one will do the trick. It could be useful to give each method its own .py file and either source or modularize them, so that we can easily pick any of the methods we have developed. I didn't want to take that on unless you wanted it, though.

I added additional statistics and reporting, some of which we may want to drop as we move toward production. We certainly can drop or pare back the display of iterations. However, I think two additions are important: (1) the sums of pre- and post-optimization weights (as mentioned elsewhere, we might want these two to be equal), and (2) the quantiles of percentage differences of calculated target values from desired target values. It doesn't matter much now, when we are doing 8-16 targets, but if we have several hundred targets we'll want summary measures of how we did against them.
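Roughly, those two summary reports could look like the following (the names weights_pre, weights_post, calculated, and targets are placeholders, not the actual variable names in the script):

```python
import numpy as np

def summarize(weights_pre, weights_post, calculated, targets):
    # (1) total weight before and after optimization (ideally similar or equal)
    print(f"sum of weights: pre={weights_pre.sum():,.1f}  post={weights_post.sum():,.1f}")
    # (2) quantiles of percentage differences of calculated values from targets
    pct_diff = 100.0 * (calculated - targets) / targets
    qs = [0.0, 0.10, 0.25, 0.50, 0.75, 0.90, 1.0]
    for q, v in zip(qs, np.quantile(pct_diff, qs)):
        print(f"  p{int(q * 100):3d}: {v:9.4f}%")
```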

As I mentioned in a previous PR and the google doc, I think our test problems are VERY hard (even though they are very small), because the proportionate-weight-adjusted initial values are sometimes many multiples of the targets, and other times quite close to targeted values, so some weights have to change quite a bit to hit targets while others do not or must change a lot in the other direction. I think much larger real-world problems probably will solve more quickly.

I copy below the results for aa, using parameters I consider reasonable but judgmentally set. As I mentioned in an earlier note, a little bit of regularization goes a long way. I set REGULARIZATION_LAMBDA to 5e-7 because at larger values we miss the targets by too much (in my opinion) while forcing weight ratios to be closer to 1 than we might feel necessary. I set OPTIMIZATION_FTOL to 1e-7. A smaller ftol might be nice, but it takes more time; 1e-7 definitely leads to a big improvement in results vs. 1e-6, but I am not sure (yet) whether going smaller than 1e-7 is worth the extra time. aa took 43 iterations and 3.6 seconds on my 3-year-old machine; bb took 8.3 seconds; and zz, where the initial value for the 3rd target is 86 times the desired target value, took 28 seconds.

[image: optimization results for area aa]

**NOTE:** An earlier version of this comment said that it also used Jacobian-vector products for memory efficiency, but that pertained only to earlier draft code in which I was examining other solvers. L-BFGS-B has its own memory-conserving approximation (the "L" stands for limited memory), and JVPs are not needed or used in the code I uploaded.

martinholmer commented 2 months ago

The approach in tmd/areas/create_area_weights_lbfgsb_jvp.py has been implemented in the create_area_weights.py script in PR #194, which has been merged into the master branch.