PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Rewrite Stage 2 in Julia #348

Open chusloj opened 4 years ago

chusloj commented 4 years ago

There is a replicability issue with using the CVXOPT solver in Python to calculate the PUF and CPS weights: the md5 hash of the weight file changes every time the solver is run. Because most of the commonly used LP solvers do not have clean Python APIs, the stage2 and solve_lp_for_year scripts should be rewritten in Julia. Julia has a clean optimization and modeling interface called JuMP (sketched at the end of this comment) that can be used with any LP solver that has a Julia implementation.

There are a few reasons why using Julia would be advantageous over Python for the solver stage:

  1. Julia has a pandas interface, so refactoring the data-processing portions of the code shouldn't be too time-consuming.
  2. With JuMP, any new solver can be used with the same code because JuMP is solver-agnostic.
  3. Julia is a (just-in-time) compiled language, so it might have a speed advantage over Python.

Writing the Julia version should take substantially less time than tracking down the md5 replicability bug.
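
For concreteness, here is a minimal sketch of what a solver-agnostic stage 2 LP could look like in JuMP. The formulation (deviation variables `r` and `s`, a target matrix `A`, target totals `b`, a tolerance `tol`) and the choice of Clp are illustrative assumptions of mine, not the actual stage2/solve_lp_for_year formulation:

```julia
# Minimal sketch of a solver-agnostic stage 2 LP in JuMP (names are placeholders).
# A[i, j] is record j's contribution to target i, b[i] is the target total,
# w0 holds the starting weights, and tol is the allowed deviation from each target.
using JuMP, Clp

function solve_weights(A::Matrix{Float64}, b::Vector{Float64},
                       w0::Vector{Float64}, tol::Float64)
    n = length(w0)
    model = Model(Clp.Optimizer)   # any LP solver with a Julia wrapper works here

    # r and s are the positive and negative parts of each record's
    # weight-adjustment factor; minimizing their sum penalizes |change|.
    @variable(model, r[1:n] >= 0)
    @variable(model, s[1:n] >= 0)
    @objective(model, Min, sum(r) + sum(s))

    # New weight for record j is w0[j] * (1 + r[j] - s[j]);
    # every target must be hit within the tolerance.
    w = @expression(model, [j = 1:n], w0[j] * (1 + r[j] - s[j]))
    @constraint(model, [i = 1:size(A, 1)],
                (1 - tol) * b[i] <= sum(A[i, j] * w[j] for j in 1:n) <= (1 + tol) * b[i])

    optimize!(model)
    return value.(w)
end
```

Because the model is written against JuMP rather than a specific solver, swapping Clp for Cbc, GLPK, or a commercial solver would only change the `Model(...)` line.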

andersonfrailey commented 4 years ago

Just to add a little more information about the replicability issue: @chusloj and I have been going back and forth about how the cps_weights.csv.gz file changes in PR #343. As part of our investigation into why that was happening, we each re-ran the CPS stage 2 scripts multiple times and could never get the same md5 hash. I even compared the hashes of multiple weight files I created on my machine off the master branch, and each one was different.

@chusloj did some comparisons, and it seems like the differences between files are very small, but they're different nonetheless. Given all of this, and the fact that the PuLP solver we currently use for the PUF takes six to eight hours to run, I'm in favor of converting all of the stage 2 scripts to Julia and, if the results look good, making that our permanent solution.

EDIT: the file differences were small in the sense that only a small percentage of records were affected, but the records that were affected often differed by large amounts.
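
For reference, the kind of file-to-file comparison described above takes only a few lines of Julia; a sketch, assuming the two weight files have been decompressed to CSVs with identical layouts (the file names are placeholders):

```julia
# Sketch: compare two stage 2 weight files produced by separate runs.
# File names are placeholders; both files are assumed to have the same
# columns (one weight column per year) and the same row order.
using CSV, DataFrames

run1 = CSV.read("cps_weights_run1.csv", DataFrame)
run2 = CSV.read("cps_weights_run2.csv", DataFrame)

for col in names(run1)
    diff = abs.(run2[!, col] .- run1[!, col])
    reldiff = diff ./ max.(abs.(run1[!, col]), 1e-9)  # guard against zero weights
    n_changed = count(x -> x > 0, diff)
    println(col, ": ", n_changed, " records differ; max relative diff = ",
            maximum(reldiff))
end
```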

donboyd5 commented 4 years ago

I have been running stage2.py on the PUF (although I still need to understand #352), which led me to some thoughts about your rewrite of stage 2. I think these comments (if on target) are relevant regardless of whether the rewrite is in Julia or Python.

As I watched stage2.py solve, I saw that each new year seemed to take more iterations than the year before, generally resulted in a higher objective function value, and often took a bit more time, although that was not always true.

This led me to look at the code in stage2.py and solve_lp_for_year.py. If I read it properly (I'm not 100% certain of this), the penalty function in all cases is based on how different the new weights are from the 2011 weights -- that is, the 2012 penalty is based on the difference between the 2011 weights and the solved-for 2012 weights, ..., and the 2030 penalty is based on the difference between the 2011 weights and the solved-for 2030 weights.
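
In symbols, my reading of the current objective for a solve year t is roughly the following (a sketch of how I read the code, with r and s the non-negative deviation variables, not a quote of the actual scripts):

```latex
\min_{r_t,\, s_t \ge 0} \ \sum_j \left( r_{t,j} + s_{t,j} \right)
\quad \text{s.t.} \quad
w_{t,j} = w_{2011,j} \left( 1 + r_{t,j} - s_{t,j} \right),
\ \text{year-$t$ targets hit within tolerance}
```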

If this is really what is happening, I am not sure it is the best approach, either from a conceptual standpoint or from the standpoint of computational effort.

From a conceptual perspective, it seems to me that we would generally expect the distribution of returns to look more like the prior year's than that of many years ago, and we might not be surprised if some kinds of returns had far lower weights in later years than in earlier years. That is probably why the objective function gets so much larger in later years: the weights have to move very far from the initial values. This suggests to me that we might rather penalize changes from the immediately prior year than from the initial year.

Doing that would no doubt make the solution easier. A lot of the hard work would be done solving for weights that achieve the distributional targets in 2012, and once those are hit, the targets for each later year might be relatively close to the previous year's.

On the other hand, some individual interim years might be oddballs, and we might not want to penalize changes from such a year. For example, a year in which high-income taxpayers accelerated deductions might be an oddball year, and we might not want to penalize large changes in the next year relative to that oddball year -- but whether that means it is best to penalize changes from the initial year of 2011 is an open question.

Perhaps this was behind JCT's thinking. If you look at pages 52-55 of the following document:

Staff of the Joint Committee on Taxation. “Estimating Changes in the Federal Individual Income Tax: Description of the Individual Tax Model.” Joint Committee on Taxation, April 23, 2015. https://www.jct.gov/publications.html?func=download&id=4776&chk=4776&no_html=1.

you will see that they penalize differences from both the initial-year weights and the previous-year weights, downweighting the importance of the initial year as later years are solved, and upweighting the importance of the previous year.
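
As a rough sketch of how that blended penalty could be expressed in the proposed JuMP rewrite -- the mixing weight `lambda` and the way the two deviations are combined are placeholders of mine, not JCT's actual specification:

```julia
# Sketch of a JCT-style blended penalty for a given solve year, in JuMP syntax.
# w0    : initial-year (2011) weights
# wprev : weights solved for the previous year
# lambda in [0, 1] shifts the penalty from the initial year toward the
# previous year as the solve year moves away from 2011 (schedule is a placeholder).
using JuMP, Clp

function blended_penalty_model(w0::Vector{Float64}, wprev::Vector{Float64},
                               lambda::Float64)
    n = length(w0)
    model = Model(Clp.Optimizer)

    # Deviations from the initial-year weights ...
    @variable(model, r0[1:n] >= 0)
    @variable(model, s0[1:n] >= 0)
    # ... and from the previous-year weights.
    @variable(model, rp[1:n] >= 0)
    @variable(model, sp[1:n] >= 0)

    # One set of solved weights, expressed relative to both baselines.
    @variable(model, w[1:n] >= 0)
    @constraint(model, [j = 1:n], w[j] == w0[j]    * (1 + r0[j] - s0[j]))
    @constraint(model, [j = 1:n], w[j] == wprev[j] * (1 + rp[j] - sp[j]))

    # Downweight the initial year and upweight the previous year via lambda.
    @objective(model, Min,
               (1 - lambda) * (sum(r0) + sum(s0)) +
               lambda       * (sum(rp) + sum(sp)))

    # Year-specific target constraints would be added here before optimize!().
    return model
end
```

A `lambda` schedule that grows toward 1 as the solve year moves further from 2011 would downweight the initial year and upweight the previous year, as described above.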

Whether that is worth the extra effort, I don't know. It might be worth talking to someone at JCT about this.

There are a few other issues you might want to consider:

Anyway, those are my thoughts based on trying to run taxdata today.

Don