chusloj opened this issue 4 years ago
Just to add a little more information about the replicability issue: @chusloj and I have been going back and forth about how the `cps_weights.csv.gz` file changes in PR #343. As part of our investigation into why that was happening, we each re-ran the CPS stage 2 scripts multiple times and could never get the same `md5` hash. I even compared the hashes of multiple weights files I created on my machine off the master branch, and each one was different.
@chusloj did some comparisons, and it seems like the differences between files are very small, but they are different nonetheless. Given all of this, and the fact that the `pulp` solver we currently use for the PUF takes six to eight hours to run, I'm in favor of converting all of the stage 2 scripts to Julia and, if the results look good, making that our permanent solution.
EDIT: the file differences were small in the sense that only a small percentage of records were affected, but the affected records often had large differences between the two runs.
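For anyone who wants to reproduce that kind of comparison, a minimal sketch along these lines would do it; the file names and the weight column (`WT2014`) are hypothetical placeholders, not the actual taxdata layout:

```julia
# Hedged sketch: compare the same weight column across two stage 2 runs.
# File names and the column name WT2014 are illustrative only.
using CSV, DataFrames, Statistics

run1 = CSV.read("cps_weights_run1.csv", DataFrame)
run2 = CSV.read("cps_weights_run2.csv", DataFrame)

# Relative difference per record, guarding against division by zero.
rel_diff = abs.(run1.WT2014 .- run2.WT2014) ./ max.(abs.(run1.WT2014), 1.0)

println("share of records that differ:  ", mean(rel_diff .> 0))
println("largest relative difference:   ", maximum(rel_diff))
```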
I have been running `stage2.py` on the puf (although I still need to understand #352), which led me to some thoughts about your rewrite of stage 2. I think the comments (if on target) are relevant regardless of whether it is written in Julia or Python.
As I watched `stage2.py` solve, I saw that each new year seemed to take more iterations than the year before, generally resulted in a higher objective function value, and often took a bit more time, although that was not always true.
This led me to look at the code in `stage2.py` and `solve_lp_for_year.py`. If I read it properly (I am not 100% certain of this), I think the penalty function in all cases is based on how different the new weights are from the 2011 weights -- that is, the 2012 penalty is based on the difference between the 2011 weights and the solved-for 2012 weights, ..., and the 2030 penalty is based on the difference between the 2011 weights and the solved-for 2030 weights.
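To make that concrete, here is a minimal sketch of the structure as I read it, written in JuMP since that is the direction proposed above. The percentage-deviation variables `r` and `s`, the target matrix `A`, and the choice of `Clp` as the solver are my assumptions for illustration, not a transcription of `solve_lp_for_year.py`.

```julia
# Rough sketch of one year's LP as I read it: r and s are upward/downward
# percentage deviations from the 2011 weights, the constraints hit that
# year's aggregate targets, and the objective penalizes total deviation.
# A, targets, and w2011 stand in for whatever stage 2 actually builds.
using JuMP, Clp

function solve_year(w2011::Vector{Float64}, A::Matrix{Float64}, targets::Vector{Float64})
    n = length(w2011)
    model = Model(Clp.Optimizer)

    @variable(model, r[1:n] >= 0)   # upward % deviation from the 2011 weight
    @variable(model, s[1:n] >= 0)   # downward % deviation from the 2011 weight

    # New weight for record i, expressed relative to its 2011 weight.
    @expression(model, w_new[i = 1:n], w2011[i] * (1 + r[i] - s[i]))

    # Hit the year's aggregate targets.
    @constraint(model, [j = 1:size(A, 1)],
        sum(A[j, i] * w_new[i] for i in 1:n) == targets[j])

    # Penalize total deviation from the 2011 weights.
    @objective(model, Min, sum(r) + sum(s))

    optimize!(model)
    return value.(w_new)
end
```

In this formulation the anchor is always `w2011`; penalizing changes from the prior year instead would simply mean passing in the previous year's solved weights as the reference vector.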
If this is really what is happening, I am not sure it is the best approach, either from a conceptual standpoint or from the standpoint of computational effort.
From a conceptual perspective, it seems to me that we would generally expect the distribution of returns to look more like that of the prior year than that of many years ago, and we might not be surprised if some kinds of returns had far lower weights in later years than in earlier years. That is probably why the objective function gets so much larger in later years: we have to change the weights substantially from their initial values. This suggests to me that we might rather penalize changes from the immediately prior year than from the initial year.
Doing that would no doubt make the solution easier. A lot of hard work might go into solving for weights that achieve the distributional targets in 2012, but once those are hit, the targets for each later year might be relatively close to those of the previous year.
On the other hand, some individual interim years might be oddballs, and we might not want to penalize changes from such a year. For example, a year in which high-income taxpayers accelerated deductions might be an oddball year, and we would not necessarily want to penalize large changes in the next year relative to that oddball year -- but whether that means it is best to penalize changes from the initial year of 2011 is an open question.
Perhaps this was behind JCT's thinking. If you look at pages 52-55 of the following document:
Staff of the Joint Committee on Taxation. “Estimating Changes in the Federal Individual Income Tax: Description of the Individual Tax Model.” Joint Committee on Taxation, April 23, 2015. https://www.jct.gov/publications.html?func=download&id=4776&chk=4776&no_html=1.
you will see that they penalize differences from both the initial-year weights and the previous-year weights, downweighting the importance of the initial year as later years are solved, and upweighting the importance of the previous year.
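In symbols, one way to write that kind of blended penalty (my notation, and not necessarily JCT's exact functional form) is:

$$
\min_{w_t}\;\; \lambda_t \sum_i \left|\frac{w_{t,i} - w_{2011,i}}{w_{2011,i}}\right|
\;+\; \left(1 - \lambda_t\right) \sum_i \left|\frac{w_{t,i} - w_{t-1,i}}{w_{t-1,i}}\right|
\quad \text{subject to the year-}t\text{ targets,}
$$

with $\lambda_t$ near 1 for the first solved year and declining toward 0 in later years, so the initial-year anchor fades while the previous-year anchor takes over. In an LP implementation this would mean two sets of positive/negative deviation variables, one measured against the 2011 weights and one against the prior year's solved weights.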
Whether that is worth the extra effort, I don't know. It might be worth talking to someone at JCT about this.
There are a few other issues you might want to consider:
Anyway, those are my thoughts based on trying to run taxdata today.
Don
There is a replicability issue when using the `CVXOPT` solver in Python to calculate the PUF and CPS weights - the `md5` hash for the weight file changes every time the solver is run. Because most of the commonly used LP solvers do not have clean APIs for Python, the `stage2` and `solve_lp_for_year` scripts should be re-written in Julia. The language has a clean optimization & modeling interface called `JuMP` that can be used for any LP optimization model that has a Julia implementation.

There are a few reasons why using Julia would be advantageous to Python for the solver stage:

- Julia has a `pandas` interface, so refactoring the data processing portions of the code shouldn't be too time consuming.
- With `JuMP`, any new solver can be used with the same code because `JuMP` is model-agnostic (see the sketch below).

Coding the Julia version of the code should take substantially less time than finding the `md5` replicability bug.
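To illustrate the model-agnostic point, here is a toy `JuMP` model in which switching solvers is a one-argument change; `GLPK` and `Clp` are just two open-source examples, not a recommendation for which backend taxdata should use:

```julia
# Minimal illustration of JuMP's solver independence: the model code is
# identical regardless of which LP solver backs it; only the optimizer
# passed to Model() changes.
using JuMP, GLPK, Clp

function tiny_lp(optimizer)
    model = Model(optimizer)
    @variable(model, 0 <= x <= 4)
    @variable(model, 0 <= y <= 4)
    @constraint(model, x + y <= 5)
    @objective(model, Max, 3x + 2y)
    optimize!(model)
    return objective_value(model)
end

# Same model, two different solvers -- swapping them is a one-argument change.
println(tiny_lp(GLPK.Optimizer))
println(tiny_lp(Clp.Optimizer))
```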