This change addresses two challenges I was finding with `lsq_linear`: (1) the new weights sometimes appeared implausibly large, e.g., thousands of times as large as the original, and (2) my efforts to constrain weights by setting upper bounds on new weights were making the problem difficult for `lsq_linear` to solve, taking thousands of iterations and several minutes and still leaving large differences from targets, whereas problems without upper bounds were solving in a dozen iterations and less than a second.
It is common in reweighting efforts to solve for the ratio of new weights to original weights, rather than for the new weights directly (even though in concept the two can lead to the same result), for at least two reasons: (1) many reweighting efforts (e.g., JCT and taxdata) seek to minimize changes in weights and thus penalize weight changes by penalizing the ratio of new to original weights, which makes sense for national efforts although less so when constructing subnational files from a national file, and (2) the problem often seems more stable numerically when the x variable being solved for is centered near 1 (a ratio) rather than ranging from close to zero to possibly many thousands (a weight).
This PR:

- Changes the problem so that `lsq_linear` solves for the ratio rather than the new weight. It achieves this by multiplying the columns of the coefficient matrix (`variable_matrix.T`, the A matrix in Ax = b nomenclature) by the original weight (`wght`) before transposing. The A matrix passed to `lsq_linear` is `(variable_matrix * wght[:, np.newaxis]).T`, so the x being solved for is the ratio of the new weight to the original (see the sketch after this list).
- Changes the lower and upper bound calculations to be bounds on the ratio, and sets them, for discussion purposes, at 0.01 and 10.0; that can be changed. What those bounds should be depends, I suppose, on how much we think the distribution of taxpayers in a subnational area can plausibly differ from the national average. I am concerned that if we don't constrain these changes, and are only trying to hit a few handfuls of targets, allowing large changes in weights will lead to large unintended consequences for variables we don't (and cannot practically) target.
- Adds optional reporting on quantiles of original weights, new weights, and the resulting ratios.
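For concreteness, here is a minimal sketch of the ratio formulation described above. `variable_matrix` and `wght` are the names used in this PR; `targets` and the synthetic data are placeholders I have made up for illustration, and the 0.01/10.0 bounds are the discussion values mentioned above.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Hypothetical data standing in for the real inputs.
rng = np.random.default_rng(0)
n_records, n_targets = 1_000, 5
variable_matrix = rng.lognormal(size=(n_records, n_targets))  # one row per record
wght = rng.uniform(0.5, 2.0, size=n_records)                  # original weights
targets = (variable_matrix * wght[:, np.newaxis]).sum(axis=0) * 1.1  # made-up targets

# Scale each record's row by its original weight before transposing, so the
# unknown x is the ratio of new weight to original weight (centered near 1).
A = (variable_matrix * wght[:, np.newaxis]).T  # shape (n_targets, n_records)

# Bound the ratio, not the weight itself; 0.01 and 10.0 are the discussion values.
result = lsq_linear(A, targets, bounds=(0.01, 10.0))

ratio = result.x
new_weight = ratio * wght

# Optional reporting on quantiles of original weights, new weights, and ratios.
qs = [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]
print("original weights:", np.round(np.quantile(wght, qs), 3))
print("new weights:     ", np.round(np.quantile(new_weight, qs), 3))
print("ratio new/old:   ", np.round(np.quantile(ratio, qs), 3))
```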
In examining the results of this PR, in comparison to attempting to set bounds on new weights directly, I have found that:

- The ratio approach appears to improve the numerical qualities of the problem substantially, so that `lsq_linear` hits targets exactly even with sharp limits on weight changes, whereas it cannot do that when solving for new weights.
- It solves these problems in a handful of iterations, whereas with direct limits on weights it was taking thousands of iterations (see the comparison sketch after this list).
- The ratio of new to original weights seems to stay closer to 1 than it does when solving for new weights directly, even when not imposing upper bounds on the ratio.
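For comparison, here is a sketch (continuing the hypothetical data from the sketch above) of the direct-weight formulation that was struggling: solve for the new weights themselves and bound them directly. The particular weight bounds shown are illustrative, not the ones used in the code.

```python
# Direct-weight formulation: x is the new weight itself, with per-record bounds
# on the weights rather than on the ratio. Bounds shown are illustrative.
A_direct = variable_matrix.T
lower = 0.01 * wght
upper = 10.0 * wght
result_direct = lsq_linear(A_direct, targets, bounds=(lower, upper))

print("iterations, ratio formulation: ", result.nit)
print("iterations, direct formulation:", result_direct.nit)
print("max |target miss|, ratio: ", np.abs(A @ result.x - targets).max())
print("max |target miss|, direct:", np.abs(A_direct @ result_direct.x - targets).max())
```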
Thus, preliminary results are attractive. I think as we move from hypothetical problems to real-world problems, we will have to keep our options open. We will undoubtedly encounter new issues, and we may have reason to revisit the question of whether to solve for weights or ratios, and the question of whether to use a dedicated least-squares solver such as `lsq_linear` or a more general solver such as L-BFGS-B.
My PR is failing with some sort of bad credentials error (see below). I am guessing it is related to access to the 2015 PUF, which I have locally of course. I don't know how to fix this. Advice?