@donboyd5 said in issue #197:
> However, it might also make sense to set the program up to do many areas sequentially, driven by a file that has multiple areas (possibly a subset of the big file I just described, with areas and targets tailored to the problem at hand). In this case we'd want to have some sort of "apply" ability to apply a create-area-file function multiple times.
Both of these approaches seem to be at variance with the approach we decided on weeks ago, which `create_areas_weights.py` is designed to handle. Each area needs a `{area}_targets.csv` file in the `areas/targets` folder, and `create_areas_weights.py` will produce a `{area}_tmd_weights.csv.gz` file in the `areas/weights` folder. We can discuss how to automate this so that weight files are not unnecessarily produced (sort of like a Makefile works).
Here is the docstring at the top of `create_areas_weights.py`:
"""
Construct AREA_tmd_weights.csv.gz, a Tax-Calculator-style weights file
for 2021+ for the specified sub-national AREA.
AREA prefix for state areas are the two lower-case character postal codes.
AREA prefix for congressional districts are the state prefix followed by
two digits (with a leading zero) identifying the district. There are no
district files for states with only one congressional district.
"""
That's true. It seems valuable to me, especially in light of Matt's interest in hitting a wide range of CDs. But perhaps the value does not outweigh the work involved. I completely understand the need to weigh value against work and am not trying to force it.
It appears increasingly likely that it would be useful to produce area-specific weights for a large number of areas.
That raises at least 3 questions to think about:
1. Automating creation of multiple sets of weights
2. Pre-optimization identification of potential problems and possible pre-optimization adjustments
3. Post-optimization diagnosis of actual problems
Here are some thoughts on each.
**Automating creation of multiple sets of weights**
Because IRS data for all Congressional Districts will be formatted similarly, and all states will be formatted similarly, I expect that I will create one big file for each category, with an extra field containing a code for the CD or state. I could even combine the two files into one, with the code determining whether a record is for a state or a CD. If we keep the current setup for areas, I'll just extract individual CDs and states into their own files, as needed (or perhaps create all files at once).
However, it might also make sense to set the program up to do many areas sequentially, driven by a file that lists multiple areas (possibly a subset of the big file I just described, with areas and targets tailored to the problem at hand). In this case we'd want some sort of "apply" ability to apply a create-area-file function multiple times.
This would create some new considerations, because doing many areas in bulk will mean that we probably will not have the time (human time, yours and mine, not computer time) to give loving attention to each area and its targets.
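As a rough illustration of the "apply" idea, here is a minimal sketch of such a batch driver. It assumes a hypothetical combined targets CSV with an `area` column, and it assumes `create_areas_weights.py` can be invoked once per area from the command line; both are assumptions for illustration and would need to be adapted to the script's actual interface.

```python
# Hypothetical batch driver (illustrative only): split a combined targets file
# into the per-area {area}_targets.csv files that create_areas_weights.py
# expects, then run the existing program once per area.  The combined-file
# layout, the "area" column name, and the command-line call are assumptions.
import subprocess
from pathlib import Path

import pandas as pd


def run_all_areas(combined_targets_path: str,
                  targets_dir: str = "areas/targets") -> None:
    combined = pd.read_csv(combined_targets_path)
    for area, area_targets in combined.groupby("area"):
        # Write the per-area targets file in the expected location.
        out_path = Path(targets_dir) / f"{area}_targets.csv"
        area_targets.drop(columns=["area"]).to_csv(out_path, index=False)
        # Invoke the existing per-area program (interface assumed here); it
        # should produce areas/weights/{area}_tmd_weights.csv.gz as described.
        subprocess.run(["python", "create_areas_weights.py", str(area)],
                       check=True)
```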
**Pre-optimization identification of potential problems and possible pre-optimization adjustments**
If we do areas in bulk, we'll want some way to identify potential problems and possibly even adjust for them in an automated way.
Identification probably entails identifying targets for individual areas that appear hard, and identifying areas that appear to have combinations of targets that are hard. At least initially, I think we'll want to use a fairly simple approach.
Identifying problematic targets:
Identifying problematic combinations of targets:
We'd want to check whether the problem cannot be solved to equality. I think a test for this sort of inconsistency would be to see whether the rank of the matrix A equals the rank of the augmented matrix [A | b], which I think could be done in code with `rank(A) != rank([A | b])` (all subject to verification).
We also might want some reasonableness tests, for example a measure of whether a lot of targets are far from initial values: maybe some sort of simple statistic on the ratio of (Ax with x0 = 1) to b, such as the RMSE of (ratio - 1), or the fraction of targets where the absolute value of (ratio - 1) is greater than some number (e.g., 1.5).
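To make the two checks concrete, here is a rough numpy sketch, all subject to verification; `A`, `b`, and the threshold are placeholders rather than objects taken from the existing code.

```python
# Rough pre-optimization diagnostics for a target system A @ x ~ b
# (illustrative placeholders, not the actual tmd data structures).
import numpy as np


def diagnose_targets(A: np.ndarray, b: np.ndarray, ratio_tol: float = 1.5) -> dict:
    # Consistency check: A x = b has an exact solution only if
    # rank(A) == rank([A | b]).
    inconsistent = (np.linalg.matrix_rank(A)
                    != np.linalg.matrix_rank(np.column_stack([A, b])))

    # Reasonableness check: compare targets to initial values by setting
    # every element of x0 to 1 and looking at the ratio of A @ x0 to b.
    ratio = (A @ np.ones(A.shape[1])) / b
    rmse = float(np.sqrt(np.mean((ratio - 1.0) ** 2)))
    frac_far = float(np.mean(np.abs(ratio - 1.0) > ratio_tol))

    return {"inconsistent": inconsistent, "rmse": rmse, "frac_far": frac_far}
```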
The next question is what to do when we identify problematic targets or groups of targets. We might want to drop problematic targets, or come up with a way of downweighting them (by putting a weight for each target into the objective function and reducing the weights of problematic targets). Ideally we'd like to automate this, but in the beginning we might have to do a lot of inspection to see what kind of automation would make sense.
If we decide that the targets as a group are problematic, at least initially we'd probably have to inspect manually and drop or downweight the worst targets, or drop inconsistent targets.
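One simple way to express the downweighting idea is to attach an importance weight to each target in the objective. A sketch of that idea (not the objective actually used in `create_areas_weights.py`) might look like this:

```python
# Illustrative per-target importance weighting: importance[i] can be reduced
# (or set to zero) for targets flagged as problematic.  This is a sketch of
# the idea, not the objective function used in create_areas_weights.py.
import numpy as np


def weighted_target_loss(x: np.ndarray, A: np.ndarray, b: np.ndarray,
                         importance: np.ndarray) -> float:
    # Relative miss for each target, squared and scaled by its importance.
    rel_miss = (A @ x) / b - 1.0
    return float(np.sum(importance * rel_miss ** 2))
```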
**Post-optimization diagnosis of actual problems**
Doing large numbers of CDs (or states) will make it valuable to have good post-optimization checks that make it easy to identify when results for a particular target in a given area, or for an area as a whole, are worrisome or implausible, even when we believe the problem passed to the solver was reasonable (as a result of our pre-optimization examination).
There are two kinds of post-optimization tests we can do: (1) working within the set of targets we passed to the solver, and (2) broadening our examination to look for "innocent bystanders" - non-targeted values that appear implausible.
For the first kind of examination, we might try to set up a few simple screens that trigger manual examination of an area, or perhaps "yellow-flagging" an area as in "use with extreme caution". Screens might be (just illustrations):
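As one hypothetical placeholder, a screen of this first kind might flag any target whose achieved value misses its target by more than a tolerance, and yellow-flag the area if too many targets miss; for example (thresholds are placeholders, not agreed values):

```python
# Hypothetical post-optimization screen for a single area: given the solver's
# solution x, compare achieved targeted totals A @ x to the targets b and
# flag large misses.  The tolerances below are placeholders.
import numpy as np


def screen_area(A: np.ndarray, b: np.ndarray, x: np.ndarray,
                miss_tol: float = 0.10, max_frac_missed: float = 0.05) -> dict:
    rel_miss = np.abs((A @ x) / b - 1.0)
    missed = rel_miss > miss_tol
    return {
        "missed_targets": np.flatnonzero(missed),   # indices of worrisome targets
        "yellow_flag": bool(missed.mean() > max_frac_missed),
    }
```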
The second kind of examination is potentially a lot more work and we might not get to it right away, but in the longer run it would be important. We might set up screens with the general format: