@martinholmer closed this 2 months ago.
Thanks, @martinholmer, for reducing ftol and gtol. On my computer the zz targets now fit within the target tolerance in just the initial iteration of the delta loop. Results below.
(tmd) tax-microdata-benchmarking$ python -m tmd.areas.create_area_weights "zz"
CREATING WEIGHTS FILE FOR AREA zz ...
INITIAL WEIGHTS STATISTICS:
sum of national weights = 1.840247e+08
area weights_scale = 9.871864e-02
USING zz_targets.csv FILE WITH 8 TARGETS
DISTRIBUTION OF TARGET ACT/EXP RATIOS (n=8):
low bin ratio high bin ratio bin # cum # bin % cum %
>= 0.400000, < 0.800000: 2 2 25.00% 25.00%
>= 0.800000, < 0.900000: 1 3 12.50% 37.50%
>= 0.900000, < 0.990000: 0 3 0.00% 37.50%
>= 0.990000, < 0.999500: 0 3 0.00% 37.50%
>= 0.999500, < 1.000500: 1 4 12.50% 50.00%
>= 1.000500, < 1.010000: 0 4 0.00% 50.00%
>= 1.010000, < 1.100000: 0 4 0.00% 50.00%
>= 1.100000, < 1.200000: 0 4 0.00% 50.00%
>= 1.200000, < 1.600000: 2 6 25.00% 75.00%
>= 1.600000, < 2.000000: 0 6 0.00% 75.00%
>= 2.000000, < 3.000000: 0 6 0.00% 75.00%
>= 3.000000, < 4.000000: 1 7 12.50% 87.50%
>= 4.000000, < 5.000000: 0 7 0.00% 87.50%
>= 5.000000, < inf: 1 8 12.50% 100.00%
US_PROPORTIONALLY_SCALED_TARGET_RMSE= 3.024013868e+01
target_matrix sparsity ratio = 0.476
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
OPTIMIZE WEIGHT RATIOS IN A REGULARIZATION LOOP
where REGULARIZATION DELTA starts at 1.000000e-09
and where target_matrix.shape= (225256, 8)
::loop,delta,misses,exectime(secs): 1 1.000000e-09 0 23.2
>>> final delta loop exectime= 23.2 secs iterations=536 success=True
>>> message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
>>> L-BFGS-B optimized objective function value: 7.394127110e-05
AREA-OPTIMIZED_TARGET_MISSES= 0
DISTRIBUTION OF TARGET ACT/EXP RATIOS (n=8):
with REGULARIZATION_DELTA= 1.000000e-09
low bin ratio high bin ratio bin # cum # bin % cum %
>= 0.999500, < 1.000500: 8 8 100.00% 100.00%
AREA-OPTIMIZED_TARGET_RMSE= 2.496976857e-04
DISTRIBUTION OF AREA/US WEIGHT RATIO (n=225256):
with REGULARIZATION_DELTA= 1.000000e-09
low bin ratio high bin ratio bin # cum # bin % cum %
>= 0.000000, < 0.000001: 2440 2440 1.08% 1.08%
>= 0.000001, < 0.100000: 47843 50283 21.24% 22.32%
>= 0.100000, < 0.200000: 1613 51896 0.72% 23.04%
>= 0.200000, < 0.500000: 5532 57428 2.46% 25.49%
>= 0.500000, < 0.800000: 10602 68030 4.71% 30.20%
>= 0.800000, < 0.850000: 5597 73627 2.48% 32.69%
>= 0.850000, < 0.900000: 11667 85294 5.18% 37.87%
>= 0.900000, < 0.950000: 14245 99539 6.32% 44.19%
>= 0.950000, < 1.000000: 27624 127163 12.26% 56.45%
>= 1.000000, < 1.050000: 25165 152328 11.17% 67.62%
>= 1.050000, < 1.100000: 14157 166485 6.28% 73.91%
>= 1.100000, < 1.150000: 10674 177159 4.74% 78.65%
>= 1.150000, < 1.200000: 13303 190462 5.91% 84.55%
>= 1.200000, < 2.000000: 33163 223625 14.72% 99.28%
>= 2.000000, < 5.000000: 1490 225115 0.66% 99.94%
>= 5.000000, < 10.000000: 112 225227 0.05% 99.99%
>= 10.000000, < 100.000000: 29 225256 0.01% 100.00%
SUM OF SQUARED AREA/US WEIGHT RATIO DEVIATIONS= 7.344248e+04
(tmd) tax-microdata-benchmarking$
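For readers following along, here is a minimal sketch of the kind of fit the log above suggests: weight ratios optimized by scipy's L-BFGS-B against the area target matrix, with an L2 penalty of strength delta pulling the ratios toward 1, and with ftol/gtol as the stopping tolerances that were tightened. This is my own illustration, not the tmd.areas code; the objective form, the bounds, and the function names are assumptions.

# Hypothetical sketch, not the actual tmd.areas implementation.
import numpy as np
from scipy.optimize import minimize

def fit_area_ratios(target_matrix, targets, delta, ftol, gtol):
    # target_matrix is (records, targets), e.g. (225256, 8) in the zz run
    nrec = target_matrix.shape[0]

    def obj_and_grad(r):
        # relative miss of each target, plus L2 pull of ratios toward 1
        rel_miss = (target_matrix.T @ r) / targets - 1.0
        f = np.sum(rel_miss**2) + delta * np.sum((r - 1.0) ** 2)
        g = 2.0 * (target_matrix @ (rel_miss / targets)) + 2.0 * delta * (r - 1.0)
        return f, g

    return minimize(
        obj_and_grad,
        x0=np.ones(nrec),             # start every record at ratio = 1
        jac=True,                     # obj_and_grad returns (value, gradient)
        method="L-BFGS-B",
        bounds=[(0.0, None)] * nrec,  # assumed: ratios kept non-negative
        options={"ftol": ftol, "gtol": gtol, "maxiter": 2000},
    )

The delta loop in the log would wrap a call like this, re-solving with an adjusted delta until no target misses its tolerance; in the zz run above the first delta already yields zero misses, with scipy reporting its standard CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH message after 536 iterations.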
@donboyd5, it's good that your zz computation did not enter the delta loop, but even with the smaller FTOL value your results still differ from what I get on my computer, because the reweighting of the national weights comes out differently on your machine and mine:
(taxcalc-dev) weights% diff zz.log zz.db
3,4c3,4
< sum of national weights = 1.839834e+08 <<<<<<<<<<<< MY REWEIGHT RESULTS
< area weights_scale = 9.874809e-02
---
> sum of national weights = 1.840247e+08 <<<<<<<<<<<< YOUR REWEIGHT RESULTS
> area weights_scale = 9.871864e-02
22c22
< US_PROPORTIONALLY_SCALED_TARGET_RMSE= 3.020677339e+01
---
> US_PROPORTIONALLY_SCALED_TARGET_RMSE= 3.024013868e+01
---[SNIP]---
@martinholmer Thanks. After I am back from vacation (returning Oct 11) I'll put together the csv file needed to use the L-BFGS-B method for the national data. I think that approach will be faster than SGD, provide a better solution, and allow us to pull weight ratios toward 1, which the SGD approach as implemented does not. Finally, I think it will be machine-independent, or nearly so. Right now SGD does not appear to solve to a high degree of precision despite being allowed 2000 iterations. If it did solve to a high degree of precision, that is, if the objective function really were minimized to within a small tolerance, then machine-dependent differences should virtually disappear: every machine would reach the optimum to within that tolerance, and the resulting tax-data differences would be extremely small, perhaps not even visible. In any event, we will be able to see in a few weeks.
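One way to quantify the precision point above is to measure how far a candidate solution sits from a stationary point of the objective, since that max-norm of the gradient is exactly the quantity L-BFGS-B's gtol bounds at convergence. The sketch below reuses the assumed penalized objective from the earlier sketch and is purely illustrative; none of these names come from the tmd code base.

# Hypothetical diagnostic: stationarity gap of a candidate ratio vector
# under the assumed penalized least-squares objective.
import numpy as np

def stationarity_gap(target_matrix, targets, ratios, delta):
    rel_miss = (target_matrix.T @ ratios) / targets - 1.0
    grad = (2.0 * (target_matrix @ (rel_miss / targets))
            + 2.0 * delta * (ratios - 1.0))
    return np.max(np.abs(grad))   # L-BFGS-B stops when this is <= gtol

If an SGD solution leaves this gap orders of magnitude above a tight gtol, that alone could explain the machine-to-machine differences shown in the diff above; and with delta > 0 the objective is strictly convex, so any two machines that both drive the gap below a small gtol must land on nearly the same weights.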
In response to the discussion of merged PR #203 ending with this comment.