In the course of commenting on Tax-Calculator issue 2630, I found that in the CPS data there are tax units for which XTOT is not equal to the sum of the three age-group-count variables: nu18, n1820, and n21.
The docstring for the relationships function in the tests/test_data.py module says this:
Test the relationships between variables.
Note (1): we have weakened the XTOT == sum of nu18, n1820, n21 assertion
for the PUF because in PUF data the value of XTOT is capped by IRS-SOI.
Note (2): we have weakened the n24 <= nu18 assertion for the PUF because
the only way to ensure it held true would be to create extremely small
bins during the tax unit matching process, which had the potential to
reduce the overall match accuracy.
But Note (1) in this documentation has been inaccurate since the merge of PR #314 on 24 June 2020. The #314 changes included (without offering any rationale) a change from == to >= for CPS data in this code:
if dataname == "CPS":
    m = eq_str.format(dataname, "XTOT", "sum of nu18, n1820, n21")
    assert np.all(data["XTOT"] >= nsums), m
else:
    # see Note (1) in docstring
    m = less_than_str.format(dataname, "XTOT", "sum of nu18, n1820, n21")
    assert np.all(data["XTOT"] <= nsums), m
So, the SAS-generated CPS tax units did pass the == consistency test, but now the Python-generated CPS tax units are not being subjected to the == consistency test. Now I find roughly one thousand CPS tax units where the n* sum is not equal to XTOT. Here is the tabulation I did:
(taxcalc-dev) ~% tc cps.csv 2021 --sqldb --dvars dumpvars
You loaded data for 2014.
Tax-Calculator startup automatically extrapolated your data to 2021.
(taxcalc-dev) ~% cat bug1.sql
.mode column
.width -1
select count(*) as total_num_rows
from dump;
select count(*) as num_rows_with_bug1
from dump where XTOT!=(nu18+n1820+n21);
(taxcalc-dev) ~% sqlite3 cps-21-#-#-#.db <bug1.sql
total_num_rows
--------------
280005
num_rows_with_bug1
------------------
1095
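For anyone who prefers pandas to sqlite3, the same check can be sketched roughly as follows. This is only an illustration: it assumes the data are in a DataFrame with XTOT, nu18, n1820, and n21 columns, the function name `tabulate_xtot_gaps` is hypothetical, and the toy rows are made up rather than taken from cps.csv.

```python
import pandas as pd


def tabulate_xtot_gaps(df):
    """Count tax units by the gap between XTOT and nu18 + n1820 + n21.

    A gap of 0 means the unit is consistent; any other value is a
    unit that fails the == consistency test.
    """
    nsum = df["nu18"] + df["n1820"] + df["n21"]
    gap = (df["XTOT"] - nsum).rename("xtot_minus_nsum")
    return gap.value_counts().sort_index()


# Made-up rows for illustration only (NOT taken from cps.csv).
toy = pd.DataFrame({
    "XTOT":  [1, 2, 3, 4],
    "nu18":  [0, 0, 1, 2],
    "n1820": [0, 0, 0, 0],
    "n21":   [1, 1, 2, 2],
})
counts = tabulate_xtot_gaps(toy)
print(counts)
```

With real data, replacing `toy` with something like `pd.read_csv("cps.csv")` should give the number of inconsistent units for each size of the gap.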
To understand better the nature of the data inconsistencies (all of which involve XTOT being larger than the n* sum by one), here are the 1095 rows disaggregated:
@andersonfrailey, I'm hoping you can fix this bug soon. Dealing with the inconsistent CPS data has been an enormous waste of my time (and, I imagine, a waste of other users' time). And, in addition to the wasted time, problems like this tend to erode user confidence in the data used by Tax-Calculator.