CPS BUG: XTOT not always equal to (nu18 + n1820 + n21)

In the course of commenting on Tax-Calculator issue 2630, I found that in the CPS data there are tax units for which XTOT is not equal to the sum of the three age-group-count variables: nu18, n1820, and n21.

The docstring for the relationships function in the tests/test_data.py module says this:

    Test the relationships between variables.

    Note (1): we have weakened the XTOT == sum of nu18, n1820, n21 assertion
    for the PUF because in PUF data the value of XTOT is capped by IRS-SOI.

    Note (2): we have weakened the n24 <= nu18 assertion for the PUF because
    the only way to ensure it held true would be to create extremely small
    bins during the tax unit matching process, which had the potential to
    reduce the overall match accuracy.

But Note (1) in this docstring has been inaccurate since PR #314 was merged on 24 June 2020. Among its changes, and without offering any rationale, PR #314 weakened the CPS assertion from == to >= in this code:

    if dataname == "CPS":
        m = eq_str.format(dataname, "XTOT", "sum of nu18, n1820, n21")
        assert np.all(data["XTOT"] >= nsums), m
    else:
        # see Note (1) in docstring
        m = less_than_str.format(dataname, "XTOT", "sum of nu18, n1820, n21")
        assert np.all(data["XTOT"] <= nsums), m
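
To make the difference between the two assertions concrete, here is a minimal sketch on made-up data (the three tax units below are hypothetical and are not taken from cps.csv). It assumes only NumPy and shows that a unit whose XTOT exceeds the n* sum by one passes the >= test but would fail the == test:

```python
import numpy as np

# Hypothetical toy data: three tax units; the last one has XTOT
# one larger than nu18 + n1820 + n21.
data = {
    "XTOT": np.array([2, 3, 4]),
    "nu18": np.array([0, 1, 2]),
    "n1820": np.array([0, 0, 0]),
    "n21": np.array([2, 2, 1]),
}
nsums = data["nu18"] + data["n1820"] + data["n21"]

# The weakened (>=) check accepts the inconsistent unit ...
passes_weak = bool(np.all(data["XTOT"] >= nsums))
# ... while a strict (==) check would reject it.
passes_strict = bool(np.all(data["XTOT"] == nsums))
# Number of units where XTOT differs from the n* sum.
num_mismatched = int(np.sum(data["XTOT"] != nsums))

print(passes_weak, passes_strict, num_mismatched)
```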

So the SAS-generated CPS tax units passed the == consistency test, but the Python-generated CPS tax units are no longer subjected to it. I now find 1,095 CPS tax units (out of 280,005) for which XTOT is not equal to the n* sum. Here is the tabulation I did:

(taxcalc-dev) ~% tc cps.csv 2021 --sqldb --dvars dumpvars    
You loaded data for 2014.
Tax-Calculator startup automatically extrapolated your data to 2021.

(taxcalc-dev) ~% cat bug1.sql
.mode column
.width -1
select count(*) as total_num_rows
  from dump;
select count(*) as num_rows_with_bug1
  from dump where XTOT!=(nu18+n1820+n21);

(taxcalc-dev) ~% sqlite3 cps-21-#-#-#.db <bug1.sql
total_num_rows
--------------
        280005
num_rows_with_bug1
------------------
              1095

To better understand the nature of the data inconsistencies (in every case XTOT is larger than the n* sum by exactly one), here are the 1095 rows disaggregated:

diff  MARS  XTOT  n24  nu18  n1820  n21  count(*)
----  ----  ----  ---  ----  -----  ---  --------
   1     2     2    0     0      0    1         1 
   1     2     3    0     1      0    1         6
   1     2     3    1     1      0    1        55
   1     2     3    1     1      1    0         4
   1     2     3    1     2      0    0         1
   1     2     4    2     2      0    1        19
   1     2     5    2     3      0    1         3
   1     2     5    3     3      0    1        11
   1     2     6    4     4      0    1         3
   1     4     2    0     0      0    1        42
   1     4     3    0     0      1    1        30
   1     4     3    0     1      0    1        51 
   1     4     3    1     1      0    1       377
   1     4     3    1     1      1    0         9  
   1     4     4    0     0      2    1         2 
   1     4     4    0     1      1    1         5  
   1     4     4    1     1      0    2         1  
   1     4     4    1     1      1    1        15  
   1     4     4    1     2      0    1        21 
   1     4     4    2     2      0    1       259 
   1     4     4    2     2      1    0         3  
   1     4     5    0     1      1    2         1   
   1     4     5    1     1      1    2         1   
   1     4     5    1     1      2    1         1    
   1     4     5    2     2      1    1        13  
   1     4     5    2     3      0    1        11  
   1     4     5    3     3      0    1        86 
   1     4     6    2     3      1    1         2   
   1     4     6    3     3      1    1         6  
   1     4     6    3     4      0    1         6  
   1     4     6    4     4      0    1        33
   1     4     7    4     4      1    1         3   
   1     4     7    4     5      0    1         1   
   1     4     7    5     5      0    1         9  
   1     4     8    5     5      1    1         2   
   1     4    10    7     8      0    1         1  
   1     4    10    8     8      0    1         1
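
The query that produced the breakdown above is not shown, so the following is only a sketch of one way a table of that shape could be generated. It uses Python's standard sqlite3 module on an in-memory table; the table name dump matches the tabulation earlier, but the four rows are hypothetical and the query is an assumption, not the one actually run:

```python
import sqlite3

# Hypothetical rows (MARS, XTOT, n24, nu18, n1820, n21); not from cps.csv.
rows = [
    (2, 2, 0, 0, 0, 2),  # consistent: XTOT equals the n* sum
    (4, 3, 1, 1, 0, 1),  # XTOT exceeds the n* sum by one
    (4, 3, 1, 1, 0, 1),  # same pattern as the previous row
    (2, 3, 1, 1, 0, 1),  # XTOT exceeds the n* sum by one
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table dump "
    "(MARS int, XTOT int, n24 int, nu18 int, n1820 int, n21 int)"
)
conn.executemany("insert into dump values (?, ?, ?, ?, ?, ?)", rows)

# Keep only the inconsistent rows and count each distinct pattern.
query = """
select XTOT - (nu18 + n1820 + n21) as diff,
       MARS, XTOT, n24, nu18, n1820, n21, count(*)
  from dump
 where XTOT != (nu18 + n1820 + n21)
 group by MARS, XTOT, n24, nu18, n1820, n21
 order by diff, MARS, XTOT, n24, nu18, n1820, n21;
"""
breakdown = conn.execute(query).fetchall()
for row in breakdown:
    print(row)
conn.close()
```

Against a real dump database, the same query text could be placed in a .sql file and passed to sqlite3 in the same way as bug1.sql.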

@andersonfrailey, I'm hoping you can fix this bug soon. Dealing with the inconsistent CPS data has been an enormous waste of my time (and, I imagine, a waste of other users' time). And, in addition to the wasted time, problems like this tend to erode user confidence in the data used by Tax-Calculator.

PSLmodels / taxdata

CPS BUG: XTOT not always equal to (nu18 + n1820 + n21) #408