PSLmodels / Tax-Calculator

USA Federal Individual Income and Payroll Tax Microsimulation Model
https://taxcalc.pslmodels.org

Why are FLPDYR+h_seq+ffpos values not unique in cps.csv file? #1658

Closed martinholmer closed 6 years ago

martinholmer commented 6 years ago

Merged pull request #1635 added two new variables to the cps.csv file. Here is what was said in the taxdata repo about these two new variables:

This PR completes a request from @MattHJensen to add the household sequence and the unique family identifier variables, h_seq and ffpos, respectively, to the CPS file. They are named in accordance with the NBER documentation. Using the two in combination, you can identify individual families in the CPS.

For example, you may have a household with h_seq == 1 that contains two families. ffpos for the two families will be 1 and 2.

A corresponding Tax-Calculator PR [#1635] is forthcoming.

Combining the h_seq and ffpos values for each family produces a unique identifier within a single CPS sample. However, the cps.csv sample contains records from three different CPS samples (for different years). So, without knowing which of the three CPS samples each record is from, it is impossible to match a record in the cps.csv file with its complete information in a Census CPS file. Wasn't the whole idea behind adding these variables to let users pull extra information from the Census CPS files by exact-matching against the records in the cps.csv file? If so, then it would seem they cannot do that.

The following tabulations illustrate the problem:

iMac:tab mrh$ conda install -c ospc taxcalc
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /Users/mrh/anaconda2:

The following NEW packages will be INSTALLED:

    taxcalc: 0.13.1-py27_0 ospc

Proceed ([y]/n)? y

taxcalc-0.13.1 100% |################################| Time: 0:00:04   9.07 MB/s

iMac:tab mrh$ tc cps.csv 2014 --sqldb
You loaded data for 2014.
Tax-Calculator startup automatically extrapolated your data to 2014.

iMac:tab mrh$ ls -l
total 512232
-rw-r--r--  1 mrh  staff        205 Nov 11 09:06 cps-14-#-#-doc.text
-rw-r--r--  1 mrh  staff   17681490 Nov 11 09:06 cps-14-#-#.csv
-rw-r--r--  1 mrh  staff  244576256 Nov 11 09:06 cps-14-#-#.db

iMac:tab mrh$ sqlite3 cps-14-#-#.db 
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from dump;
456465
sqlite> select min(h_seq),max(h_seq) from dump;
1|99461
sqlite> select min(ffpos),max(ffpos) from dump;
1|11
sqlite> select x,count(*)
          from (select h_seq*1000+ffpos as x from dump)
      group by x having count(*)>1
      order by x limit 9;
2001|3
2002|2
3001|2
5001|4
8001|2
9001|2
10001|3
11001|2
13001|2
sqlite> select id,RECID
          from (select h_seq*1000+ffpos as id, RECID from dump)
      where id<10000 order by id;
1001|178879
1002|178880
2001|1
2001|178881
2001|178882
2002|2
2002|178883
3001|3
3001|292368
5001|4
5001|178884
5001|292369
5001|292370
6001|5
7001|292371
8001|6
8001|178885
9001|7
9001|292372
sqlite> .q

iMac:tab mrh$

So, for example, the three records with h_seq=2 and ffpos=1 have unique RECID values, but that doesn't help a user figure out which Census CPS file those three cps.csv records were drawn from.
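For reference, a rough pandas equivalent of the sqlite3 tabulation above (a sketch only; it assumes cps.csv is in the current directory and mirrors the h_seq*1000+ffpos key used in the query):

import pandas as pd

cps = pd.read_csv('cps.csv')
# Same combined key as in the SQL query above.
key = cps['h_seq'] * 1000 + cps['ffpos']
dup = key.value_counts()
# Keys that appear on more than one record, smallest key first.
print(dup[dup > 1].sort_index().head(9))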

@MattHJensen @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun

martinholmer commented 6 years ago

Another question about the new h_seq and ffpos variables in the cps.csv file:

Assuming that h_seq+ffpos is unique within each CPS sample, wouldn't a user expect there to be no more than three records with the same h_seq+ffpos value in the cps.csv file?

Maybe I'm confused, but that was what I was expecting. So the following tabulation made me wonder about what's going on.

iMac:junk mrh$ sqlite3 cps-14-#-#.db 
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select id,count(*)
          from (select h_seq*1000+ffpos as id from dump)
      group by id having count(*)>40
      order by id;
6809001|45
8449001|45
14597001|46
16499001|45
23946001|46
24486001|47
35717001|46
37989001|45
41250001|45
57442001|60
77610001|45
93498001|46
95198001|45
97102001|45
97439001|47
sqlite> .q

iMac:junk mrh$ 

Why is it that there are 60 records in the cps.csv file that have h_seq=57442 and ffpos=1? Am I doing the tabulations wrong? Or is there a reason there can be sixty records with the same h_seq+ffpos value?

@MattHJensen @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun

ernietedeschi commented 6 years ago

Don't we know based on FLPDYR in the original unaged cps.csv which CPS sample each is from? That is, it's from the FLPDYR + 1 ASEC?

To aid with CPS merging, I created a simple database with just RECID, h_seq, ffpos, and a new variable called yearcps equal to FLPDYR + 1.
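A minimal sketch of that kind of crosswalk, built from the unaged cps.csv (the yearcps name is just the one described above):

import pandas as pd

cps = pd.read_csv('cps.csv')
crosswalk = cps[['RECID', 'h_seq', 'ffpos']].copy()
# FLPDYR in the unaged file is the income year; the ASEC sample year is one later.
crosswalk['yearcps'] = cps['FLPDYR'] + 1
crosswalk.to_csv('cps_crosswalk.csv', index=False)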

martinholmer commented 6 years ago

@evtedeschi3 said:

Don't we know based on FLPDYR in the original unaged cps.csv which CPS sample each is from? That is, it's from the FLPDYR + 1 ASEC?

Yes, good point. But in any Tax-Calculator output file, all the filing units have the same FLPDYR value, which is set to the tax-analysis year. This is why you did the following with the cps.csv input file:

To aid with CPS merging, I created a simple database with just RECID, h_seq, ffpos, and a new variable called yearcps equal to FLPDYR + 1.

But when I combine FLPDYR with h_seq and ffpos, things are better but still not good enough to exact-match other CPS variables to the cps.csv file. After converting the cps.csv file into an SQLite3 database, I get these results:

$ sqlite3 cps.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from data;                                       
456465
sqlite> select id,count(*)
          from (select FLPDYR*1000000000+h_seq*1000+ffpos as id from data)
          group by id having count(*)>13
          order by id limit 9;
2012000115001|15
2012000153001|16
2012000155001|15
2012000181001|15
2012000260001|15
2012000285001|15
2012000318001|15
2012000322001|15
2012000369001|15

Notice the limit 9; the number of FLPDYR/h_seq/ffpos values that are assigned to more than 13 filing units is far larger than 9.

Again, maybe I'm doing the calculations wrong, but it seems as if FLPDYR/h_seq/ffpos values are not unique.
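In pandas terms, the check I'm running amounts to something like this (a sketch run against the raw cps.csv, which should give the same counts):

import pandas as pd

cps = pd.read_csv('cps.csv')
# Count records per (FLPDYR, h_seq, ffpos) combination.
sizes = cps.groupby(['FLPDYR', 'h_seq', 'ffpos']).size()
print((sizes > 1).sum(), 'combinations are shared by more than one record')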

Have you checked your simple database to see if the FLPDYR/h_seq/ffpos values are unique?

andersonfrailey commented 6 years ago

Thanks for pointing this out @martinholmer. @evtedeschi3 is correct that you would also need to account for the year of the CPS in the tabulations, which it appears you do in your most recent comment.

For my own clarification: when you converted cps.csv to an SQLite database and produced the numbers above, that was still the input file, correct?

It's possible that some families may have been split into multiple tax units (think dependent filers particularly), in which case there would be multiple tax units with identical h_seq, ffpos, and FLPDYR, though I wouldn't expect it to happen as frequently as it appears to based on your tabulations.

I will also read through the CPS documentation to see if there are any identification variables which would be better for what we're trying to use h_seq and ffpos for.

Amy-Xu commented 6 years ago

Just to add one quick thought -- imputations to remove top-coding could be a source for h_seq + ffpos duplicates.

ernietedeschi commented 6 years ago

One thought: Are you trying to merge individual ASEC variables? Remember, there will be several individuals in each year/h_seq/ffpos combination.

ernietedeschi commented 6 years ago

Also, shifting to the cps.csv file, it is absolutely the case that in some instances it splits a family into two or more tax units.

To get true unique matching at the individual level, you would need to add a variable for the PULINENO of the head, and another one for the spouse.
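For illustration, a merge along those lines might look like this (a sketch only: the head_lineno column and the asec_persons.csv person-level extract are hypothetical, since cps.csv carries no line-number variables today):

import pandas as pd

units = pd.read_csv('cps.csv')             # would need head_lineno added upstream
persons = pd.read_csv('asec_persons.csv')  # hypothetical person-level ASEC extract

units['yearcps'] = units['FLPDYR'] + 1
# Attach the ASEC person record for each tax-unit head.
merged = units.merge(
    persons,
    left_on=['yearcps', 'h_seq', 'ffpos', 'head_lineno'],
    right_on=['year', 'h_seq', 'ffpos', 'PULINENO'],
    how='left',
)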

martinholmer commented 6 years ago

@andersonfrailey, Here is how I converted cps.csv into cps.db:

import sqlite3
import pandas as pd

CSV_INPUT_FILENAME = 'cps.csv'
SQLDB_OUT_FILENAME = 'cps.db'

# Read the CSV input file and write it to a 'data' table in an SQLite3 database.
dframe = pd.read_csv(CSV_INPUT_FILENAME)
dbcon = sqlite3.connect(SQLDB_OUT_FILENAME)
dframe.to_sql('data', dbcon, if_exists='replace', index=False)
dbcon.close()
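Once the database exists, the same duplicate-count tabulations can also be run from pandas (a sketch):

import sqlite3
import pandas as pd

dbcon = sqlite3.connect('cps.db')
query = """
    select FLPDYR*1000000000 + h_seq*1000 + ffpos as id, count(*) as n
    from data
    group by id having count(*) > 1
    order by id limit 9
"""
print(pd.read_sql_query(query, dbcon))
dbcon.close()
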
martinholmer commented 6 years ago

Thanks for all the comments on #1658. I can see that, for several reasons, CPS families are split into separate tax filing units. But consider the following tabulation and my question below the tabulation results:

$ sqlite3 cps.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from data;
456465
sqlite> select id,count(*)
          from (select FLPDYR*1000000000+h_seq*1000+ffpos as id from data)
          group by id having count(*)>30
          order by id;
2012054455001|31
2012088531001|31
2013001435001|31
2013030759001|31
2013053994001|31
2013084195001|31
2013095063001|31
2014020301001|31
2014024797001|31
2014025287001|32
2014028517001|31
2014034002001|31
2014036978001|32
2014044160001|32
2014057442001|45
2014060481001|31
2014072395001|33
2014080568001|31
2014089284001|32
sqlite> .q
$ 

What kind of family is split into more than thirty filing units?

@MattHJensen @Amy-Xu @andersonfrailey @hdoupe @evtedeschi3

ernietedeschi commented 6 years ago

So I've highlighted all 30 of the tax units created from one such family (this is extrapolated to 2026).

That's... a lot of very rich young people.

[screenshot: spreadsheet with the 30 highlighted tax units, mostly young, very high-income records]

martinholmer commented 6 years ago

@evtedeschi3, Thanks for the extra information in issue #1658. The thirty filing units you highlighted are all from the March 2013 CPS record with h_seq=14597 and ffpos=1, as can be seen in the following tabulation:

$ sqlite3 cps.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from data;
456465
sqlite> select count(*) from data where RECID>=317095 and RECID<=317124; 
30
sqlite> select RECID,FLPDYR,h_seq,ffpos from data where RECID>=317095 and RECID<=317124;
317095|2012|14597|1
317096|2012|14597|1
317097|2012|14597|1
317098|2012|14597|1
317099|2012|14597|1
317100|2012|14597|1
317101|2012|14597|1
317102|2012|14597|1
317103|2012|14597|1
317104|2012|14597|1
317105|2012|14597|1
317106|2012|14597|1
317107|2012|14597|1
317108|2012|14597|1
317109|2012|14597|1
317110|2012|14597|1
317111|2012|14597|1
317112|2012|14597|1
317113|2012|14597|1
317114|2012|14597|1
317115|2012|14597|1
317116|2012|14597|1
317117|2012|14597|1
317118|2012|14597|1
317119|2012|14597|1
317120|2012|14597|1
317121|2012|14597|1
317122|2012|14597|1
317123|2012|14597|1
317124|2012|14597|1
sqlite> .q
$ 

The characteristics of this household, as you point out, are very unusual. First of all, there are 45 people (some single and some married) living at the same address, and they are all considered by Census to be in the same family. And most of them have very high incomes. So, is this a mistake in the preparation of the cps.csv file, or has Census actually sampled a rich commune? Or maybe there is a different explanation.

@MattHJensen @Amy-Xu @andersonfrailey @hdoupe

ernietedeschi commented 6 years ago

It’s particularly strange because the CPS by design doesn’t sample group quarters. So it’s not e.g. a dormitory or a barracks.

Amy-Xu commented 6 years ago

This might relate to the top-coding imputation I mentioned earlier. The algorithm is documented like so:

[screenshot: excerpt from the technical documentation describing the top-coding imputation algorithm]

This top-coding imputation is implemented for records with high income.

Source: http://www.quantria.com/assets/img/TechnicalDocumentationV4-2.pdf

andersonfrailey commented 6 years ago

@Amy-Xu beat me to it. We also have that noted in our documentation for the CPS file.

The one difference between what we do and what she posted is that we repeat the process 15 times rather than 10.

ernietedeschi commented 6 years ago

So that makes perfect sense, but it still means we have families of 30+ people who, evidently in this case, all have top-coded incomes! (The swap cutoff for the ASEC in 2013 was a personal wage of $250,000.)

Amy-Xu commented 6 years ago

@evtedeschi3 Why do you think we have families of 30+ people? Looking at this post by Martin and the 15 highlighted by Anderson, I'm guessing we most likely have a dozen high-income families of 3-4 people.

martinholmer commented 6 years ago

@andersonfrailey pointed to the taxdata documentation of the cps.csv file which says this:

Top-Coding

Certain income items in the CPS are top-coded to protect the identity of the individuals surveyed. Each tax unit with an individual who has been top-coded is flagged for further processing to account for this.

For the records that have been flagged, the top-coded value is replaced with a random variable whose mean is the top-coded amount. This process is repeated 15 times per flagged record, resulting in the record being represented 15 times in the production file with a different value for the top-coded fields in each and their weight divided by 15.
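Read literally, that algorithm amounts to something like the following (a schematic sketch only, with a made-up topcode_flag column and a fixed sigma standing in for the regression-based variance; it is not the actual taxdata code):

import numpy as np
import pandas as pd

N_REP = 15

def replicate_topcoded(units, topcoded_cols, sigma=0.5, seed=0):
    """Replace each flagged record with N_REP replicates whose top-coded
    fields are redrawn and whose weight (s006) is divided by N_REP."""
    rng = np.random.default_rng(seed)
    flagged = units[units['topcode_flag'] == 1]
    reps = pd.concat([flagged] * N_REP, ignore_index=True)
    for col in topcoded_cols:
        mean = reps[col].to_numpy(dtype=float)
        # Lognormal draw whose mean equals the top-coded amount.
        mu = np.log(mean) - 0.5 * sigma ** 2
        reps[col] = rng.lognormal(mu, sigma)
    reps['s006'] = reps['s006'] / N_REP
    rest = units[units['topcode_flag'] != 1]
    return pd.concat([rest, reps], ignore_index=True)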

Thanks for this helpful information. I still have two questions:

(1) The above documentation explains why there are duplicate cpsyear+h_seq+ffpos values in multiples of 15, 30, and 45.

What's the story for the many cps.csv filing units like these?

sqlite> select id,count(*)
          from (select FLPDYR*1000000000+h_seq*1000+ffpos as id from data)
          group by id having (count(*)>1 and count(*)!=15 and count(*)!=30 and count(*)!=45)
          order by id limit 9;
2012000005001|2
2012000010001|2
2012000017001|2
2012000021001|3
2012000031001|2
2012000059001|3
2012000067001|2
2012000077001|3
2012000079001|2

There's no top-coding, so what explains the fact that several filing units have the same cpsyear+h_seq+ffpos value?

(2) This top-coding documentation raises a completely different issue. A variable being top-coded means the actual value of that variable is larger than the top-coded amount, right? If so, then when the top-coded value is replaced with a random variable whose mean is the top-coded amount, many of the randomly assigned values will be less than the top-coded amount. That is an enormous bias, especially when the reform everybody is interested in right now contains major changes in the tax treatment of high-income filers.

@MattHJensen @Amy-Xu @hdoupe @codykallen @evtedeschi3

ernietedeschi commented 6 years ago

@martinholmer wrote

(2) This top-coding documentation raises a completely different issue. A variable being top-coded means the actual value of that variable is larger than the top-coded amount, right? If so, then when the top-coded value is replaced with a random variable whose mean is the top-coded amount, many of the randomly assigned values will be less than the top-coded amount. That is an enormous bias, especially when the reform everybody is interested in right now contains major changes in the tax treatment of high-income filers.

My impression is that the ASEC has not been strictly "top-coded" post-2011. Rather, the top-codes really act more as thresholds. Above them, values are swapped with other super-top-coded values to protect identities.

IPUMS has a very useful description of how this changed over time:

Starting in 2011, the Census Bureau shifted from the average replacement value system to a rank proximity swapping procedure. In this technique, all values greater than or equal to the income topcode are ranked from lowest to highest and systematically swapped with other values within a bounded interval. All swapped values are also rounded to two significant digits.
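A toy illustration of what rank proximity swapping does (a schematic sketch under simplifying assumptions; the actual Census procedure has details not captured here):

import numpy as np

def rank_proximity_swap(values, topcode, window=5, seed=0):
    """Swap each value at or above the topcode with another such value
    within a bounded rank interval, then round to two significant digits."""
    rng = np.random.default_rng(seed)
    out = np.asarray(values, dtype=float).copy()
    idx = np.where(out >= topcode)[0]
    order = idx[np.argsort(out[idx])]   # topcoded indices, ranked low to high
    ranks = np.arange(len(order), dtype=float)
    # Jitter each rank within +/- window and re-sort: a bounded permutation.
    permuted = order[np.argsort(ranks + rng.uniform(-window, window, len(order)))]
    swapped = out[permuted]             # values after swapping
    # Round the swapped values to two significant digits.
    mag = np.floor(np.log10(swapped))
    out[order] = np.round(swapped / 10 ** (mag - 1)) * 10 ** (mag - 1)
    return out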

martinholmer commented 6 years ago

Thanks, @Amy-Xu, for citing John's documentation of the CPS tax file. The discussion before the passage you quoted describes the Census CPS top-coding procedure, as @evtedeschi3 did in his helpful comment. That information has cleared up several misconceptions I had about how Census handles high income amounts.

But I have a couple of remaining questions about how the cps.csv file handles top-coded amounts. Here is what John says:

Top coding in this way [by Census] means that aggregate control totals derived from the CPS are meaningful in that they represent an estimate of the population total. However, in order to make the data suitable for tax analysis, we remove the top coding for each record and create “replicate” tax units for each top coded record according to the following algorithm:

Question (1): Exactly why do we need to replace the top-coded amount with 15 replicates? Exactly why are the Census top-coded amounts not "suitable for tax analysis"?

John continues:

  1. We search each of the income fields in the tax unit and identify those amounts that have been top coded and the tax unit is “flagged” for subsequent processing to remove the top coding.
  2. We replace each top coded amount with a lognormal random variable whose mean is the top coded amount and whose variance is initially set to equal the standard error of the estimate (SEE) of the regression equation. We then multiply the SEE by a factor to more closely match published SOI totals of the income distribution at the upper end of the income scale.
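(An aside on the mechanics, for concreteness: a lognormal draw is pinned down by its target mean and variance. If the draw should have mean $m$, the top-coded amount, and variance $v$, the scaled SEE, then the underlying normal parameters must be

$$\mu = \ln\!\left(\frac{m^{2}}{\sqrt{v + m^{2}}}\right), \qquad \sigma^{2} = \ln\!\left(1 + \frac{v}{m^{2}}\right),$$

since $E[e^{X}] = e^{\mu+\sigma^{2}/2}$ and $\mathrm{Var}[e^{X}] = (e^{\sigma^{2}}-1)\,e^{2\mu+\sigma^{2}}$ for $X \sim N(\mu,\sigma^{2})$.)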

Question (2): What "regression equation" is being referred to in the second item? I looked but didn't find the one being referred to here. Did I miss it?

@MattHJensen @andersonfrailey

martinholmer commented 6 years ago

Now that I understand that 15 (or 30 or 45) replicates have been used to replace a single CPS family with high income, the range of my questions is much narrower.

What's the story behind the many cps.csv filing units like these?

sqlite> select id,count(*) 
          from (select FLPDYR*1000000000+h_seq*1000+ffpos as id from data) 
          group by id having (count(*)>1 and count(*)<15) order by id
          limit 9;
2012000005001|2
2012000010001|2
2012000017001|2
2012000021001|3
2012000031001|2
2012000059001|3
2012000067001|2
2012000077001|3
2012000079001|2

There's no top-coding (because there are only two or three duplicates among the first nine of many), so what explains the fact that these cps.csv filing units have the same cpsyear+h_seq+ffpos value?

Amy-Xu commented 6 years ago

@martinholmer my intuition is that these 2-3 replicates are from families with more than one filing unit, but not necessarily high-income ones. Say a 20-year-old daughter with a part-time job might just file a separate return from her parents, while all three of them are just 'normal'-income people.

Amy-Xu commented 6 years ago

@martinholmer also asked:

Question (2): What "regression equation" is being referred to in the second item? I looked but didn't find the one being referred to here. Did I miss it?

My memory is a bit rusty, but I roughly remember that those averages are predictions from regressions of income on gender, race, and work experience. If you take a look at Chart 2 in the 2010 documentation, it gives the predictions by cell. And the standard error most likely refers to the error of the predictions.

If that's true, then it seems there's another problem here. Census only used this top-coding method for the CPS ASEC prior to 2011, which has nothing to do with our current file that includes the 2013-2015 CPS. The old top-coding method isn't good for tax analysis (I guess) because distributions are collapsed into averages, so restoring the distribution was a doable and sensible option. But now the top coding has been revised to a swapping method, and I don't know whether restoring the distribution through the old method still makes sense.

martinholmer commented 6 years ago

@Amy-Xu said in issue #1658:

my intuition is that these 2-3 replicates are from families with more than one filing unit, but not necessarily high-income ones. Say a 20-year-old daughter with a part-time job might just file a separate return from her parents, while all three of them are just 'normal'-income people.

Thanks for the explanation, which seems perfectly sensible.

martinholmer commented 6 years ago

@Amy-Xu said in issue #1658:

My memory is a bit rusty, but I roughly remember that those averages are predictions from regressions of income on gender, race and work experience. If you take a look at 2010 doc Chart 2, it gives the predictions by cell. And the standard error most likely refers to the error of predictions.

Chart 2 simply shows the average value above the top-coding threshold. There are no regression results in either Chart 1 or Chart 2, as you can see from this reproduction of those tables:

[screenshot: reproduction of Chart 1 and Chart 2 from the 2010 documentation]

martinholmer commented 6 years ago

@Amy-Xu said in issue #1658:

[I]t seems there's another problem here. Census only used this top-coding method for the CPS ASEC prior to 2011, which has nothing to do with our current file that includes the 2013-2015 CPS. The old top-coding method isn't good for tax analysis (I guess) because distributions [above the top-coding thresholds] are collapsed into averages, so restoring the distribution was a doable and sensible option. But now the top coding has been revised to a swapping method, and I don't know whether restoring the distribution through the old method still makes sense.

That's why I asked why all the replicates were being generated. Unless I'm missing something, there is no reason to go through all this replicates business for our more recent CPS files. And it's worse than just inefficient; it's wrong, because the regression imputation scheme used to construct the cps.csv file can often replace a top-coded value, which is by definition above the top-coding threshold, with a value that is below the top-coding threshold.
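(A quick way to see the size of that effect, as a hypothetical simulation rather than taxdata's actual code: with the draw's mean pinned to the top-code, well over half of the lognormal draws land below the top-code, because a lognormal's median lies below its mean.)

import numpy as np

rng = np.random.default_rng(0)
topcode, sigma = 250_000.0, 0.5            # illustrative values only
mu = np.log(topcode) - 0.5 * sigma ** 2    # pins the lognormal mean to the top-code
draws = rng.lognormal(mu, sigma, 100_000)
print('mean of draws:', round(draws.mean()))              # close to 250,000
print('share below top-code:', (draws < topcode).mean())  # roughly 0.6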

I'm starting to wonder if this is a contributing factor to the understatement of income tax revenue when using the cps.csv file as Tax-Calculator input.

@MattHJensen @andersonfrailey @hdoupe @codykallen @evtedeschi3

MattHJensen commented 6 years ago

@martinholmer said:

I'm starting to wonder if this is a contributing factor to the understatement of income tax revenue when using the cps.csv file as Tax-Calculator input.

That seems possible. Should a new issue be opened to describe the change that should be made to the file prep?

Amy-Xu commented 6 years ago

Martin said:

there is no reason to go through all this replicates business for our more recent CPS files.

I agree that it seems we should turn off this imputation for top-coding removal, given that the top coding method has been updated in more recent CPS ASEC files.

martinholmer commented 6 years ago

@MattHJensen said:

I'm starting to wonder if this is a contributing factor to the understatement of income tax revenue when using the cps.csv file as Tax-Calculator input.

That seems possible. Should a new issue be opened to describe the change that should be made to the file prep?

I'm not sure where, in the sequence of processing that turns the three raw CPS files into the cps.csv file, the 15 replicates replace a single top-coded observation. @andersonfrailey, where in the sequence of scripts you list in this README.md file are the top-coded replicates added? And, more to the point, what are your views on the thoughts first expressed by @Amy-Xu that the whole top-coded-replicates business is unnecessary for our three (post-2011) CPS files?

andersonfrailey commented 6 years ago

@martinholmer asked:

where in the sequence of scripts you list in this README.md file are the top-coded replicates added?

This occurs in the TopCodingV1.sas script, after tax units have been created and adjustments are being made to the final file.

And

what are your views on the thoughts first expressed by @Amy-Xu that the whole top-coded replicates business is unnecessary for our three (post-2011) CPS files?

I talked with John about this a few months ago. His opinion was that there wasn't a huge need to revise/remove the top coding scripts, though he hadn't run the numbers to verify this.

I'm in favor of at least comparing the results we would get from removing this part of the creation process. There are two problems preventing this from happening immediately, though. First, our SAS license has expired, and I just checked that the general AEI one has as well. Second, John hasn't sent me the script to recreate the weights file, and I haven't completed my work on a Python version, so we can't create a new weights file at this time.

martinholmer commented 6 years ago

@andersonfrailey said almost three weeks ago in Tax-Calculator issue #1658:

I talked with John about this a few months ago. His opinion was that there wasn't a huge need to revise/remove the top coding scripts, though he hadn't run the numbers to verify this.

I'm in favor of at least comparing the results we would get from removing this part of the creation process. There are two problems preventing this from happening immediately, though. First, our SAS license has expired, and I just checked that the general AEI one has as well. Second, John hasn't sent me the script to recreate the weights file, and I haven't completed my work on a Python version, so we can't create a new weights file at this time.

@andersonfrailey, Is this a formal issue in the taxdata repository? If not, should you create one? If so, should we close this issue (and have the taxdata issue point to it for the historical record)?

martinholmer commented 6 years ago

Following up on my question above about whether this should become a formal taxdata issue:

@andersonfrailey, Is the issue in Tax-Calculator #1658 covered in the Open-Source-Economics/taxdata#125 issue?

andersonfrailey commented 6 years ago

@martinholmer asked:

Is the issue in Tax-Calculator #1658 covered in the Open-Source-Economics/taxdata#125 issue?

In my opinion it is not. The two are related, and the issue in #1658 (removing/revamping our top-coding routine) is dependent on TaxData issue #125, but it will not be completely addressed in that issue. I would prefer that a separate issue be opened when we begin work on reviewing our top-coding routine.

martinholmer commented 6 years ago

Closing issue #1658 in favor of the open taxdata issue 174. Please continue the discussion of CPS top-coding there.