Another question about the new h_seq and ffpos variables in the cps.csv file:
Assuming that h_seq+ffpos is unique within each CPS sample, wouldn't a user expect there to be no more than three records with the same h_seq+ffpos value in the cps.csv file? Maybe I'm confused, but that is what I was expecting. So the following tabulation made me wonder about what's going on.
iMac:junk mrh$ sqlite3 cps-14-#-#.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select id,count(*)
from (select h_seq*1000+ffpos as id from dump)
group by id having count(*)>40
order by id;
6809001|45
8449001|45
14597001|46
16499001|45
23946001|46
24486001|47
35717001|46
37989001|45
41250001|45
57442001|60
77610001|45
93498001|46
95198001|45
97102001|45
97439001|47
sqlite> .q
iMac:junk mrh$
Why is it that there are 60 records in the cps.csv file that have h_seq=57442 and ffpos=1? Am I doing the tabulations wrong? Or is there a reason there can be sixty records with the same h_seq+ffpos value?
@MattHJensen @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun
Don't we know based on FLPDYR in the original unaged cps.csv which CPS sample each is from? That is, it's from the FLPDYR + 1 ASEC?
To aid with CPS merging, I created a simple database with just RECID, h_seq, ffpos, and a new variable called yearcps equal to FLPDYR + 1.
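For concreteness, one way to construct such a merge-key file in pandas (a sketch only, not necessarily how @evtedeschi3 built his database; the output filename is hypothetical and the sketch assumes RECID, FLPDYR, h_seq, and ffpos are all columns in the unaged cps.csv):

import pandas as pd

# build a small merge-key file: RECID, h_seq, ffpos, plus
# yearcps = FLPDYR + 1 (the ASEC survey year for each record)
cps = pd.read_csv('cps.csv', usecols=['RECID', 'FLPDYR', 'h_seq', 'ffpos'])
cps['yearcps'] = cps['FLPDYR'] + 1
cps[['RECID', 'h_seq', 'ffpos', 'yearcps']].to_csv('cps_merge_keys.csv', index=False)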
@evtedeschi3 said:
Don't we know based on FLPDYR in the original unaged cps.csv which CPS sample each is from? That is, it's from the FLPDYR + 1 ASEC?
Yes, good point. But any Tax-Calculator output file will have all the filing units having the same FLPDYR value, which is set to the tax analysis year. This is why you did the following with the cps.csv input file:
To aid with CPS merging, I created a simple database with just RECID, h_seq, ffpos, and a new variable called yearcps equal to FLPDYR + 1.
But when I combine FLPDYR with h_seq and ffpos, things are better but still not good enough to exact match other CPS variables to the cps.csv file. After converting the cps.csv file into an SQLite3 database, I get these results:
$ sqlite3 cps.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from data;
456465
sqlite> select id,count(*) from (select FLPDYR*1000000000+h_seq*1000+ffpos as id from data) group by id having count(*)>13 order by id limit 9;
2012000115001|15
2012000153001|16
2012000155001|15
2012000181001|15
2012000260001|15
2012000285001|15
2012000318001|15
2012000322001|15
2012000369001|15
Notice the limit 9; the number of FLPDYR/h_seq/ffpos values that are assigned to more than 13 filing units is far larger than 9.
Again, maybe I'm doing the calculations wrong, but it seems as if FLPDYR/h_seq/ffpos values are not unique.
Have you checked your simple database to see if the FLPDYR/h_seq/ffpos values are unique?
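For reference, a rough pandas equivalent of that uniqueness check (a sketch; it assumes only that cps.csv contains FLPDYR, h_seq, and ffpos columns):

import pandas as pd

# count filing units per FLPDYR/h_seq/ffpos combination and list the
# combinations shared by more than 13 filing units (cf. the sqlite3 query above)
cps = pd.read_csv('cps.csv', usecols=['FLPDYR', 'h_seq', 'ffpos'])
counts = cps.groupby(['FLPDYR', 'h_seq', 'ffpos']).size()
print(counts[counts > 13].sort_index().head(9))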
Thanks for pointing this out @martinholmer. @evtedeschi3 is correct that you would also need to account for the year of the CPS in the tabulations, which it appears you do in your most recent comment.
For my own clarification, when you converted cps.csv to an SQLite database and produced the numbers above, that is still the input file, correct?
It's possible that some families may have been split into multiple tax units (think dependent filers particularly), in which case there would be multiple tax units with identical h_seq, ffpos, and FLPDYR, though I wouldn't expect it to happen as frequently as it appears to based on your tabulations.
I will also read through the CPS documentation to see if there are any identification variables which would be better for what we're trying to use h_seq and ffpos for.
Just to add one quick thought -- imputations to remove top-coding could be a source for h_seq + ffpos duplicates.
One thought: Are you trying to merge individual ASEC variables? Remember, there will be several individuals in each year/h_seq/ffpos combination.
Also, shifting to the cps.csv file, it is absolutely the case that in some instances it splits a family into two or more tax units.
To get true unique matching at the individual level, you would need to add a variable for the PULINENO of the head, and another one for the spouse.
@andersonfrailey, Here is how I converted cps.csv into cps.db:
import sqlite3
import pandas as pd

CSV_INPUT_FILENAME = 'cps.csv'
SQLDB_OUT_FILENAME = 'cps.db'

# read the cps.csv file into a pandas DataFrame and write it out as an
# SQLite3 database containing a single table named 'data'
dframe = pd.read_csv(CSV_INPUT_FILENAME)
dbcon = sqlite3.connect(SQLDB_OUT_FILENAME)
dframe.to_sql('data', dbcon, if_exists='replace', index=False)
dbcon.close()
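For what it's worth, the same kind of tabulation can then be run against the resulting database from Python rather than the sqlite3 shell; a minimal sketch (table name 'data' as created above):

import sqlite3

dbcon = sqlite3.connect('cps.db')
query = ('SELECT FLPDYR, h_seq, ffpos, COUNT(*) '
         'FROM data GROUP BY FLPDYR, h_seq, ffpos '
         'HAVING COUNT(*) > 13 ORDER BY FLPDYR, h_seq, ffpos')
for row in dbcon.execute(query):
    print(row)
dbcon.close()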
Thanks for all the comments on #1658. I can see that, for several reasons, CPS families are split into separate tax filing units. But consider the following tabulation and my question below the tabulation results:
$ sqlite3 cps.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from data;
456465
sqlite> select id,count(*) from (select FLPDYR*1000000000+h_seq*1000+ffpos as id from data) group by id having count(*)>30 order by id;
2012054455001|31
2012088531001|31
2013001435001|31
2013030759001|31
2013053994001|31
2013084195001|31
2013095063001|31
2014020301001|31
2014024797001|31
2014025287001|32
2014028517001|31
2014034002001|31
2014036978001|32
2014044160001|32
2014057442001|45
2014060481001|31
2014072395001|33
2014080568001|31
2014089284001|32
sqlite> .q
$
What kind of family is split into more than thirty filing units?
@MattHJensen @Amy-Xu @andersonfrailey @hdoupe @evtedeschi3
So I've highlighted all 30 of the tax units created from one such family (this is extrapolated to 2026).
That's... a lot of very rich young people.
@evtedeschi3, Thanks for the extra information in issue #1658. The thirty filing units you highlighted are all from the March 2013 CPS record with h_seq=14597 and ffpos=1 as can be seen in the following tabulation:
$ sqlite3 cps.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from data;
456465
sqlite> select count(*) from data where RECID>=317095 and RECID<=317124;
30
sqlite> select RECID,FLPDYR,h_seq,ffpos from data where RECID>=317095 and RECID<=317124;
317095|2012|14597|1
317096|2012|14597|1
317097|2012|14597|1
317098|2012|14597|1
317099|2012|14597|1
317100|2012|14597|1
317101|2012|14597|1
317102|2012|14597|1
317103|2012|14597|1
317104|2012|14597|1
317105|2012|14597|1
317106|2012|14597|1
317107|2012|14597|1
317108|2012|14597|1
317109|2012|14597|1
317110|2012|14597|1
317111|2012|14597|1
317112|2012|14597|1
317113|2012|14597|1
317114|2012|14597|1
317115|2012|14597|1
317116|2012|14597|1
317117|2012|14597|1
317118|2012|14597|1
317119|2012|14597|1
317120|2012|14597|1
317121|2012|14597|1
317122|2012|14597|1
317123|2012|14597|1
317124|2012|14597|1
sqlite> .q
$
The characteristics of this household, as you point out, are very unusual. First of all, there are 45 people (some single and some married) living at the same address, and they are all considered by Census to be in the same family. And, as you point out, most have very high incomes. So, is this a mistake in the preparation of the cps.csv file, or has Census actually sampled a rich commune? Or maybe there is a different explanation.
@MattHJensen @Amy-Xu @andersonfrailey @hdoupe
It’s particularly strange because the CPS by design doesn’t sample group quarters. So it’s not e.g. a dormitory or a barracks.
This might relate to the top-coding imputation I mentioned earlier. The algorithm is documented like so:
This top-coding imputation is implemented for records with high income.
Source: http://www.quantria.com/assets/img/TechnicalDocumentationV4-2.pdf
@Amy-Xu beat me to it. We also have that noted in our documentation for the CPS file.
The one change between what we do and what she posted is that we repeat the process 15 times rather than 10.
So that makes perfect sense, but that still means we have families of 30+ people who, evidently in this case, all have top-coded incomes! (The swap cutoff for the ASEC in 2013 was a personal wage of $250,000.)
@evtedeschi3 Why do you think we have families of 30+ people? Looking at this post by Martin and the 15 highlighted by Anderson, I'm guessing most likely we have a dozen high-income families of 3-4 people.
@andersonfrailey pointed to the taxdata documentation of the cps.csv file, which says this:
Top-Coding
Certain income items in the CPS are top-coded to protect the identity of the individuals surveyed. Each tax unit with an individual who has been top-coded is flagged for further processing to account for this.
For the records that have been flagged, the top-coded value is replaced with a random variable whose mean is the top-coded amount. This process is repeated 15 times per flagged record, resulting in the record being represented 15 times in the production file with a different value for the top-coded fields in each and their weight divided by 15.
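In code terms, the replication scheme described in that passage amounts to something like the following (a sketch only; the draw_with_mean function and the 'weight' key are placeholders, and the actual taxdata implementation lives in its SAS scripts):

import numpy as np

def replicate_topcoded(record, topcoded_fields, draw_with_mean, n_reps=15, seed=0):
    # each flagged record becomes n_reps copies, each with a fresh random
    # draw (whose mean is the top-coded amount) for every flagged field,
    # and with the record's weight divided by n_reps
    rng = np.random.default_rng(seed)
    replicates = []
    for _ in range(n_reps):
        rep = dict(record)
        for field in topcoded_fields:
            rep[field] = draw_with_mean(record[field], rng)
        rep['weight'] = record['weight'] / n_reps
        replicates.append(rep)
    return replicates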
Thanks for this helpful information. I still have two questions:
(1) The above documentation explains why there are duplicate cpsyear+h_seq+ffpos values in multiples of 15, 30 and 45. What's the story for the many cps.csv filing units like these?
sqlite> select id,count(*) from (select FLPDYR*1000000000+h_seq*1000+ffpos as id from data) group by id having (count(*)>1 and count(*)!=15 and count(*)!=30 and count(*)!=45) order by id limit 9;
2012000005001|2
2012000010001|2
2012000017001|2
2012000021001|3
2012000031001|2
2012000059001|3
2012000067001|2
2012000077001|3
2012000079001|2
There's no top-coding, so what explains the fact that several filing units have the same cpsyear+h_seq+ffpos value?
(2) This top-coding documentation raises a completely different issue. A variable being top-coded means the actual value of that variable is larger than the top-coded amount, right? If so, then when the top-coded value is replaced with a random variable whose mean is the top-coded amount, many of the randomly assigned values will be less than the top-coded amount. That is an enormous bias, especially when the reform everybody is interested in right now contains major changes in the tax treatment of high-income filers.
@MattHJensen @Amy-Xu @hdoupe @codykallen @evtedeschi3
@martinholmer wrote:
(2) This top-coding documentation raises a completely different issue. A variable being top-coded means the actual value of that variable is larger than the top-coded amount, right? If so, then when the top-coded value is replaced with a random variable whose mean is the top-coded amount, many of the randomly assigned values will be less than the top-coded amount. That is an enormous bias, especially when the reform everybody is interested in right now contains major changes in the tax treatment of high-income filers.
My impression is that the ASEC has not been strictly "top-coded" post-2011. Rather, the top-codes really act more as thresholds. Above them, values are swapped with other super-top-coded values to protect identities.
IPUMS has a very useful description of how this changed over time:
Starting in 2011, the Census Bureau shifted from the average replacement value system to a rank proximity swapping procedure. In this technique, all values greater than or equal to the income topcode are ranked from lowest to highest and systematically swapped with other values within a bounded interval. All swapped values are also rounded to two significant digits.
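A toy illustration of what rank proximity swapping does (a sketch only; the Census Bureau's actual bounds and rounding rules are more involved):

import numpy as np

def rank_proximity_swap(values, topcode, max_rank_distance=5, seed=0):
    # rank all values at or above the topcode, give each one the value of
    # another high value within a bounded rank interval (a simplification of
    # a true pairwise swap), then round those values to two significant digits
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    high = np.where(values >= topcode)[0]
    order = high[np.argsort(values[high])]  # high-value indices sorted by value
    swapped = values.copy()
    for rank, i in enumerate(order):
        offset = rng.integers(-max_rank_distance, max_rank_distance + 1)
        j = order[int(np.clip(rank + offset, 0, len(order) - 1))]
        swapped[i] = values[j]
    digits = np.floor(np.log10(swapped[order])).astype(int)
    swapped[order] = np.round(swapped[order] / 10.0**(digits - 1)) * 10.0**(digits - 1)
    return swapped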
Thanks, @Amy-Xu, for citing John's documentation of the CPS tax file. The discussion before the passage you quoted describes the Census CPS top-coding procedure, as @evtedeschi3 did in his helpful comment. That information has cleared up several misconceptions I had about how Census handles high income amounts.
But I have a couple of remaining questions about how the cps.csv file handles top-coded amounts.
Here is what John says:
Top coding in this way [by Census] means that aggregate control totals derived from the CPS are meaningful in that they represent an estimate of the population total. However, in order to make the data suitable for tax analysis, we remove the top coding for each record and create “replicate” tax units for each top coded record according to the following algorithm:
Question (1): Exactly why do we need to replace the top-coded amount with 15 replicates? Exactly why are the Census top-coded amounts not "suitable for tax analysis"?
John continues:
- We search each of the income fields in the tax unit and identify those amounts that have been top coded and the tax unit is “flagged” for subsequent processing to remove the top coding.
- We replace each top coded amount with a lognormal random variable whose mean is the top coded amount and whose variance is initially set to equal the standard error of the estimate (SEE) of the regression equation. We then multiply the SEE by a factor to more closely match published SOI totals of the income distribution at the upper end of the income scale.
Question (2): What "regression equation" is being referred to in the second item? I looked but didn't find the one being referred to here. Did I miss it?
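As a side note on the second bullet: drawing a lognormal variable with a specified arithmetic mean and standard deviation requires back-solving the parameters of the underlying normal. A sketch of that step (the sd value below is purely illustrative):

import numpy as np

def lognormal_with_mean_sd(mean, sd, size, rng):
    # choose mu and sigma of the underlying normal so that the lognormal
    # draws have arithmetic mean `mean` and standard deviation `sd`
    sigma2 = np.log(1.0 + (sd / mean) ** 2)
    mu = np.log(mean) - 0.5 * sigma2
    return rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=size)

rng = np.random.default_rng(0)
draws = lognormal_with_mean_sd(mean=250_000, sd=100_000, size=15, rng=rng)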
@MattHJensen @andersonfrailey
Now that I understand that 15 (or 30 or 45) replicates have been used to replace a single CPS family with high income, the range of my questions is much narrower.
What's the story behind the many cps.csv filing units like these?
sqlite> select id,count(*)
from (select FLPDYR*1000000000+h_seq*1000+ffpos as id from data)
group by id having (count(*)>1 and count(*)<15) order by id
limit 9;
2012000005001|2
2012000010001|2
2012000017001|2
2012000021001|3
2012000031001|2
2012000059001|3
2012000067001|2
2012000077001|3
2012000079001|2
There's no top-coding (because there are only two or three duplicates among the first nine of many), so what explains the fact that these cps.csv filing units have the same cpsyear+h_seq+ffpos value?
@martinholmer my intuition is that these 2-3 replicates are from families with more than one filing unit, but not necessarily high-income. Say a 20-year-old daughter has a part-time job and might just file a separate return from her parents, while all three of them are just 'normal'-income people.
@martinholmer also asked:
Question (2): What "regression equation" is being referred to in the second item? I looked but didn't find the one being referred to here. Did I miss it?
My memory is a bit rusty, but I roughly remember that those averages are predictions from regressions of income on gender, race and work experience. If you take a look at the 2010 doc chart #2, it gives the predictions by cell. And the standard error most likely refers to the error of predictions.
If that's true, then it seems there's another problem we have here. Census only used this top-coding method for the CPS ASEC prior to 2011, which has nothing to do with our current file that includes the 2013-2015 CPS. The old top-coding method isn't good for tax analysis (I guess) because distributions are collapsed into averages. Restoring the distribution was a doable and sensible option. But now the top coding has been revised to a swapping method. I don't know whether restoring the distribution through the old method still makes sense.
@Amy-Xu said in issue #1658:
my intuition is that these 2-3 replicates are from families with more than one filing unit, but not necessarily high-income. Say a 20-year-old daughter has a part-time job and might just file a separate return from her parents, while all three of them are just 'normal'-income people.
Thanks for the explanation, which seems perfectly sensible.
@Amy-Xu said in issue #1658:
My memory is a bit rusty, but I roughly remember that those averages are predictions from regressions of income on gender, race and work experience. If you take a look at 2010 doc Chart 2, it gives the predictions by cell. And the standard error most likely refers to the error of predictions.
Chart 2 simply shows the average value above the top-coding threshold. There are no regression results in either Chart 1 or Chart 2, as you can see from this reproduction of those tables:
@Amy-Xu said in issue #1658:
[I]t seems there's another problem we have here. Census only used this top-coding method for the CPS ASEC prior to 2011, which has nothing to do with our current file that includes the 2013-2015 CPS. The old top-coding method isn't good for tax analysis (I guess) because distributions [above the top-coding thresholds] are collapsed into averages. Restoring the distribution was a doable and sensible option. But now the top coding has been revised to a swapping method. I don't know whether restoring the distribution through the old method still makes sense.
That's why I asked why all the replicates were being generated. Unless I'm missing something, there is no reason to go through all this replicates business for our more recent CPS files. And it's worse than just being inefficient, it's wrong, because the regression imputation scheme used to construct the cps.csv file can often replace a top-coded value, which is, by definition, above the top-coding threshold, with a value that is below the top-coding threshold.
I'm starting to wonder if this is a contributing factor to the understatement of income tax revenue when using the cps.csv file as Tax-Calculator input.
@MattHJensen @andersonfrailey @hdoupe @codykallen @evtedeschi3
@martinholmer said:
I'm starting to wonder if this is a contributing factor to the understatement of income tax revenue when using the cps.csv file as Tax-Calculator input.
That seems possible. Should a new issue be opened to describe the change that should be made to the file prep?
Martin said:
there is no reason to go through all this replicates business for our more recent CPS files.
I agree that it seems we should turn off this imputation for top-coding removal, given that the top coding method has been updated in more recent CPS ASEC files.
@MattHJensen said:
I'm starting to wonder if this is a contributing factor to the understatement of income tax revenue when using the cps.csv file as Tax-Calculator input.
That seems possible. Should a new issue be opened to describe the change that should be made to the file prep?
I'm not sure where, in the sequence of processing the three raw CPS files into the cps.csv file, the 15 replicates replace a single top-coded observation. @andersonfrailey, where in the sequence of scripts you list in this README.md file are the top-coded replicates added? And more to the point, what are your views on the thoughts first expressed by @Amy-Xu that the whole top-coded replicates business is unnecessary for our three (post-2011) CPS files?
@martinholmer asked:
where in the sequence of scripts you list in this README.md file are the top-coded replicates added?
This occurs in the TopCodingV1.sas scripts. This is after tax units have been created and adjustments are being made to the final file.
And
what are your views on the thoughts first expressed by @Amy-Xu that the whole top-coded replicates business is unnecessary for our three (post-2011) CPS files?
I talked with John about this a few months ago. His opinion was that there wasn't a huge need to revise/remove the top coding scripts, though he hadn't run the numbers to verify this.
I'm in favor of at least comparing the results that we would get from removing this part of the creation process. There are two problems preventing this from happening immediately though. First, our SAS license has expired and I just checked and the general AEI one has as well. Second, John hasn't sent me the script to recreate the weights file yet and I haven't completed my work on a python version yet so we can't create a new weights file at this time.
@andersonfrailey said almost three weeks ago in Tax-Calculator issue #1658:
I talked with John about this a few months ago. His opinion was that there wasn't a huge need to revise/remove the top coding scripts, though he hadn't run the numbers to verify this.
I'm in favor of at least comparing the results that we would get from removing this part of the creation process. There are two problems preventing this from happening immediately though. First, our SAS license has expired and I just checked and the general AEI one has as well. Second, John hasn't sent me the script to recreate the weights file yet and I haven't completed my work on a python version yet so we can't create a new weights file at this time.
@andersonfrailey, Is this a formal issue in the taxdata repository? If not, should you create one? If so, should we close this issue (and have the taxdata issue point to it for the historical record)?
@martinholmer asked:
@andersonfrailey, Is this a formal issue in the taxdata repository? If not, should you create one? If so, should we close this issue (and have the taxdata issue point to it for the historical record)?
@andersonfrailey, Is the issue in Tax-Calculator #1658 covered in the Open-Source-Economics/taxdata#125 issue?
@martinholmer asked:
Is the issue in Tax-Calculator #1658 covered in the Open-Source-Economics/taxdata#125 issue?
In my opinion it is not. The two are related, and the issue in #1658 (removing/revamping our top-coding routine) is dependent on TaxData issue #125, but it will not be completely addressed in that issue. I would prefer that a separate issue be opened when we begin to work on reviewing our top-coding routine.
Closing issue #1658 in favor of open taxdata issue 174. Please continue the discussion of CPS top-coding in 174.
Merged pull request #1635 added two new variables to the cps.csv file. Here is what was said in the taxdata repo about these two new variables:
Combining the h_seq and ffpos values for each family will produce a unique identifier within a CPS sample. However, the cps.csv sample contains records from three different CPS samples (for different years). So, without knowing which of the three CPS samples each record is from, it is impossible to match a record in the cps.csv file with its complete information from a Census CPS file.
Wasn't the whole idea behind adding these variables to allow users to get extra information from the Census CPS file to exact match with the records in the cps.csv file? If so, then it would seem as if they cannot do that.
The following tabulations illustrate the problem:
So, for example, the three records with h_seq=2 and ffpos=1 have unique RECID values, but that doesn't help a user figure out which Census CPS file those three cps.csv records were drawn from.
@MattHJensen @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun