Open ernietedeschi opened 6 years ago
Thanks for pointing this out, @evtedeschi3. The deviations are not intentional. My intuition says this is probably caused by the scripts misidentifying the spouse when creating the record and assigning the wrong age. I'll look into this more and see if I can find the problem.
@andersonfrailey, I've checked the CPS spouse_age problems that @evtedeschi3 first identified in #225 using the newest CPS data. There are still problems. First I show my tabulations (by MARS) of the unzipped cps.csv.gz file from Tax-Calculator release 0.20.2, and then I offer some observations.
iMac:Tax-Calculator mrh$ ./csv_vars.sh cps.csv | grep -e age -e MARS
1 age_head
2 age_spouse
42 MARS
iMac:Tax-Calculator mrh$ awk -F, 'NR>1{t++;n[$42]++}END{for(i in n)print i,n[i];print t}' cps.csv
2 252988
4 22006
1 181471
456465
iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42!=2{n[$2]++;t++}END{for(i in n)print i,n[i];print t}' cps.csv
0 203477
203477
iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42==2{n[$2]++}END{for(i in n)print i,n[i]}' cps.csv | awk '{printf("%02d\t%d\n",$1,$2)}' | sort | head -20
00 8455
01 150
02 168
03 221
04 183
05 184
06 121
07 162
08 138
09 126
10 173
11 135
12 199
13 113
14 174
15 180
16 150
17 188
18 221
19 312
iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42==2{n[$2]++}END{for(i in n)print i,n[i]}' cps.csv | awk '{printf("%02d\t%d\n",$1,$2)}' | sort | tail -20
62 5196
63 4785
64 4482
65 5288
66 4782
67 4427
68 3638
69 3050
70 3202
71 3347
72 3045
73 2351
74 1815
75 1848
76 1704
77 1506
78 1362
79 1304
80 3975
85 2616
So, we can see that spouse_age is zero in all the filing units that are not MARS==2 (married filing jointly), which is as it should be. So, everything is good so far. But when we tabulate the distribution of spouse_age for those with MARS==2, we see sensible counts for older ages, but not so sensible counts for younger ages. In particular, there are 8455 filing units with MARS==2 and spouse_age==0. And then there are more than a few filing units that have implausibly low values for spouse_age. By comparison, the lowest spouse_age value in the puf.csv data file for filing units with MARS==2 is 15 years old.
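The same checks can be sketched in pandas, which may be convenient once the CPS scripts are in Python. This is only an illustration, not code from Tax-Calculator or taxdata: the function name and the toy data are made up, the column is named age_spouse as in the csv_vars.sh listing above (the discussion calls it spouse_age), and the cutoff of 15 is an assumption borrowed from the lowest puf.csv value mentioned above.

```python
import pandas as pd

# Assumption: treat spouse ages below the lowest MARS==2 value seen in puf.csv as suspect.
MIN_PLAUSIBLE_SPOUSE_AGE = 15


def tabulate_spouse_age(df: pd.DataFrame) -> dict:
    """Summarize age_spouse problems by filing status (hypothetical helper)."""
    joint = df[df["MARS"] == 2]      # married filing jointly
    nonjoint = df[df["MARS"] != 2]   # everyone else should have age_spouse == 0
    ages = joint["age_spouse"]
    return {
        "units_by_mars": df["MARS"].value_counts().sort_index().to_dict(),
        "nonjoint_nonzero": int((nonjoint["age_spouse"] != 0).sum()),
        "joint_zero": int((ages == 0).sum()),
        "joint_implausibly_low": int(
            ((ages > 0) & (ages < MIN_PLAUSIBLE_SPOUSE_AGE)).sum()
        ),
    }


# Toy data standing in for cps.csv; real use would be pd.read_csv("cps.csv").
toy = pd.DataFrame(
    {"MARS": [1, 2, 2, 2, 4, 2], "age_spouse": [0, 0, 3, 45, 0, 15]}
)
summary = tabulate_spouse_age(toy)
print(summary)
```

On the real file, the joint_zero entry would correspond to the 8455 count shown above, and joint_implausibly_low would count the rows for ages 1 through 14.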
Now that you've successfully completed all the recent enhancements to the taxdata repo, it seems like fixing this CPS spouse_age problem should have a high priority. In particular, I think this CPS spouse_age problem needs to be fixed before we consider moving to a more recent CBO projection (as proposed in #180).
What are your thoughts on taxdata development? Are there any other things that need to be fixed?
@martinholmer the next biggest step in taxdata development from my perspective is replacing the SAS code that makes the CPS file with Python. I put that on hold the last couple of weeks to work on PUF development, but I've reached a point where I have almost everything written and have moved on to squashing bugs that result in major differences between the current CPS and what I get from the Python scripts.
I think it will be easier to solve this problem with spouse age when we have everything running in Python. I'd say it will be at least two or three more weeks before I'm ready to open a pull request though. Would you say this issue is high enough priority to try and fix it in the SAS code?
I assume the priority here is to get the code in Python first before worrying about integrating the 2016 and 2017 ASEC releases?
I've been correcting the spousal errors by just importing the age recorded in the CPS ASECs for those records using the household and family identifiers. That seems to work fine for the moment.
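A minimal sketch of that kind of workaround in pandas follows. It is not the actual correction code: it assumes the unit file carries the household and family identifiers (h_seq, ffpos) plus a hypothetical spouse_lineno column holding the spouse's person line number, and that the ASEC person file has a_lineno and a_age columns.

```python
import pandas as pd


def patch_spouse_age(units: pd.DataFrame, persons: pd.DataFrame) -> pd.DataFrame:
    """Overwrite age_spouse with the ASEC age for matched MARS==2 units.

    Column names other than MARS and age_spouse are assumptions.
    """
    # Build a spouse lookup whose key columns line up with the unit file.
    spouses = persons.rename(
        columns={"a_lineno": "spouse_lineno", "a_age": "asec_spouse_age"}
    )[["h_seq", "ffpos", "spouse_lineno", "asec_spouse_age"]]
    out = units.merge(
        spouses, on=["h_seq", "ffpos", "spouse_lineno"], how="left", validate="m:1"
    )
    # Only touch joint filers for which an ASEC spouse record was found.
    use_asec = (out["MARS"] == 2) & out["asec_spouse_age"].notna()
    out["age_spouse"] = (
        out["age_spouse"].where(~use_asec, out["asec_spouse_age"]).astype(int)
    )
    return out.drop(columns="asec_spouse_age")


# Toy data: the first unit has an implausible spouse age of 3.
persons = pd.DataFrame(
    {
        "h_seq": [1, 1, 2, 2],
        "ffpos": [1, 1, 1, 1],
        "a_lineno": [1, 2, 1, 2],
        "a_age": [40, 38, 50, 47],
    }
)
units = pd.DataFrame(
    {
        "RECID": [1, 2, 3],
        "h_seq": [1, 2, 3],
        "ffpos": [1, 1, 1],
        "spouse_lineno": [2, 2, 0],
        "MARS": [2, 2, 1],
        "age_spouse": [3, 47, 0],
    }
)
patched = patch_spouse_age(units, persons)
print(patched[["RECID", "MARS", "age_spouse"]])
```

Units without a matching ASEC record, and units that are not MARS==2, keep their existing age_spouse value.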
@andersonfrailey said:
I think it will be easier to solve this problem with spouse age [in the cps.csv.gz file] when we have everything running in Python. I'd say it will be at least two or three more weeks before I'm ready to open a pull request though. Would you say this issue is high enough priority to try and fix it in the SAS code?
No. And @evtedeschi3 seems to agree, which is far more important. So, let's wait for the Python CPS-creation code to be active and then solve the spouse_age problem identified in issue #225.
I'm matching raw CPS microdata from 2013-15 to the tc cps.csv file using the h_seq, ffpos, and a_lineno variables. I do a second round of matching off of the a_spouse variable in the CPS microdata to bring the same variables of interest into the tc cps.csv spousal records.
As a validation cross-check, I compared the age of each record in the CPS microdata to that in the tc cps.csv.
For tax unit heads, the age match between the CPS microdata and the tc cps.csv is 100%.
For spouses, however, the match is only 97.8%.
In looking at the underlying misses, the cps.csv file has some implausible values for spouse age.
For example, the spouse for RECID 292373 has an age of 3. In RECID 294599, the age is 7.
Some other cps.csv spouses have ages of 0 despite being present in the matched CPS record and the unit being correctly coded as MARS = 2 in the cps.csv. RECID 292658 is an example of this.
Most of the misses are plausible values in their own right but still differ from the matched CPS record, sometimes significantly.
Are these deviations intentional?