PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.
http://pslmodels.github.io/taxdata/

Spousal age in the CPS file #225

Open ernietedeschi opened 6 years ago

ernietedeschi commented 6 years ago

I'm matching raw CPS microdata from 2013-15 to the tc cps.csv file using the h_seq, ffpos, and a_lineno variables. I do a second round of matching off of the a_spouse variable in the CPS microdata to bring the same variables of interest into the tc cps.csv spousal records.
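For anyone following along, the two-round match described above could be sketched roughly like this. It is only a minimal illustration on made-up rows, not the actual matching script. It assumes that a_spouse holds the spouse's line number within the same household and family (0 when there is no spouse) and that age sits in an a_age field; the helper names index_persons and match_unit are hypothetical.

```python
def index_persons(cps_rows):
    """Index raw CPS person rows by (h_seq, ffpos, a_lineno)."""
    return {(r["h_seq"], r["ffpos"], r["a_lineno"]): r for r in cps_rows}


def match_unit(unit, persons):
    """Return (head_row, spouse_row) for one tax-unit record.

    Round 1 finds the head by the three identifiers; round 2 follows the
    head's a_spouse pointer (assumed to be a line number, 0 = no spouse).
    Either element is None when no match exists.
    """
    head = persons.get((unit["h_seq"], unit["ffpos"], unit["a_lineno"]))
    spouse = None
    if head is not None and head["a_spouse"] > 0:
        spouse = persons.get((unit["h_seq"], unit["ffpos"], head["a_spouse"]))
    return head, spouse


# Toy person rows, made up for illustration only
cps_rows = [
    {"h_seq": 1, "ffpos": 1, "a_lineno": 1, "a_spouse": 2, "a_age": 45},
    {"h_seq": 1, "ffpos": 1, "a_lineno": 2, "a_spouse": 1, "a_age": 43},
    {"h_seq": 2, "ffpos": 1, "a_lineno": 1, "a_spouse": 0, "a_age": 30},
]
persons = index_persons(cps_rows)
head, spouse = match_unit({"h_seq": 1, "ffpos": 1, "a_lineno": 1}, persons)
single_head, single_spouse = match_unit(
    {"h_seq": 2, "ffpos": 1, "a_lineno": 1}, persons
)
```

Keeping the spouse lookup as a second dictionary access means an unmatched spouse shows up as None instead of silently inheriting another person's values.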

As a validation cross-check, I compared the age of each record in the CPS microdata to that in the tc cps.csv.

For tax unit heads, the age match between the CPS microdata and the tc cps.csv is 100%.

For spouses, however, the match is only 97.8%.

In looking at the underlying misses, the cps.csv file has some implausible values for spouse age.

For example, the spouse for RECID 292373 has an age of 3. In RECID 294599, the age is 7.

Some other cps.csv spouses have ages of 0 despite being present in the matched CPS record and the unit being correctly coded as MARS = 2 in the cps.csv. RECID 292658 is an example of this.

Most of the misses are plausible values in their own right but still differ from the matched CPS record, sometimes significantly.
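To make the kinds of misses concrete, here is a rough sketch of how the spouse comparisons could be bucketed. The pairs are made-up values, not the real records, and the cutoff of 15 for an "implausibly low" age is an assumption chosen for illustration; classify and MIN_PLAUSIBLE_SPOUSE_AGE are hypothetical names.

```python
from collections import Counter

MIN_PLAUSIBLE_SPOUSE_AGE = 15  # assumed cutoff, for illustration only


def classify(tc_age, cps_age):
    """Label one spouse-age comparison between cps.csv and the raw CPS."""
    if tc_age == cps_age:
        return "match"
    if tc_age == 0:
        return "zero"
    if tc_age < MIN_PLAUSIBLE_SPOUSE_AGE:
        return "implausibly_low"
    return "plausible_but_different"


# (cps.csv spouse age, raw CPS spouse age) pairs; made-up values
pairs = [(43, 43), (3, 41), (0, 52), (60, 60), (38, 47), (7, 35)]
counts = Counter(classify(tc, cps) for tc, cps in pairs)
match_rate = counts["match"] / len(pairs)
```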

Are these deviations intentional?

andersonfrailey commented 6 years ago

Thanks for pointing this out, @evtedeschi3. The deviations are not intentional. My intuition says this is probably caused by the scripts misidentifying the spouse when creating the record and assigning the wrong age. I'll look into this more and see if I can find the problem.

martinholmer commented 6 years ago

@andersonfrailey, I've checked the CPS spouse_age problems that @evtedeschi3 first identified in #225 using the newest CPS data. There are still problems. First I show my tabulations (by MARS) of the unzipped cps.csv.gz file from Tax-Calculator release 0.20.2, and then I offer some observations.

iMac:Tax-Calculator mrh$ ./csv_vars.sh cps.csv | grep -e age -e MARS
1 age_head
2 age_spouse
42 MARS

iMac:Tax-Calculator mrh$ awk -F, 'NR>1{t++;n[$42]++}END{for(i in n)print i,n[i];print t}' cps.csv
2 252988
4 22006
1 181471
456465

iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42!=2{n[$2]++;t++}END{for(i in n)print i,n[i];print t}' cps.csv
0 203477
203477

iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42==2{n[$2]++}END{for(i in n)print i,n[i]}' cps.csv | awk '{printf("%02d\t%d\n",$1,$2)}' | sort | head -20
00  8455
01  150
02  168
03  221
04  183
05  184
06  121
07  162
08  138
09  126
10  173
11  135
12  199
13  113
14  174
15  180
16  150
17  188
18  221
19  312

iMac:Tax-Calculator mrh$ awk -F, 'NR>1&&$42==2{n[$2]++}END{for(i in n)print i,n[i]}' cps.csv | awk '{printf("%02d\t%d\n",$1,$2)}' | sort | tail -20
62  5196
63  4785
64  4482
65  5288
66  4782
67  4427
68  3638
69  3050
70  3202
71  3347
72  3045
73  2351
74  1815
75  1848
76  1704
77  1506
78  1362
79  1304
80  3975
85  2616

So, we can see that spouse_age is zero in all the filing units that are not MARS==2 (married filing jointly), which is as it should be. So, everything is good so far. But when we tabulate the distribution of spouse_age for those with MARS==2, we see sensible counts for older ages, but not so sensible counts for younger ages. In particular, there are 8455 filing units with MARS==2 and spouse_age==0. And then there are more than a few filing units that have implausibly low values for spouse_age. For comparison, the lowest spouse_age value in the puf.csv data file for filing units with MARS==2 is 15 years old.
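The same tabulation could also be done in Python by column name instead of awk field position, which avoids depending on MARS being column 42. This is only a sketch run against a tiny made-up stand-in for cps.csv; the function name tabulate_spouse_age is hypothetical, the column names are the ones listed by csv_vars.sh above, and the cutoff of 15 simply echoes the puf.csv minimum mentioned above.

```python
import csv
import io
from collections import Counter


def tabulate_spouse_age(csv_file):
    """Count age_spouse values separately for MARS == 2 and other units."""
    joint, other = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        target = joint if int(row["MARS"]) == 2 else other
        target[int(row["age_spouse"])] += 1
    return joint, other


# Toy stand-in for cps.csv (values are made up)
sample = io.StringIO(
    "age_head,age_spouse,MARS\n"
    "45,43,2\n"
    "50,0,2\n"
    "33,3,2\n"
    "28,0,1\n"
    "61,0,4\n"
)
joint, other = tabulate_spouse_age(sample)

# Joint filers whose spouse age is below the assumed cutoff of 15
suspect = sum(n for age, n in joint.items() if age < 15)
```

With the real file, the StringIO object would be replaced by something like open("cps.csv", newline="").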

Now that you've successfully completed all the recent enhancements to the taxdata repo, it seems like fixing this CPS spouse_age problem should have a high priority. In particular, I think this CPS spouse_age problem needs to be fixed before we consider moving to a more recent CBO projection (as proposed in #180).

What are your thoughts on taxdata development? Are there any other things that need to be fixed?

andersonfrailey commented 6 years ago

@martinholmer the next biggest step in taxdata development from my perspective is replacing the SAS code to make the CPS file with Python. I put that on hold the last couple of weeks to work on PUF development, but I've reached a point where I have almost everything written and have moved on to squishing bugs that result in major differences between the current CPS and what I get from the Python scripts.

I think it will be easier to solve this problem with spouse age when we have everything running in Python. I'd say it will be at least two or three more weeks before I'm ready to open a pull request though. Would you say this issue is high enough priority to try and fix it in the SAS code?

ernietedeschi commented 6 years ago

I assume the priority here is to get the code in Python first before worrying about integrating the 2016 and 2017 ASEC releases?

I've been correcting the spousal errors by just importing the age recorded in the CPS ASECs for those records using the household and family identifiers. That seems to work fine for the moment.
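In case it helps others, the workaround amounts to something like the sketch below. It is a simplified illustration on made-up records, not my actual code: it assumes each cps.csv record carries h_seq and ffpos, that the spouse age from the CPS ASEC has already been collected into a dictionary keyed on those two identifiers, and the name patch_spouse_ages is hypothetical.

```python
def patch_spouse_ages(units, cps_spouse_age):
    """Return copies of units with age_spouse replaced by the raw CPS age.

    cps_spouse_age maps (h_seq, ffpos) to the spouse's age in the CPS ASEC.
    Only MARS == 2 units with a matching key are changed.
    """
    patched = []
    for unit in units:
        unit = dict(unit)  # copy so the input records are left untouched
        key = (unit["h_seq"], unit["ffpos"])
        if unit["MARS"] == 2 and key in cps_spouse_age:
            unit["age_spouse"] = cps_spouse_age[key]
        patched.append(unit)
    return patched


# Made-up records for illustration only
units = [
    {"RECID": 1, "h_seq": 10, "ffpos": 1, "MARS": 2, "age_spouse": 3},
    {"RECID": 2, "h_seq": 11, "ffpos": 1, "MARS": 1, "age_spouse": 0},
    {"RECID": 3, "h_seq": 12, "ffpos": 1, "MARS": 2, "age_spouse": 0},
]
cps_spouse_age = {(10, 1): 41, (11, 1): 29}
fixed = patch_spouse_ages(units, cps_spouse_age)
```

Joint units without a matching CPS key are left as they are, so they can still be flagged for review.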

martinholmer commented 6 years ago

@andersonfrailey said:

I think it will be easier to solve this problem with spouse age [in the cps.csv.gz file] when we have everything running in Python. I'd say it will be at least two or three more weeks before I'm ready to open a pull request though. Would you say this issue is high enough priority to try and fix it in the SAS code?

No. And @evtedeschi3 seems to agree, which is far more important. So, let's wait for the Python CPS-creation code to be active and then solve the spouse_age problem identified in issue #225.