Closed martinholmer closed 5 months ago
Now that the data_source variable is included in the tmd.csv file, we can calculate gross and taxable social security benefits for those with a data_source value of one to compare with the IRS-SOI Publication 4801 tabulations of 2021 income tax returns.
2021 social security benefits ($b) and total number of returns (#m):
                IRS-SOI   tmd.csv
1040 returns    160.824   174.185
gross           791.161   888.394
taxable         412.830   511.620
So, we have too many returns, too many gross social security benefits, and too many taxable social security benefits.
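The size of each overshoot can be read off the table with a few lines of arithmetic. This is only an illustrative sketch: the figures are copied from the table above, and the dictionary and function names are made up for this example.

```python
# Figures copied from the table above: (IRS-SOI, tmd.csv with data_source == 1).
# Returns are in millions; benefit amounts are in billions of dollars.
COMPARISON = {
    "1040 returns": (160.824, 174.185),
    "gross": (791.161, 888.394),
    "taxable": (412.830, 511.620),
}


def pct_excess(irs: float, tmd: float) -> float:
    """Percent by which the tmd.csv value exceeds the IRS-SOI value."""
    return 100.0 * (tmd / irs - 1.0)


for label, (irs, tmd) in COMPARISON.items():
    print(f"{label:13s} tmd.csv exceeds IRS-SOI by {pct_excess(irs, tmd):5.1f}%")
```

On these numbers the excess is roughly 8% for returns, 12% for gross benefits, and 24% for taxable benefits, so the taxable overshoot is proportionally the largest.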
Here are the details of the tabulations of social security benefits in issue #83:
(taxcalc-dev) Tax-Calculator% tc tmd.csv 2021 --tables --exact --reform ssben.json --sqldb --dvars ssben.dvars
You loaded data for 2021.
Tax-Calculator startup automatically extrapolated your data to 2021.
(taxcalc-dev) Tax-Calculator% ls -l tmd-21-*db
-rw-r--r-- 1 mrh staff 12496896 May 25 11:49 tmd-21-#-ssben-#.db
(taxcalc-dev) Tax-Calculator% echo ".schema" | sqlite3 tmd-21-#-ssben-#.db
CREATE TABLE IF NOT EXISTS "baseline" (
"s006" REAL,
"RECID" INTEGER,
"c02500" REAL,
"FLPDYR" INTEGER,
"data_source" INTEGER,
"e02400" REAL
);
CREATE TABLE IF NOT EXISTS "reform" (
"s006" REAL,
"RECID" INTEGER,
"c02500" REAL,
"FLPDYR" INTEGER,
"data_source" INTEGER,
"e02400" REAL
);
(taxcalc-dev) Tax-Calculator% sqlite3 tmd-21-#-ssben-#.db < ssben.sql
---ALL DATA CLP ---
weights:
219.594
gross ssbens:
1212.904
taxable ssbens:
515.689
---DATA_SOURCE==1 CLP ---
weights:
174.185
gross ssbens:
888.394
taxable ssbens:
511.62
---DATA_SOURCE==1 T-E_REFORM ---
weights:
174.185
gross ssbens:
888.394
taxable ssbens:
888.394
---DATA_SOURCE==0 CLP ---
weights:
45.409
gross ssbens:
324.51
taxable ssbens:
4.069
---DATA_SOURCE==0 T-E_REFORM ---
weights:
45.409
gross ssbens:
324.51
taxable ssbens:
324.51
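The ssben.sql file itself is not shown in this thread. As a rough sketch of the kind of tabulation it performs, the following uses Python's sqlite3 module against a table with the schema shown above. The query text, the tabulate helper, and the two sample records are assumptions made for illustration; only the table and column names come from the .schema output.

```python
import sqlite3

# s006 is the unit weight, e02400 is gross social security benefits, and
# c02500 is taxable social security benefits. Scaling by 1e-6 gives millions
# of units; scaling by 1e-9 gives billions of dollars.
QUERY = """
SELECT ROUND(SUM(s006) * 1e-6, 3),
       ROUND(SUM(s006 * e02400) * 1e-9, 3),
       ROUND(SUM(s006 * c02500) * 1e-9, 3)
FROM {table}
WHERE data_source = ?
"""


def tabulate(conn, table, data_source):
    """Return (weights #m, gross ssbens $b, taxable ssbens $b) for one subgroup."""
    if table not in ("baseline", "reform"):  # table names from the schema
        raise ValueError(f"unexpected table: {table}")
    return conn.execute(QUERY.format(table=table), (data_source,)).fetchone()


# Made-up stand-in for the dump database: two records, one per data_source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE baseline (s006 REAL, RECID INTEGER, c02500 REAL,"
    " FLPDYR INTEGER, data_source INTEGER, e02400 REAL)"
)
conn.executemany(
    "INSERT INTO baseline VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1.0e6, 1, 9000.0, 2021, 1, 20000.0),
        (2.0e6, 2, 0.0, 2021, 0, 15000.0),
    ],
)

for source in (1, 0):
    weights, gross, taxable = tabulate(conn, "baseline", source)
    print(f"data_source=={source}: weights={weights} gross={gross} taxable={taxable}")
```

To run the same helper against the real dump, the in-memory connection would be replaced by a connection to the tmd-21-#-ssben-#.db file.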
Thanks for this, @martinholmer. I think there are a few issues we'll want to consider:
[...] data_source indicator after growth to 2021 by growfactors and weighting.
[...] is_filer, which applies filer rules and filing incentives (e.g., to claim a refund) to the data after growth and after tax calculation. This may be a casualty of our recent move to the tmd.csv approach because I am assuming that we no longer have year-specific values of is_filer and will want to find a way to recreate it after tax calculation. Is this correct, @nikhilwoodruff? That is, we no longer have a useful is_filer indicator? I think we're going to need a routine to construct is_filer and is_taxpayer to make it easier to compare to IRS reported statistics. Maybe we should just have something for 2021, the year for which these comparisons are particularly useful. We probably should have it for 2015, also, so that we know how different our is_filer conclusion is from data_source for that year. Maybe we should have is_filer_2015, is_filer_2021, is_taxpayer_2015, and is_taxpayer_2021 always on the file? I'm not saying we shouldn't use data_source - but I think it's valuable to have the ability to look at all of these.
[...] data_source) but does not show item 2 (filers with SS benefits) - if it's not hard to do, would you be able to add this when you have a chance? My guess is that with all filers too high, we'll be too high on SS filers, too.
[...] is_... variables would help us assess the extent to which the differences in 2021 are driven by incorrect growth to 2021 vs. incorrect values in 2015. The PUF for 2015 could be very different from what the IRS reports for 2015.
[...] is_filer_2021.
Further, the IRS data show 30.7% growth in reported total SS benefits, which is far faster than the 11.4% growth in SS returns. I'd expect that our extrapolation methods would not capture this growth and we'd be really far off on the average. Anyway, it shouldn't be too hard for us to figure out what is going on.
Another question is what to do about the very large difference between total SS benefits paid reported by the SSA and total SS benefits reported by tax filers on tax returns as reported by the IRS -- $1,133b vs. $791b. In theory, if we have a file that represents the total U.S. population, should we expect essentially all of the gap to be included in the nonfilers universe? (There could be some under-reporting by filers, but I suspect that would be minimal because SSA 1099 is received by the IRS as well as by the tax filer so it should be easy to audit plus filers should know this and be wary.) If so, it seems like we should examine the magnitude and distribution of non-filer Social Security to see if we're in the right ballpark. It's probably not a big issue in the near term but would be worth looking at for a sanity check.
Anyway, there are a lot of questions to explore here. I'm happy to pitch in on the diagnostic analyses, @martinholmer, but I don't want to be redundant, so I'll hold off for now because there are some other analyses I can work on; we can catch up on Wednesday.
@donboyd5 said among other things in issue #83:
Another question is what to do about the very large difference between total SS benefits paid reported by the SSA and total SS benefits reported by tax filers on tax returns as reported by the IRS -- $1,133b vs. $791b. In theory, if we have a file that represents the total U.S. population, should we expect essentially all of the gap to be included in the nonfilers universe?
I don't see that there is anything to be done about that. The total gross social security benefits in the 2021 tmd.csv file is about $1,213 billion, which is only modestly above the SSA administrative total of $1,133 billion. It is just that, as Dan pointed out earlier, many nonfilers are elderly people living on just social security benefits. But maybe I'm missing your point. Why exactly do you expect nonfilers to have little or no social security?
@donboyd5 said among other things in issue #83:
Getting back to Social Security, here's what IRS has for number of returns with total Social Security benefits in 2015 and 2021 -- 11.4% growth. This almost certainly is far faster than the 6-year returns growth we must have in tmd.csv weights - the population growth factor presumably was near 6% or a bit less.
Whatever is going on in the Policyengine-US data creation and the TMD weights creation, it leaves us with more gross social security benefits in 2021 than SSA reports paying in 2021.
To me the biggest problem is that while we are getting too many gross social security benefits, we are getting way too many taxable social security benefits. How is the TMD repo handling the reweighting? Is it possible that high-income elderly filers are having their weights increased, and therefore, raising the taxable social security benefit total?
I have now added the s006_original variable to the tmd.csv data file and have added that variable to the Tax-Calculator records_variables.json file so that it can be included in tc dump output. The table below consolidates the results so far:
                         TMD_WEIGHTS  ORIGINAL_WEIGHTS  AGENCY_STATISTIC
2021 ALL UNITS:
  tax units (#m)             219.594           196.143  ------
  gross ssben ($b)          1212.904          1098.870  1133.163 (SSA)
  gross sscases (#m)          48.991            44.280
  taxable ssben ($b)         515.689           443.511  ------
  taxable sscases (#m)        29.978            25.272
2021 PUF UNITS:
  tax units (#m)             174.185           150.828  160.824 (IRS)
  gross ssben ($b)           888.394           774.580  791.161 (IRS)
  gross sscases (#m)          32.684            27.982
  taxable ssben ($b)         511.620           439.512  412.830 (IRS)
  taxable sscases (#m)        27.451            22.754
2021 CPS UNITS:
  tax units (#m)              45.409            45.315  ------
  gross ssben ($b)           324.510           324.290  ------
  gross sscases (#m)          16.307            16.298
  taxable ssben ($b)           4.069             3.999  ------
  taxable sscases (#m)         2.527             2.518
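To see how much of each total comes from the reweighting, the two weight columns can be compared directly. The sketch below only restates figures from the table above; the dictionary layout and the reweighting_change helper are made up for this illustration.

```python
# (TMD weights, original weights) for 2021, copied from the table above.
TABLE = {
    "ALL": {
        "tax units (#m)": (219.594, 196.143),
        "gross ssben ($b)": (1212.904, 1098.870),
        "taxable ssben ($b)": (515.689, 443.511),
    },
    "PUF": {
        "tax units (#m)": (174.185, 150.828),
        "gross ssben ($b)": (888.394, 774.580),
        "taxable ssben ($b)": (511.620, 439.512),
    },
    "CPS": {
        "tax units (#m)": (45.409, 45.315),
        "gross ssben ($b)": (324.510, 324.290),
        "taxable ssben ($b)": (4.069, 3.999),
    },
}


def reweighting_change(tmd: float, original: float) -> float:
    """Percent change in a total when moving from original to TMD weights."""
    return 100.0 * (tmd / original - 1.0)


for group, rows in TABLE.items():
    for label, (tmd, original) in rows.items():
        change = reweighting_change(tmd, original)
        print(f"{group} {label:20s} {change:+5.1f}%")
```

On these figures nearly all of the change is in the PUF units, where every total rises by roughly 15% to 16%, while the CPS units move by less than 2% on every measure. With original weights, PUF taxable benefits of $439.512 billion are about 6.5% above the IRS figure of $412.830 billion, compared with about 24% above when using TMD weights.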
@donboyd5 said among other things in issue #83:
Another question is what to do about the very large difference between total SS benefits paid reported by the SSA and total SS benefits reported by tax filers on tax returns as reported by the IRS -- $1,133b vs. $791b. In theory, if we have a file that represents the total U.S. population, should we expect essentially all of the gap to be included in the nonfilers universe?
I don't see that there is anything to be done about that. The total gross social security benefits in the 2021 tmd.csv file is about $1,213 billion, which is only modestly above the SSA administrative total of $1,133 billion. It is just that, as Dan pointed out earlier, many nonfilers are elderly people living on just social security benefits. But maybe I'm missing your point. Why exactly do you expect nonfilers to have little or no social security?
Sorry, I didn't mean to imply that nonfilers would have little or no Social Security. I was trying to say that because they have a lot, it's important to examine the magnitude and distribution of nonfiler Social Security. Your note here says the magnitude is reasonably close. I think at some point, we also want to think about the distribution of nonfiler Social Security. Because we don't have IRS tables for that, I suppose it comes down to examining the distribution we have in comparison to CPS distribution. Certainly not a near term issue as we have bigger fish to fry.
I have now added the s006_original variable to the tmd.csv data file and have added that variable to the Tax-Calculator records_variables.json file so that it can be included in tc dump output.
This is really helpful, thank you.
@donboyd5 said among other things in issue #83:
Getting back to Social Security, here's what IRS has for number of returns with total Social Security benefits in 2015 and 2021 -- 11.4% growth. This almost certainly is far faster than the 6-year returns growth we must have in tmd.csv weights - the population growth factor presumably was near 6% or a bit less.
Whatever is going on in the Policyengine-US data creation and the TMD weights creation, it leaves us with more gross social security benefits in 2021 than SSA reports paying in 2021.
To me the biggest problem is that while we are getting too many gross social security benefits, we are getting way too many taxable social security benefits. How is the TMD repo handling the reweighting? Is it possible that high-income elderly filers are having their weights increased, and therefore, raising the taxable social security benefit total?
Yes, agreed. I think this should be part of Wednesday's call. We should be able to break it down into three pieces:
We have a little intelligence on this now, but not enough:
Now, here's what the IRS data tell us (see table in previous comment):
IRS shows about 17% growth in the per-return total and 11% in the number of returns, for total growth of about 31%. Comparing the ASOCSEC 0.5% decline to IRS 17%, it seems like we have way too little ASOCSEC growth. All else equal, this worsens our problem, of course, because in the end we have too much gross and way too much taxable SS benefits.
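The growth figures fit together multiplicatively: total growth is the product of growth in the number of returns and growth in the per-return amount. A minimal sketch of that arithmetic, using the 30.7% and 11.4% IRS figures quoted earlier in this thread (the function name is made up for this example):

```python
def per_return_growth(total_growth: float, returns_growth: float) -> float:
    """Per-return growth implied by total growth and growth in return counts.

    Uses (1 + total) = (1 + returns) * (1 + per_return), rates as fractions.
    """
    return (1.0 + total_growth) / (1.0 + returns_growth) - 1.0


irs_per_return = per_return_growth(0.307, 0.114)
asocsec = -0.005  # the 2015-2021 ASOCSEC change cited in this thread

print(f"IRS per-return growth:  {100 * irs_per_return:.1f}%")
print(f"ASOCSEC per-record:     {100 * asocsec:.1f}%")
print(f"gap (percentage points): {100 * (irs_per_return - asocsec):.1f}")
```

The implied per-return growth is a little over 17%, which is where the "about 17%" figure comes from.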
We can't yet compare this properly to our tmd data - we need to compare weighted tmd sum of e02400 in 2015 using original-weights to weighted sum in 2021, after growfactors (ASOCSEC) using original-weights for 2021. I can do that but not until tomorrow.
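One way to set up that comparison is sketched below. It assumes each year's records are available as dictionaries carrying e02400 and the s006_original weight; how the 2015 and 2021 records are actually obtained is not covered here, and the sample records are made up purely to exercise the functions.

```python
def weighted_total(records, var="e02400", weight="s006_original"):
    """Weighted sum of `var` across records, in billions of dollars."""
    return sum(rec[weight] * rec[var] for rec in records) * 1e-9


def growth(total_start: float, total_end: float) -> float:
    """Fractional growth from the starting total to the ending total."""
    return total_end / total_start - 1.0


# Made-up records: NOT taken from tmd.csv.
records_2015 = [
    {"s006_original": 1000.0, "e02400": 10000.0},
    {"s006_original": 2000.0, "e02400": 20000.0},
]
records_2021 = [
    {"s006_original": 1100.0, "e02400": 12000.0},
    {"s006_original": 2200.0, "e02400": 22000.0},
]

total_2015 = weighted_total(records_2015)
total_2021 = weighted_total(records_2021)
print(f"weighted e02400 growth: {100 * growth(total_2015, total_2021):.1f}%")
```

The resulting growth rate would then be compared with the roughly 31% growth in total benefits shown in the IRS data.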
By targeting # returns and weighted-sum-AGI by AGI range, and not targeting total SS income (and other income totals), and by not telling the algorithm to penalize changes in weights, the algorithm will pick any old weight adjustments that hit AGI and # returns targets, quite possibly jerking around SS filers in the process. Anyway, we should be able to figure all of this out.
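The point about not penalizing weight changes can be illustrated with a toy loss function. This is not the TMD reweighting code: the function, the three records, and the targets are all invented for the example. It shows two sets of weights that hit the same returns and AGI targets, yet give very different totals for an untargeted variable.

```python
def reweighting_loss(weights, orig_weights, agi, agi_target, returns_target,
                     penalty=0.0):
    """Squared relative target misses plus an optional weight-change penalty."""
    est_returns = sum(weights)
    est_agi = sum(w * a for w, a in zip(weights, agi))
    target_miss = ((est_returns / returns_target - 1.0) ** 2
                   + (est_agi / agi_target - 1.0) ** 2)
    weight_change = sum((w / w0 - 1.0) ** 2
                        for w, w0 in zip(weights, orig_weights)) / len(weights)
    return target_miss + penalty * weight_change


# Three made-up records; only the second one has social security benefits.
orig = [1.0, 1.0, 1.0]
agi = [10.0, 20.0, 30.0]
ssben = [0.0, 30.0, 0.0]
returns_target, agi_target = 3.3, 66.0

uniform = [1.1, 1.1, 1.1]   # scales every weight by 10%
erratic = [1.6, 0.1, 1.6]   # hits the same two targets, guts the SS record

for name, w in (("uniform", uniform), ("erratic", erratic)):
    no_pen = reweighting_loss(w, orig, agi, agi_target, returns_target)
    with_pen = reweighting_loss(w, orig, agi, agi_target, returns_target, 1.0)
    ss_total = sum(wi * s for wi, s in zip(w, ssben))
    print(f"{name}: loss={no_pen:.4f} penalized={with_pen:.4f} ss={ss_total:.1f}")
```

Without the penalty both candidates look equally good to the optimizer; with it, the uniform adjustment is clearly preferred, and the untargeted social security total is far less distorted.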
@donboyd5 said in the discussion of issue #83:
Impact of growfactors from 2015-2021. As I read Tax-Calculator (taxcalc/records.py lines 273-372), ASOCSEC is used to adjust e02400 (Total social security (OASDI) benefits) on line 314. I believe this is the per-record adjustment. I do not see an adjustment for population growth (a factor to be applied to the weights) but I presume there must be one. (If there is, would you mind pointing me to where it is applied?) If I read taxdata properly, lines 48-49 of factors_finalprep.py adjust ASOCSEC by something called elderly_pop, but I don't see a similar adjustment in Tax-Calculator - it's clear I don't understand what's going on. Anyway, ASOCSEC in both the tmd and taxdata repos declines by 0.5% between 2015 and 2021 (tmd shown); this, I believe is just the per-record amount, not the total amount:
Don, I don't think any of the Tax-Calculator growfactors are used in preparing the 2021 tmd.csv file. So, what you say above is not relevant to this issue.
Closing issue #83 wrt social security benefits because a soon-to-be-available version of the repository will produce different results, which will be included in a new issue.
Here are some aggregate 2021 statistics related to total social security benefits.
From the OASI benefit amounts and DI benefit amounts links at SSA OCACT benefits page, we have:
The $1133 billion total social security (OASDI) benefits paid during 2021 compares with the weighted sum of total social security benefits (e02400) from the tmd.csv file of about $1212.9 billion.
So, the tmd.csv file has aggregate social security benefits that are about 7% larger than the benefits actually paid.
The IRS-SOI Publication 4801 tabulations of 2021 income tax returns have $791.161 billion in total social security benefits and $412.830 billion in taxable social security benefits.
Unfortunately, given the data problem described in issue #78, we cannot yet tabulate the tmd.csv file for the PUF-based subtotal of total social security benefits to compare with the $791.161 billion figure.
The 3.6.0 release of Tax-Calculator generates, under current-law policy, the following tmd-21-#-#-#.csv dump output file:
So, the current version of tmd.csv generates $515.689 billion in taxable social security benefits, which is about 25% above the IRS-SOI tabulated $412.830 billion. But again, without valid values for data_source in the tmd.csv file, we have no idea how much of the $515.689 billion is attributable to those who file income taxes, and therefore, are in the IRS-SOI PUF microdata file.