Defining filing status - Githubissues

This issue was created to discuss the best approaches to defining filing status. taxdata already does a reasonable job of defining non-filers and filers that is helpful in appending non-filers to the PUF using CPS data. However, Tax-Calculator makes no use of the filer variable (and it has been renamed to data_source). @donboyd5 has recently been trying to impute filing status for the purposes of validating some Tax-Calculator output and has run into some issues in accurately defining filer status (see Tax-Calculator Issue #2501).

As I try to get my head around the issue of defining filer status, I wanted to outline where filer status is (or might be) useful in taxdata and Tax-Calculator and ask some questions I have about the issue.

Where knowing filing status is (or could be) useful:

1. To create a puf.csv file that represents both filers and non-filers. (Note: filing status is currently used this way in taxdata.)
2. To age the base data, as some targets may represent only the population of tax filers. (Note: filing status is currently used this way in taxdata.)
3. In Tax-Calculator output (?). For example, it seems that (a) changes in the number of filers might be an important result to examine for a policy reform, and (b) some aggregates may depend on knowing who is or isn't a filer. (Note: filing status is currently not used in any way in Tax-Calculator.)

Why determining filing status is difficult:

There are fairly specific rules for who is required to file, but everyone has the option to file. Thus, identifying voluntary filers from data like the CPS is uncertain (presumably all filers in the PUF filed either because they were required to or because they chose to). taxdata takes a partially probabilistic approach to identifying these filers (see here), while @donboyd5's approach here takes a deterministic view, assigning likely filers as filers.
Thresholds for determining filing status change over time under current law (e.g., because of adjustments to the nominal dollar amount that determines the income threshold for filing). While the change in this threshold over time is known (and parameterized through 2018 in filing_rules.json) and thus can be accounted for, I'm not sure if this is done when the base data are aged or extrapolated. Also, if filing status for records is allowed to change over time, aging can become complicated in that there can potentially be interactions between the determination of filing status, weighting, and growth factors in the targeting process when aging data.
If one wants to look at output from Tax-Calculator that depends on filing status (see (3) above), then one needs to know how changes in policy parameters affect filing status.
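A minimal sketch of how a year-indexed filing threshold changes who is required to file. The table layout, the function names, and the dollar amounts are illustrative assumptions; this is not the structure or content of filing_rules.json.

```python
# Illustrative gross-income filing thresholds by year and status.
# Placeholder values for the sketch, NOT read from filing_rules.json.
# Years are assumed to be contiguous.
THRESHOLDS = {
    2017: {"single": 10_400, "joint": 20_800},
    2018: {"single": 12_000, "joint": 24_000},
}


def filing_threshold(year, status):
    """Look up the threshold, clamping to the first/last parameterized year."""
    first, last = min(THRESHOLDS), max(THRESHOLDS)
    clamped_year = min(max(year, first), last)
    return THRESHOLDS[clamped_year][status]


def required_to_file(gross_income, year, status):
    """True if gross income meets or exceeds that year's threshold."""
    return gross_income >= filing_threshold(year, status)
```

With these placeholder values, a single filer with $11,000 of gross income is required to file in 2017 but not in 2018, which is the kind of status change that would interact with weighting and growth factors during aging.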

Some questions:

1. @andersonfrailey: Does taxdata assume that once a household is identified as a filer/non-filer in the base year data that it retains that status forever (e.g., even if the blowup factors push gross income over the filing threshold for some future year)? My read is that it is constant, but I could be missing something.
2. @MattHJensen: Can you envision users of Tax-Calculator caring about filing status?
3. @MattHJensen: Should some Tax-Calculator output depend on filing status (e.g., does TC currently compute some tax liability for non-filers?)?
4. @andersonfrailey: When aging/extrapolating data it seems that one would need to account for changes in tax law since the year of the base data file through present (e.g., because of changes in definitions of income/deduction items, rate changes, etc.). Would you agree? How does taxdata handle that? Or are all targeted moments independent of tax law?

My few thoughts:

I think all of the reasons you give for having filing status in taxdata are important; I do think people will care about how a reform changes the number of filers.
I thought the required/likely deterministic approach worked well.
It seems to me you file because:
1. you are required to based on IRS income thresholds
2. you must file because you have income tax liability
3. you are eligible for a refundable credit and have to file to get it
4. you aren't eligible for a refundable credit and fraudulently file to get one
5. you earned wages and therefore paid withholding (or, less likely, estimated tax) but have no income tax and want to claim a refund
6. you didn't pay withholding but fraudulently file to claim a refund
I think categories 1 and 2 should be treated as deterministic. I guess you might be able to find some data on the proportion, and maybe distribution, of required filers who do not file and of people who file based on an erroneous understanding of the rules, but it doesn't seem worth a lot of effort to me. I treated them as deterministic. The only issue I ran into was that I didn't have the time/bandwidth to define the Social Security component of gross income for filing purposes perfectly, although I think I defined it well. I experimented with variants of what I did and the variants had minuscule impacts on the numbers of filers. I think it is worth being perfect if someone has the time.
I am quite sure that some people in categories 3 and 5 for whom filing might make financial sense do not file, for several reasons, including (a) fear of immigration issues, and (b) a de minimis potential refund or credit that is not worth the effort. I am not sure how you would get information on the distribution of these people in order to do it probabilistically. I treated 3 as deterministic - if Tax-Calculator says you're eligible, then you file. I treated 5 as deterministic but with a de minimis wage threshold - having wages would only drive you to file if you had at least $1k of wages. Obviously this was an arbitrary judgment. This certainly could be done better - for example, by examining withholding tables to make a better-informed judgment about the amount of withholding paid at different wage levels, and using that to help set a wage threshold. I would base the threshold on wages in the data rather than withholding in the data, on the assumption that 2011 withholding provides no useful information for the years we care about and, besides, we don't have it for the records most likely to be brought in (those created from the CPS). There also might be external information from UI or CPS data that would provide insight into the number of low-wage workers/families in those data vs. the number of low-wage taxpayers in SOI data, which might be helpful in doing this probabilistically. For my purposes I was satisfied with deterministic.
We know there are people in categories 4 and 6 but I ignored them.
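The deterministic rule described above (categories 1, 2, 3 and 5, ignoring 4 and 6) could be sketched roughly as follows. The record fields and function names are hypothetical, chosen for illustration; they are not taxdata or Tax-Calculator variable names, and the $1k cutoff is the arbitrary judgment mentioned in the comment.

```python
from dataclasses import dataclass
from typing import List

# De minimis wage cutoff from the comment above (an arbitrary judgment).
DE_MINIMIS_WAGES = 1_000.0


@dataclass
class Record:
    gross_income: float        # gross income as defined for the filing requirement
    filing_threshold: float    # gross-income threshold for this unit's status and year
    income_tax: float          # income tax liability before refundable credits
    refundable_credits: float  # refundable credits the unit is eligible for
    wages: float               # wage and salary income


def filing_reasons(rec: Record) -> List[str]:
    """Return the deterministic reasons (categories 1, 2, 3, 5) that apply."""
    reasons = []
    if rec.gross_income >= rec.filing_threshold:
        reasons.append("required_by_threshold")      # category 1
    if rec.income_tax > 0:
        reasons.append("has_tax_liability")          # category 2
    if rec.refundable_credits > 0:
        reasons.append("refundable_credit")          # category 3
    if rec.wages >= DE_MINIMIS_WAGES:
        reasons.append("likely_withholding_refund")  # category 5, wages as a proxy
    return reasons


def is_filer(rec: Record) -> bool:
    """A record is a filer if at least one deterministic reason applies."""
    return bool(filing_reasons(rec))
```

Returning the list of reasons rather than a bare flag makes it easy to tabulate how many filers each category brings in, and to swap one category for a probabilistic rule later.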

On your item 4 it seems to me the proper sequence is:

start with a given year - let's say 2014 to be concrete and let's assume that the steps below have already been done for earlier years - calculate tax law for 2014, calculate filer status for that year under whatever rules you have for that year
decide upon growfactors in moving from 2014 to 2015. Presumably some are informed by SOI data, which by definition is only available for filers, and others by data which may only be available for everyone. For convenience/lack of data you may end up having the same growfactors for everyone - I would expect that - but I suppose that would not be necessary; using different growfactors for filers and non-filers would require knowledge and confidence that are hard to have
grow the dollar values, etc. to 2015
calculate tax for 2015 under rules for that year, calculate filers under rules for that year
repeat until done, meaning you have an unweighted file for each year with tax law for that year and filers for that year
Now develop weights for each year, weighting the filer records so that they approximate what we know in great detail (until the end of published data) about filers, and weighting nonfilers to what we know about them, which probably is far less detail
Presumably (I would advocate) this would be done differently for different years - use the best data you have for targets, until you run out of published targets, and for later years weight to hit a far smaller set of aggregates and impose some distributions (a combination of theoretical and estimated empirical) on the results for the pure forecast years. You might not make the filer/non-filer distinction in pure forecast years.
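The unweighted part of this sequence (everything before the weighting step) might look like the sketch below. `age_records`, `compute_tax`, and `is_filer` are hypothetical names: the two callables stand in for a tax calculation and a filer rule for a given year, and nothing here is taken from the taxdata or Tax-Calculator code.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def age_records(
    base: List[Record],
    base_year: int,
    end_year: int,
    growfactors: Dict[int, Dict[str, float]],
    compute_tax: Callable[[Record, int], float],
    is_filer: Callable[[Record, int], bool],
) -> Dict[int, List[Record]]:
    """Build one unweighted file per year, with tax and filer status for that year.

    `base` records are assumed to hold only dollar amounts keyed by variable
    name; growfactors[year][name] is the factor applied when moving from
    year - 1 to year (variables without a factor are left unchanged).
    """
    files: Dict[int, List[Record]] = {}
    current = [dict(r) for r in base]
    for year in range(base_year, end_year + 1):
        if year > base_year:
            factors = growfactors[year]
            current = [
                {name: value * factors.get(name, 1.0) for name, value in r.items()}
                for r in current
            ]
        year_file = []
        for r in current:
            rec = dict(r)                        # keep derived fields out of `current`
            rec["tax"] = compute_tax(rec, year)  # tax under that year's law
            rec["filer"] = is_filer(rec, year)   # filer status under that year's rules
            year_file.append(rec)
        files[year] = year_file
    return files
```

Weights would then be developed separately for each year's file, with filer records targeted to the detailed SOI data and non-filer records targeted to whatever is known about them.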

@MattHJensen Can you envision users of Tax-Calculator caring about filing status?

@jdebacker, I'm not sure what you mean here, but let me try to answer the best I can and please let me know if this isn't what you are looking for:

It seems many or most Tax-Calculator users should care about filing status indirectly, such as in the preparation of their input data. I can also see why policymakers might care about how policy reforms influence the number of filers and similarly the number of taxpayers required to file. As a policy process observer, though, I've seen much more attention paid to the number of taxpayers with positive IIT or IIT+FICA liability, a related but distinct concept. This came up, for instance, in Mitt Romney's 47% remarks.

@MattHJensen Should some Tax-Calculator output depend on filing status (e.g., does TC currently compute some tax liability for non-filers?)?

Tax-Calculator computes liabilities for any tax records given to it and then includes all records in its output. I think this is the right thing to do.

taxdata takes a partially probabilistic approach to identifying these filers

Note that taxdata has distinct approaches for identifying filers on the PUF and the CPS. The quote here is true for identifying which CPS records are filers or non-filers, for the purpose of deciding whether they should be matched to PUF records or added to the file without a PUF match. All PUF-derived records, however, are assumed to be filers in every year.

On your item 4 it seems to me the proper sequence is:

start with a given year - let's say 2014 to be concrete and let's assume that the steps below have already been done for earlier years - calculate tax law for 2014, calculate filer status for that year under whatever rules you have for that year

...

Don's sequence makes great sense to me. To follow it exactly, though, I think we'll need to add 2011 and 2012 law to Tax-Calculator or buy a dataset from 2013 or later.

Does taxdata assume that once a household is identified as a filer/non-filer in the base year data that it retains that status forever (e.g., even if the blowup factors push gross income over the filing threshold for some future year)? My read is that it is constant, but I could be missing something.

@jdebacker Yes. Once taxdata has labeled a household as filer/non-filer, it keeps that status forever. I'd be interested in maybe integrating tax-calc into our extrapolation routine to calculate taxable income in each year and then using that to determine who would be required to file and who may be doing so voluntarily.
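As a rough way to gauge how much the frozen label matters, one could count base-year non-filers whose grown income crosses a future-year threshold. This is only a sketch: the field names `gross_income` and `filer_base_year`, and the single blowup factor and threshold, are simplifying assumptions rather than anything in taxdata.

```python
def newly_required_filers(records, blowup, threshold):
    """Count base-year non-filers whose grown gross income reaches the threshold."""
    count = 0
    for rec in records:
        grown_income = rec["gross_income"] * blowup
        if grown_income >= threshold and not rec["filer_base_year"]:
            count += 1
    return count
```

A count near zero would suggest the constant-status assumption is harmless for that year; a large count would argue for recomputing status during extrapolation.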

When aging/extrapolating data it seems that one would need to account for changes in tax law since the year of the base data file through present (e.g., because of changes in definitions of income/deduction items, rate changes, etc.). Would you agree? How does taxdata handle that? Or are all targeted moments independent of tax law?

taxdata doesn't worry about changes in tax law when extrapolating. We kind of outsource that to the CBO projections that are used to calculate growth factors. So if, for example, what counts as capital gains changes, we'll just assume that CBO will bake that into their total capital gains projections and it'll show up in our growth rates.
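To illustrate the idea of letting aggregate projections carry any law changes, here is a simplified sketch that turns a series of projected totals into year-over-year growth factors. The function name and the input layout are assumptions, and taxdata's actual growth-factor calculation may differ (for example, by adjusting for population growth).

```python
def growth_factors(projected_totals):
    """Year-over-year growth factors from projected aggregate totals.

    `projected_totals` maps year -> projected aggregate (e.g., total capital
    gains). Any change in what the aggregate covers is assumed to be already
    reflected in the projection itself.
    """
    years = sorted(projected_totals)
    return {
        year: projected_totals[year] / projected_totals[prev]
        for prev, year in zip(years, years[1:])
    }
```

For example, projected totals of 700 in 2014 and 735 in 2015 give a 2015 growth factor of 1.05, which would be applied to every record's value for that variable.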

PSLmodels / taxdata

Defining filing status #366