Extrapolate welfare data

PSLmodels / taxdata

The TaxData project prepares microdata for use with the Tax-Calculator microsimulation project.

http://pslmodels.github.io/taxdata/

Extrapolate welfare data #106

Closed Amy-Xu closed 7 years ago

Amy-Xu commented 7 years ago

The third task outlined in this issue is to develop an extrapolation routine for welfare data in the CPS tax-unit dataset. An initial thought is to assume that, for each program, participation and benefits grow at X and Y percent each year, respectively, where X and Y are derived from historical data. (If official projection targets are available, then we could use those targets directly.) Then we could use the same logit regression used for imputation to meet the targets for participation growth, and then apply a uniform ratio to everyone in order to blow up the total benefit.

Many details need to be considered, but for now the trickiest part is whether to do this extrapolation on the tax unit or on the original program benefit unit (individual/household) in the raw CPS. The individual or household level is natural since all projections and historical data would be available at these levels; however, this would create enormous difficulty afterwards because the raw CPS needs to go through the tax-unit creation process. The weights of records do not stay the same over time, and thus extrapolation based on 2014 raw CPS weights cannot guarantee hitting the targets in later years. On the other hand, extrapolating the data at the tax-unit level would make later steps easier, but there aren't any targets or historical welfare data at the tax-unit level.

Other things to consider:

Should we consider population growth the same way as tax-data extrapolation?
Non-filer targets? Since many benefit programs would require non-filer numbers to be accurate.

Any thoughts? @MattHJensen @martinholmer @andersonfrailey @hdoupe

Amy-Xu commented 7 years ago

Have been collecting targets & historical data, and will keep updating this spreadsheet.

Amy-Xu commented 7 years ago

When clarifying this extrapolation problem to @hdoupe last Friday, I realized there might be a way to do the extrapolation at 'tax unit' level without participation targets at tax unit level. I thought it through again and haven't seen any loophole yet. So would love to hear how everyone feels about this plan:

Attach an individual-level probability of program participation (derived from a logit regression on 2014 data) to the CPS dataset during the tax-unit creation process, and in this way obtain each tax-unit member's likelihood of participating in one program or another.
Add or remove participation according to the attached probability to meet the participation targets, presumably derived from official projections or historical data (see the spreadsheet from the comment above). Since this imputation is based on the weights of all tax units in each of the 10 years, we need no adjustment afterward.
Then adjust the benefit total according to the targets.
Keep the participation & benefit with CPS or separately, and drop the probability variables in the final cleaning script.
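
A minimal sketch of the add/remove and benefit-adjustment steps above, not the script later posted in this thread. The function names are hypothetical, the probabilities are assumed to come from the logit regression already, and `benefit` is assumed to hold an imputed amount for every record (including current non-participants) so that newly added participants have something to scale.

```python
def adjust_participation(prob, participates, weight, target):
    """Return new participation flags whose weighted count approaches target.

    prob         -- imputed participation probability per record
    participates -- current True/False participation flag per record
    weight       -- record weight for the year being extrapolated
    target       -- weighted participant count to aim for
    """
    flags = list(participates)
    current = sum(w for w, f in zip(weight, flags) if f)
    if current < target:
        # Add non-participants, most likely first, without overshooting.
        candidates = sorted(
            (i for i, f in enumerate(flags) if not f),
            key=lambda i: prob[i], reverse=True)
        for i in candidates:
            if current + weight[i] > target:
                break
            flags[i] = True
            current += weight[i]
    else:
        # Remove participants, least likely first, until at or below target.
        candidates = sorted(
            (i for i, f in enumerate(flags) if f), key=lambda i: prob[i])
        for i in candidates:
            if current <= target:
                break
            flags[i] = False
            current -= weight[i]
    return flags


def scale_benefits(benefit, flags, weight, target_total):
    """Zero out non-participants, then apply one uniform ratio to hit target_total."""
    kept = [b if f else 0.0 for b, f in zip(benefit, flags)]
    raw_total = sum(b * w for b, w in zip(kept, weight))
    ratio = target_total / raw_total  # assumes raw_total > 0
    return [b * ratio for b in kept]
```

Running both functions once per year with that year's weights is what would make a later adjustment unnecessary; the probability inputs can then be dropped, as the last step describes.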

Looks feasible? Any thoughts? @MattHJensen @martinholmer @andersonfrailey

ps the spreadsheet has been updated to include SSI, SNAP, VB, Social Security, Medicare and Medicaid.

andersonfrailey commented 7 years ago

@Amy-Xu, overall I think this sounds feasible. A couple of questions.

Attach an individual-level probability of program participation (derived from a logit regression on 2014 data) to the CPS dataset during the tax-unit creation process, and in this way obtain each tax-unit member's likelihood of participating in one program or another.

So during the tax-unit creation process, there will be one additional variable for each person in the household containing their participation probability. Would we also be keeping the total amount received in separate variables as well as the one aggregate so we can add/subtract benefits on an individual basis?

Keep the participation & benefit with CPS or separately, and drop the probability variables in the final cleaning script.

Is your idea to have a column in the CPS dedicated to each of the variables for each of the years in the weights file, and then add a function in Tax-Calculator that will specify which one will be used in each year?

Amy-Xu commented 7 years ago

@andersonfrailey asked:

Would we also be keeping the total amount received in separate variables as well as the one aggregate so we can add/subtract benefits on an individual basis?

Good question! I hadn't thought about this part. Certainly it would be easier to have the individual-level dollar amounts on the side, as long as the workload is not too much. So for each program, in addition to the one aggregate, we'll have an individual-level participation probability and dollar amount of benefit received. Depending on how many people a tax unit has, we might have up to twice the unit size in additional variables for each program. If that's too much work, I think we can also make it work without the individual dollar amounts; just subtracting an evenly split amount should be fine as well.

Is your idea to have a column in the CPS dedicated to each of the variables for each of the years in the weights file, and then add a function in Tax-Calculator that will specify which one will be used in each year?

That's right. I would prefer those new variables in a separate file though.

Amy-Xu commented 7 years ago

A quick update on the issue: I drafted an initial version of extrapolation, and Hank @hdoupe revamped it and significantly improved the efficiency of the algorithm. Posting his script here and welcome any feedback!

martinholmer commented 7 years ago

@Amy-Xu said:

A quick update on the issue: I drafted an initial version of extrapolation, and Hank @hdoupe revamped it and significantly improved the efficiency of the algorithm. Posting his script here and welcome any feedback!

Thanks, but I haven't been following this issue at all. Can you step back and explain what the implications of this advance are for the forthcoming cps.csv file? What output from the new script will be in the cps.csv or related file? Or, am I hopelessly confused and this has nothing to do with the cps.csv and related files?

@hdoupe @andersonfrailey

Amy-Xu commented 7 years ago

@martinholmer asked:

Can you step back and explain what the implications of this advance are for the forthcoming cps.csv file? What output from the new script will be in the cps.csv or related file?

The forthcoming cps.csv file will include welfare program data and this new script will create extrapolated participation and benefits for each year all the way to 2026.

All the outputs, I imagine, will be saved in a separate file and transferred to the Tax-Calculator once cps.csv is ready. The separate file will work in a similar way as puf_weights.csv, such that in each year the benefits of each tax unit will be replaced with the extrapolated values generated from the new script here.
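
A rough sketch of how such a separate file might be consumed, assuming a hypothetical layout with one column per year and one row per tax unit, in the same row order as cps.csv. Neither the layout nor the function names come from Tax-Calculator; they are placeholders for illustration.

```python
import csv
import io


def load_extrapolated_benefits(csv_text):
    """Parse a wide CSV with one column per year into {year: [benefit per record]}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    by_year = {int(name): [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            by_year[int(name)].append(float(value))
    return by_year


def benefits_for_year(base_benefits, by_year, year):
    """Use the extrapolated column when the year is present, else the base-year values."""
    return by_year.get(year, base_benefits)
```

Advancing to a new year would then mean swapping in that year's column, much as a weights file supplies a new weight column per year.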

martinholmer commented 7 years ago

@Amy-Xu said in taxdata issue #106:

Have been collecting targets & historical data, and will keep updating this spreadsheet.

When I look at this spreadsheet, it contains projections of federal average benefit amounts for SSI. Those don't seem like very good extrapolation targets given that most states supplement the federal SSI benefit.

This reference has this to say about state supplementation: However, in most states, SSI recipients receive an additional supplementary payment from their state, giving them a monthly benefit amount that's higher than the federal amount ($735 in 2017). Every state except Arizona, Mississippi, North Dakota, and West Virginia currently pays a state supplement to its disabled residents who receive SSI. Each state makes up its own rules about how much the monthly supplement is and who is entitled to the supplement. The amount of the state supplement ranges from $10 to $200, depending on the state.

And most SSI beneficiaries are disabled. Here is the SSI beneficiary count for 2015 from SSA: blind and disabled: 7.041 million; aged: 1.101 million; total: 8.142 million.

Given these very rough extrapolation targets (and the targets for the other programs will be even more speculative), why are you considering such an elaborate extrapolation method? The script developed by @hdoupe is impressive in its logic, but it seems way too ambitious given the weak information available to serve as extrapolation targets. Isn't there a simpler method? Seems like if we have limited and biased information there is little point in processing that information using elaborate algorithms. Is there a less elaborate method that is in better balance with the rough extrapolation targets?

Looking at the other benefits variables, I can't even imagine how speculative any SNAP projection would be. The cost of the future program depends on possible legislative changes and the state of the macro economy. Who knows what either of those are going to be like through 2026?

@hdoupe @andersonfrailey @MattHJensen

Amy-Xu commented 7 years ago

@martinholmer said:

When I look at this spreadsheet, it contains projections of federal average benefit amounts for SSI. Those don't seem like very good extrapolation targets given that most states supplement the federal SSI benefit.

Right, I definitely agree with you that SSI includes both federal and state components. We have not started on documentation for this routine, but what we applied for SSI in the extrapolation is not the federal targets; instead, we applied the federal benefit growth rates to the adjusted federal and state benefits in 2014. In other words, we currently assume state benefits grow at the same pace as federal benefits, which is certainly not a perfect assumption; however, it might be the best assumption so far given the scarcity of state-level benefit information.
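
To make that assumption concrete, here is a small sketch with a hypothetical function name: the combined federal-plus-state base amount is grown at the rate implied by a federal projection series, which is the same as assuming the state share grows at the federal pace.

```python
def project_combined_benefit(base_combined, federal_series):
    """Grow a combined federal+state base amount at the federal series' growth rate.

    base_combined  -- federal plus state benefit amount in the base year
    federal_series -- {year: projected federal benefit}; the earliest year
                      is treated as the base year
    """
    years = sorted(federal_series)
    base_year = years[0]
    return {
        year: base_combined * federal_series[year] / federal_series[base_year]
        for year in years
    }
```

For example, if the federal series rises 2 percent from the base year to the next, the combined amount is also raised 2 percent, whatever the actual state supplement does.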

Martin also asked:

Given these very rough extrapolation targets (and the targets for the other programs will be even more speculative), why are you considering such an elaborate extrapolation method?

This is also a question I have been asking myself for a while. Surely, other than a few humongous programs like Social Security or Medicare, projections for most welfare programs are highly speculative. Given the rough targets, what are the pros and cons of implementing an elaborate vs. simple extrapolation routine? In my mind, the pros of an elaborate routine are that 1) it matches the assumed targets better, and 2) if targets are improved in the future we can still use the routine without revamping it much, so the replacement cost is lower; at the same time, the cons might be that 1) it might take longer to develop than a simpler one, 2) it takes longer to run, and 3) it is possibly more difficult to maintain.

But the cons have not really materialized. It took me one day to write the draft for SSI, and one day or so for Hank to improve the algorithm. Originally the script took ~5 min to run and now it takes ~1 min after Hank revamped it. I imagine it needs a few more tweaks to fit the data of all the other programs, and the major portion of time would be spent on the CPS tax-unit side making sure the aggregate totals are right, instead of modifying the scripts.

I'm happy to discuss this more. Particularly if you have a simpler routine in mind, I would love to hear more about it.

martinholmer commented 7 years ago

@Amy-Xu said:

I'm happy to discuss this more. Particularly if you have a simpler routine in mind, I would love to hear more about it.

It's good you think about the pros and cons of the method you're using now, but you forgot to include the con having to do with all the extra work that would be required in Tax-Calculator.

I admit I might not completely understand all your goals, but I would like to suggest a simpler approach that would reduce work in C-TAM and taxdata and reduce work in Tax-Calculator.

You've done a good job getting CPS dollar benefit amounts to add up to administrative totals for 2014. And you have aggregate benefit totals you would like to come close to in years after 2014. You also have the CPS weights for each year after 2015. Couldn't you do something like this for each benefit variable?

For YEAR (ranging from 2015 through 2026), tabulate the weighted benefit total for YEAR using the 2014 benefit amounts and the weights for YEAR. Call this total R, for raw. Let the administrative target for this benefit in YEAR be called T, for target. Then the extrapolation factor for YEAR for that benefit is F = T / R.

Each of your non-OASDI benefit variables can have their own personalized factors and we can add the annual values of those factors into an expanded growfactors.csv file.
Actually, what goes into the growfactors.csv file for a benefit variable in YEAR is F(YEAR)/F(YEAR-1). Then the logic of extrapolating the new benefit variables will be exactly the same as used in the existing version of Tax-Calculator. Currently, e02400 (social security benefits) is extrapolated in exactly this way, using a single-purpose factor. And the new benefit-variable growfactors will not be a problem when Tax-Calculator is using the puf.csv input file because applying a positive growfactor to zero benefits leaves those benefits at zero, which is what they should be when using the puf.csv input file.
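
The arithmetic of this proposal can be sketched as follows. The function names are hypothetical, and the sketch assumes F(2014) = 1, on the grounds that the 2014 benefits already add up to the 2014 administrative totals.

```python
def extrapolation_factors(benefit_2014, weights_by_year, targets_by_year):
    """F(YEAR) = T / R, where R uses 2014 benefits and the weights for YEAR."""
    factors = {}
    for year, target in targets_by_year.items():
        raw = sum(b * w for b, w in zip(benefit_2014, weights_by_year[year]))
        factors[year] = target / raw
    return factors


def grow_factors(factors):
    """Annual ratio F(YEAR) / F(YEAR-1), taking the base-year factor as 1."""
    out = {}
    previous = 1.0  # assumed F(2014)
    for year in sorted(factors):
        out[year] = factors[year] / previous
        previous = factors[year]
    return out
```

Multiplying the 2014 benefit by the annual ratios up to YEAR reproduces F(YEAR), so the weighted total hits T, and a record with a zero benefit stays at zero however many ratios are applied.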

Let's talk about the pros and cons of this simpler approach.

@Amy-Xu @hdoupe @andersonfrailey @MattHJensen

martinholmer commented 7 years ago

I fixed a mistake in my earlier comment on taxdata issue #106 by adding to that comment the following sentence:

Actually, what goes into the growfactors.csv file for a benefit variable in YEAR is F(YEAR)/F(YEAR-1).

@Amy-Xu @hdoupe @andersonfrailey @MattHJensen

Amy-Xu commented 7 years ago

Martin @martinholmer proposed a simpler routine for welfare extrapolation in the comment above. If I understand it right, I can see two big pros for this simpler routine. First, as Martin mentioned, it doesn't need as much work prior to TC stage. Potentially if any users have their own targets, they could replace the factors easily in TC, without turning to taxdata or C-TAM. Second, in TC, this routine doesn't need significant extra space to store factors, while the elaborate one would add a chunk of data for benefits of each year.

My biggest concern is about participation. It seems this simpler method would peg the participation growth rate to tax unit growth rate, while I have always assumed total number of participation is quite important for C-TAM, and presumably for extrapolation as well. But I have never confirmed it with anyone. Would love to hear input from @MattHJensen regarding this issue.
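
A tiny illustration of that concern, using hypothetical helper names: a uniform factor never changes which records have a positive benefit, so the weighted participant count can only move with the weights.

```python
def weighted_participants(benefit, weight):
    """Weighted count of records with a positive benefit."""
    return sum(w for b, w in zip(benefit, weight) if b > 0)


def scale(benefit, factor):
    """Apply one uniform factor to every record's benefit."""
    return [b * factor for b in benefit]
```

If every weight grows by 3 percent from one year to the next, `weighted_participants` on the scaled benefits also grows by 3 percent, regardless of the factor applied to the dollar amounts.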

Regarding the workload for adding this extrapolation to TC, I don't see much difference programming-wise (maybe I'm not knowledgeable enough about the latest TC). Since this is a part of deploying cps.csv to TC, we have to add facilities in TC to read in a separate cps_weight.csv, cps_ratio.csv, I assume. It doesn't seem to me that the simpler routine would be significantly superior to the elaborate one, as each should just need a few extra lines of code.

martinholmer commented 7 years ago

@Amy-Xu said:

Martin proposed a simpler routine for welfare extrapolation in the comment above. If I understand it right, I can see two big pros for this simpler routine. First, as Martin mentioned, it doesn't need as much work prior to TC stage. Potentially if any users have their own targets, they could replace the factors easily in TC, without turning to taxdata or C-TAM. Second, in TC, this routine doesn't need significant extra space to store factors, while the elaborate one would add a chunk of data for benefits of each year.

My concern is not so much about the size of the extra "chunk of data", but the extra code that reads that extra data and then applies it only when using CPS input data.

@Amy-Xu continued:

My biggest concern is about participation. It seems this simpler method would peg the participation growth rate to tax unit growth rate, while I have always assumed total number of participat[ants] is quite important for C-TAM, and presumably for extrapolation as well. But I have never confirmed it with anyone. Would love to hear input from @MattHJensen regarding this issue.

Most of the results that come out of Tax-Calculator are dollar amounts. If you really want accurate beneficiary head counts, then you could always use Tax-Calculator to conduct a simulation for 2014.

I don't see how you can expect Tax-Calculator extrapolation to work differently for a handful of new benefit variables. It may be simplistic, but it is absolutely standard operating procedure in the tax simulation world, to extrapolate in the manner we already extrapolate social security benefits, e02400. This standard procedure does not change which filing units have zero and positive values of a variable. We are already in Tax-Calculator committed to this simplistic extrapolation of social security benefits and we can't change that.

In fact your complex method is likely to lead to unrealistic longitudinal results as you change filing unit participation from year to year in an ad hoc manner. In order to change program participation from year to year in an empirically plausible manner you would have had to use longitudinal data to estimate transition probabilities on and off each benefit program. Nobody is saying you should have done that. My point is that your method, which involves changing program participation without any guidance from longitudinal data, is very likely to introduce unrealistic patterns of program participation for a filing unit over the years after 2014.

If you want super-accurate beneficiary head counts after 2014, then you would have to do this outside of Tax-Calculator. But I think the notion that you are going to get super-accurate beneficiary counts in years after 2014 is a near fantasy given the subjectiveness of your extrapolation targets. Again, the subjectiveness of your extrapolation targets is not your fault. It is inherent in trying to forecast over a decade into the future what these programs are going to look like.

@Amy-Xu continued:

Regarding the workload for adding this extrapolation to TC, I don't see much difference programming-wise (maybe I'm not knowledgeable enough about the latest TC). Since this is a part of deploying cps.csv to TC, we have to add facilities in TC to read in a separate cps_weight.csv, cps_ratio.csv, I assume.

Well you've been away from Tax-Calculator for a long time, so you are forgetting that the names of those files are just arguments of the Records class constructor function. So, in fact, there is no extra work reading those files. It is your proposal that creates a different kind of file --- one that is not used when reading the PUF related input files --- that creates all the extra work.

@Amy-Xu concluded:

It doesn't seem to me that the simpler routine would be significantly superior to the elaborate one, as each should just need a few extra lines of code.

As you can see from my earlier comments, I beg to differ with that conclusion.

@Amy-Xu @hdoupe @andersonfrailey @MattHJensen

Amy-Xu commented 7 years ago

@martinholmer said:

I don't see how you can expect Tax-Calculator extrapolation to work differently for a handful of new benefit variables. It may be simplistic, but it is absolutely standard operating procedure in the tax simulation world

A UBI reform that involves removing all major welfare programs is actually not a common tax reform. In fact, connecting the welfare world to tax analysis is rarely done, let alone welfare extrapolation, and is very difficult since few people know what the tax-unit welfare distribution looks like given an individual or household welfare distribution. As you have probably seen in this working paper we released earlier this year, we care not only about the count of tax units in each income class, but also about the average number of people in each tax unit.

I have already acknowledged that the targets are not perfect; however, the progressive direction, in my opinion, is to see how we could improve those targets. Simply giving up participation targets may leave us stranded when the tax-unit counts or the individual distribution look nonsensical: we could have done a better job but didn't give it a go. I don't think it's best to go backward on the argument that the targets are flawed.

@martinholmer also said:

Well you've been away from Tax-Calculator for a long time, so you are forgetting that the names of those files are just arguments of the Records class constructor function. So, in fact, there is no extra work reading those files. It is your proposal that creates a different kind of file --- one that is not used when reading the PUF related input files --- that creates all the extra work.

I'm aware of that, and find it very convenient to plug in a new dataset without modifying any code. However, it seems to me CPS.csv is going to be one of the default options in TC, and eventually one of the default options in TaxBrain. I imagine we have to specify whether the input is PUF or CPS, which would require uploading cps_weights.csv and cps_ratios.csv to TC, and would require extra variables to label whether the input is PUF or CPS. This extra code would make integration into TB easier. Of course, my specialty is not webapp development and this might not be necessary. I would love to hear more thoughts on this. @MattHJensen

If the upcoming change to TC for UBI simulation seems absolutely unacceptable to you, I offer to execute the welfare extrapolation outside TC in a notebook, which I think is feasible, as long as it doesn't block the web application development at a later stage.

@MattHJensen @andersonfrailey @hdoupe

MattHJensen commented 7 years ago

@Amy-Xu said:

We have been collecting targets & historical data, and will keep updating this spreadsheet.

@Amy-Xu, do most programs, other than SSI, also have official projections for participation?

MattHJensen commented 7 years ago

Also, @Amy-Xu, have you tried running tax-calculator with cps.csv and the weights file produced by your and Hank's work?

Amy-Xu commented 7 years ago

@MattHJensen asked:

do most programs, other than SSI, also have official projections for participation?

Social Security, Medicare, Medicaid do, but SNAP and VB don't.

have you tried running tax-calculator with cps.csv and the weights file produced by your and Hank's work?

Not yet, but it isn't a weights file; it is an extrapolated benefit file that potentially would work in a similar way to the weights file, replacing the benefit column with a future year's extrapolated benefit. To make it work with TC, we will need to add a few lines of code.

Amy-Xu commented 7 years ago

It seems there's a consensus on how to proceed on this extrapolation routine issue per discussion in PR #1500 in TC. Closing.