PSLmodels / Tax-Calculator

USA Federal Individual Income and Payroll Tax Microsimulation Model
https://taxcalc.pslmodels.org

Enhance transparency and replicability for users w/o proprietary data #627

Closed MattHJensen closed 7 years ago

MattHJensen commented 8 years ago

The fact that tax-calculator relies on semi-proprietary data for many use-cases, namely the IRS public use file, gives us a wonderful opportunity to explore and implement solutions to the following question:

“For modeling outfits that rely on proprietary data, what supplementary information can they publish to enable replication and deep peer review?”

After settling on the answer, I envision us creating a directory of this supplementary information whenever we bump the version of tax-calculator or TaxData, and then making that directory available via a link from the TaxBrain results page and the READMEs of the repos.

One could argue that we don’t need to publish all of this since our code is open source; even where that might be true, it would be very helpful for others who don’t have open source models to produce these outputs, and part of our mission is to set an example.

Different classes of information that I have thought to include so far:

  1. Results of the model, using the proprietary data, for baseline and numerous example reforms.
    • This allows for a quick assessment of how the model works.
    • @amy-xu has taken a big step towards showing reform examples with her work here.
    • @martinholmer has done the same for the baseline with his work here.
  2. Dummy datasets and model results for those datasets.
    • This allows users to test our calculator.
    • @martinholmer and @gofroggyrun have made considerable progress on a dummy dataset and baseline results for the 22 Internet TAXSIM input variables and 28 intermediate variables.
    • We could also make the modified 1991 puf available and provide baseline results for it for all intermediate and final variables. @GoFroggyRun, could you look into this?
    • We don’t currently generate results for all years in the budget window, and we don’t demonstrate how results vary with different behavioral assumptions.
  3. Basic summary statistics for every input and calculated variable, using the proprietary data.
    • This allows users to intuitively understand the extrapolation as well as to try to match the key characteristics of our baseline data in their own dataset.
    • @amy-xu is working on this.
  4. A correlation matrix for all input data.
    • This would help users to impute any variables they are missing in their own data.
    • @amy-xu is working on this.
  5. A list of extrapolation targets and blowup factors for each variable in the dataset. Equations relevant to the extrapolation.
    • This would help users match our extrapolated dataset if they can construct a dataset that looks like ours in the base year.
    • @Amy-Xu is working on this.
  6. A list of all elasticities and behavioral adjustments applied.
    • We can improve the behavior doc strings and then link to readthedocs rendering.
  7. The equations and coefficients used for any imputed data.
    • We don’t impute any data yet, but @gofroggyrun is currently working on an imputation project.
  8. This one isn’t really a class of information to provide, but I think we should establish a policy of running code for people who don’t have access to the proprietary data after they have tested their code on dummy data.
    • The right way to do this might be to have a special tag for pull requests that aren’t meant to be merged but are meant to elicit some results on the puf.
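As a rough illustration of items 3 to 5 above, here is a minimal sketch of the kind of supplementary output that could be published: per-variable summary statistics, a correlation matrix, and a dataset scaled by blowup factors. Everything in it is an assumption made for illustration: the variable names, the dummy values, the blowup factor, and the helper functions `summarize`, `correlation_matrix`, and `blow_up` are hypothetical and are not taken from Tax-Calculator, taxdata, or puf.csv.

```python
import math
import statistics

# Hypothetical dummy records: names and values are invented for illustration
# and do not come from puf.csv or any other real dataset.
DUMMY = {
    "wages": [30000.0, 52000.0, 75000.0, 120000.0],
    "interest": [100.0, 250.0, 400.0, 900.0],
    "dividends": [0.0, 500.0, 200.0, 3000.0],
}


def summarize(data):
    """Return basic summary statistics for each variable (item 3)."""
    return {
        name: {
            "count": len(vals),
            "mean": statistics.fmean(vals),
            "stdev": statistics.stdev(vals),  # sample standard deviation
            "min": min(vals),
            "max": max(vals),
        }
        for name, vals in data.items()
    }


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(data):
    """Return a nested dict of pairwise correlations (item 4)."""
    names = list(data)
    return {r: {c: pearson(data[r], data[c]) for c in names} for r in names}


def blow_up(data, factors):
    """Scale each variable by its blowup factor, defaulting to 1.0 (item 5)."""
    return {
        name: [v * factors.get(name, 1.0) for v in vals]
        for name, vals in data.items()
    }


if __name__ == "__main__":
    for name, stats in summarize(DUMMY).items():
        print(name, stats)
    print(correlation_matrix(DUMMY))
    # The 1.05 factor is a made-up example, not a published target.
    print(blow_up(DUMMY, {"wages": 1.05}))
```

A user without access to the proprietary data could compare tables like these against the same statistics computed on their own dataset, which is the use case items 3 and 4 describe.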

I am very much looking forward to feedback and suggestions.

cc @martinholmer, @feenberg, @gofroggyrun, @talumbau, @amy-xu, @rickecon, @jdebacker, @kerkphil, @johnfohare

martinholmer commented 7 years ago

What is the status of issue #627 (which was opened almost nine months ago on 02-Mar-2016)? There was some immediate progress (via pull request #631), but there has been no discussion of this issue since then. Is there any reason to keep issue #627 open?

@MattHJensen @feenberg @Amy-Xu

Amy-Xu commented 7 years ago

@martinholmer I think 1-4 have already been included in the TC repo in April this year, and 5 is in tax-data repo. Not quite sure what the plan is for 6-8 though.

martinholmer commented 7 years ago

@Amy-Xu said about issue #627:

I think 1-4 have already been included in the [Tax-Calculator] repo in April this year, and 5 is in taxdata repo. Not quite sure what the plan is for 6-8 though.

@MattHJensen, Are items 6-8 in issue #627 still on the project to-do list? If any are still active, should they have their own (more specific) issue?

martinholmer commented 7 years ago

@Amy-Xu said in November 2016 about issue #627:

I think 1-4 have already been included in the TC repo in April this year, and 5 is in tax-data repo. Not quite sure what the plan is for 6-8 though.

@MattHJensen, Are items 6-8 in issue #627 still on the project to-do list? If any are still active, should they have their own (more specific) issue?

MattHJensen commented 7 years ago
  6. is already satisfied by docs/index.html.
  7. is not applicable yet, but @GoFroggyRun knows that it is relevant for his work on puf.csv imputations, and @andersonfrailey knows that it is relevant for his work on building the cps.csv from scratch.
  8. is not applicable until we are publicizing the cps.csv file, and I believe it will be obvious if it is needed.

Closing this issue.

@martinholmer @Amy-Xu