PSLmodels / Tax-Calculator

USA Federal Individual Income and Payroll Tax Microsimulation Model
https://taxcalc.pslmodels.org
Other
255 stars 154 forks source link

TaxBrain and Tax-Calculator results are not the same in 0.7.3 #1119

Closed martinholmer closed 7 years ago

martinholmer commented 7 years ago

For a simple income and payroll tax reform, the results generated by TaxBrain and Tax-Calculator are not the same for income taxes. I get the same results on my local computer using either a custom Python script (shown below) or the inctax.py command-line interface to taxcalc. And I get the same results from TaxBrain using either the JSON reform file upload capability or hand-entered parameters.

Notice that payroll taxes under the reform are the same in TaxBrain and Tax-Calculator, but the income tax revenues are lower in TaxBrain than in Tax-Calculator by about 1.1 percent or about $19.1 billion.

I'm not sure where the problem is. Can others replicate my local results on their computers?

Here is the JSON reform file (file0.json):

{
    // 2022 RESULTS from taxcalc-inctax and taxbrain-upload version 0.7.3
    // INCOME TAX ($B)   1711.31            1692.2
    // PAYROLL TAX ($B)  1485.37            1485.3
    "policy": {
        "_SS_Earnings_c": // social security (OASDI) maximum taxable earnings
        {"2018": [400000],
         "2019": [500000],
         "2020": [600000]
        },
        "_II_em": // personal exemption amount
        {"2018": [8000]
        }
    },
    "behavior": {
    },
    "consumption": {
    },
    "growth": {
    }
}

Here are the taxcalc-inctax results:

$ python ../../inctax.py puf.csv 2022 --blowup --weights --reform file0.json
You loaded data for 2009.
Your data have been extrapolated to 2022.
$ awk '{r+=$4*$29}END{print r*1e-9}' puf-22.out-inctax-file0
1711.31
$ awk '{r+=$6*$29}END{print r*1e-9}' puf-22.out-inctax-file0
1485.37

And here is the custom Python script that generates the same local results:

import os
import sys
CUR_PATH = os.path.abspath(os.path.dirname(__file__))
sys.path.append(os.path.join(CUR_PATH, '..', '..'))
from taxcalc import *

def print_results(calc):
    record_columns = ['s006', '_iitax', '_payrolltax']
    out = [getattr(calc.records, col) for col in record_columns]
    dfx = pd.DataFrame(data=np.column_stack(out), columns=record_columns)
    wsum = weighted_sum(dfx, '_iitax') * 1.0e-9
    print "ITAXES($b)={:9.3f}".format(wsum)
    wsum = weighted_sum(dfx, '_payrolltax') * 1.0e-9
    print "PTAXES($b)={:9.3f}".format(wsum)

polref, _, _, _ = Calculator.read_json_reform_file('file0.json')
pol = Policy()
pol.implement_reform(polref)
calc = Calculator(policy=pol, records=Records())
calc.advance_to_year(2022)
calc.calc_all()
print_results(calc)

which produces the following results:

ITAXES($b)= 1711.286
PTAXES($b)= 1485.362

@MattHJensen @feenberg @talumbau @Amy-Xu @GoFroggyRun @andersonfrailey @zrisher

talumbau commented 7 years ago

When I run with taxcalc version 0.7.3 on my Mac with the latest PUF in my possession (from November 16 of size 53246969), I get the following results:

You loaded data for 2009.
Your data include the following unused variables that will be ignored:
  cmbtp_standard
  cmbtp_itemizer
Your data have been extrapolated to 2013.
ITAXES($b)= 1692.223
PTAXES($b)= 1485.362

When I go to the Reform tab on the results page when running this reform on the latest TaxBrain, I see these results:

screen shot 2017-01-03 at 1 04 06 pm

so I get the same answer as TaxBrain.

talumbau commented 7 years ago

I'm guessing from these lines:

CUR_PATH = os.path.abspath(os.path.dirname(__file__))
sys.path.append(os.path.join(CUR_PATH, '..', '..'))

you are not running on the release, but since it's only a few commits behind master, I don't think that's the issue. I wonder if your PUF version is more recent. Can you say how many bytes your puf.csv file is in size?

talumbau commented 7 years ago

this is the link to my TaxBrain job: http://www.ospc.org/taxbrain/9380/

martinholmer commented 7 years ago

@talumbau said:

When I run with taxcalc version 0.7.3 on my Mac with the latest PUF in my possession (from November 16 of size 53246969) [I get the same results as on TaxBrain].

Not sure what puf.csv file you have because it isn't like any puf.csv file that has been distributed over the last few months. Here is a Mac listing of the recent versions I have on my computer:

-rw-r--r--@ 1 mrh  staff   54674801 Jul 13 16:32 puf-2016-07-13.csv
-rw-r--r--  1 mrh  staff   52839666 Sep 14 12:20 puf-2016-09-14.csv
-rw-r--r--@ 1 mrh  staff   52807336 Oct 28 16:06 puf-2016-10-28.csv
-rw-r--r--  1 mrh  staff   50953138 Nov 21 15:38 puf-2016-11-21.csv
-rw-r--r--@ 1 mrh  staff   50953138 Nov 21 15:41 puf.csv

Notice none of them contain 53246969 bytes.

When I go to the OSPC developer's private Google Doc and download the most recent version available (the one dated 11.21.2016 for taxcalc #1077 and taxdata #48) it is exactly the same size as the last two in my listing above.

Where did you get your puf.csv with 53246969 bytes? It doesn't correspond to anything in the Google Doc list of puf.csv releases.

@MattHJensen @Amy-Xu @andersonfrailey

andersonfrailey commented 7 years ago

Using @martinholmer's JSON file I was able to reproduce his results on my computer as well. I'm also using the PUF file uploaded on November 21st of size 50953138.

talumbau commented 7 years ago

@martinholmer I forwarded you the email containing the link to the PUF I used. The email was sent by @andersonfrailey on November 7th. It seems clear this is not the best PUF to use. Is that correct?

andersonfrailey commented 7 years ago

@talumbau I also sent out an email on November 21st after TaxData PR 48 was merged with a link to the latest PUF. I will forward that email on to you.

talumbau commented 7 years ago

I can confirm that using the PUF from november 21st produces the same results that @martinholmer and @andersonfrailey report. All of the worker nodes for TaxBrain have the PUF that I was using from the November 7th email. Should those worker nodes be updated with the PUF from November 21?

This issue raises the urgency level of the issue @PeterDSteinberg raised on taxdata: https://github.com/open-source-economics/taxdata/issues/53

martinholmer commented 7 years ago

@talumbau said:

I can confirm that using the PUF from november 21st produces the same results that @martinholmer and @andersonfrailey report. All of the worker nodes for TaxBrain have the PUF that I was using from the November 7th email. Should those worker nodes be updated with the PUF from November 21?

Yes.

@talumbau also said:

This issue raises the urgency level of the issue @PeterDSteinberg raised on taxdata: open-source-economics/taxdata#53

But there is still no answer to the question posed by @zrisher thirteen days ago.

Why not always use the latest puf.csv listed in the private Google Doc table?

martinholmer commented 7 years ago

The exchange of information in Tax-Calculator issue #1119 suggests several ideas for improving the way we distribute the private puf.csv file.

First, the description of each version of the puf.csv file should include the size of the file in bytes. And it would probably be better to distribute not a bare puf.csv file but a zip file that contains the puf.csv file. For example, the current version could be in the pufcsv-2016-11-21.zip file.

Second, we need a better method for distributing this file.

One idea is to create a conda package containing the puf.csv file as discussed in taxdata issue 53. However, as @zrisher said:

My biggest concern is this:

How would we declare the dependency in tax-calculator in such a way that contributors without access to that package could still build their environment?

Right now puf.csv is an optional dependency, but Conda doesn't understand such things.

... [snip] ...

Hoping someone has an easy answer to this.

Given that nobody has offered an answer in the two weeks since this concern was raised suggests there might not be an "easy answer".

What about a different approach to distributing the puf.csv file that relies not on conda-package technology but rather uses a GitHub private repository? This private repo could be very simple: (a) the README.md file would contain the same information as currently in the Google Doc table (plus the byte size of the file) and (b) the puf.csv file. The GitHub Release mechanism could be used to package each version of the puf.csv file in a zip or tar.gz file. Each release of a new puf.csv file could trigger an email to each person who has access to the private repo. This is important because it would seem that the current problem stems (in part) from emails being sent to only some of those who use the puf.csv file.

Perhaps there is a better approach. Can anybody suggest a better approach? Certainly some kind of improvement in necessary because the development team is wasting too much time on this matter and we risk disappointing public users of TaxBrain with mistakes like the one discussed in issue #1119.

@MattHJensen @talumbau @feenberg @Amy-Xu @andersonfrailey @zrisher @PeterDSteinberg

zrisher commented 7 years ago

Using git is an interesting idea. @martinholmer's suggestion helps with the change notification to contributors, but doesn't automate the distribution. It could supplement the suggestion from https://github.com/open-source-economics/taxdata/issues/53 that we distribute the puf data as a conda package.

Combining those ideas, what if we:

So there's no significant change for contributors without access and a single additional step for contributors with access. It won't automatically ensure your package version is correct, but it will throw an error until you make it so. For automated environments, you'll probably always be throwing away your pre-deploy environment and pulling the latest anyway.

martinholmer commented 7 years ago

@zrisher, Thanks for contributing to the conversation in #1119 about how to improve the methods we use to distribute the private puf.csv file. Let me comment on or ask questions about some of your points.

(1) Automation of puf.csv distribution:

Using git is an interesting idea. @martinholmer's suggestion helps with the change notification to contributors, but doesn't automate the distribution.

I think we need an explicit conversation about why we should "automate the distribution" of the puf.csv file. For me, there is virtually no value to automation; in fact, the overhead involved in any automation mechanism makes the benefits minus the costs of automation negative. What is the benefit of automation? Who benefits? And who is inconvenienced by automation?

(2) Data format of the puf distribution:

start a private git repo with the puf.csv data and the python package providing it as a dataframe

Not sure I understand your point. Are you suggesting that the private puf data be a binary Pandas DataFrame rather than an ASCII CSV-formatted file? If so, I think this is not a good idea. Yes, it is true that the Records class constructor can accept a DataFrame (rather than a CSV file name). But changing the file format to a DataFrame makes it impossible to analyze the puf data using command-line tools. There is plenty of this sort of work that goes on every day. At this stage of the project, I think we should change the file format of the puf data only if the benefits are quite substantial and exceed by a wide margin the costs of changing the file format. Can that case be made?

Or, if I have completely misunderstood your suggestion, please set me straight on your thinking.

(3) Process of puf distribution:

So there's no significant change for contributors without [puf] access and a single additional step for contributors with access. It won't automatically ensure your package version is correct, but it will throw an error until you make it so.

What does "no significant change" mean? Does it mean absolutely no change? If not, what would be the changes for those without puf access?

For those with puf access, why impose additional costs on them if the puf distribution process "won't automatically ensure your package version is correct"? What's the point of it all? Why not notify each user of the new puf on the private GitHub repo and let them download and install it on their computer(s)?

The problem identified in issue #1119 seems to be caused primarily by a lack of communication within the development team and the absence of an easily accessible (but private) place to store new versions of the puf data. I'm not yet convinced that the #1119 problem has much to do with a lack of automation.

@MattHJensen @feenberg @Amy-Xu @andersonfrailey @codykallen @talumbau @PeterDSteinberg

feenberg commented 7 years ago

On Wed, 4 Jan 2017, Martin Holmer wrote:

The exchange of information in Tax-Calculator issue #1119 suggests several ideas for improving the way we distribute the private puf.csv file.

First, the description of each version of the puf.csv file should include the size of the file in bytes. And it would probably be better to distribute

The number of bytes will depend on the the line-ending convention (2 bytes for windows, 1 byte for Mac/Linux) - probably the number of records is a more useful indicator of completeness.

What about a different approach to distributing the puf.csv file that relies not on conda-package technology but rather uses a GitHub private repository?

Is this the CPS derived file? Then why private?

dan feenberg

martinholmer commented 7 years ago

@feenberg said:

The number of bytes will depend on the the line-ending convention (2 bytes for windows, 1 byte for Mac/Linux) - probably the number of records is a more useful indicator of completeness.

Yes, end-of-line is different on Windows, but one puf.csv file (made on a Mac I think) seems to work on all operating systems. Number of records has been the same for many puf.csv versions, so this does little to distinguish versions.

What about a different approach to distributing the puf.csv file that relies not on conda-package technology but rather uses a GitHub private repository?

Is this the CPS derived file? Then why private?

No, the puf.csv file includes the private IRS-SOI PUF records plus some CPS-derived non-filer records.

martinholmer commented 7 years ago

In order to make an intelligent decision about how to improve puf.csv distribution among Tax-Calculator developers, it is important to understand exactly why the mixup identified in issue #1119 occurred. The point of determining this is not to assign blame, but to understand what needs to be improved in the puf.csv distribution process.

My earlier comment in the #1119 discussion has already identified the need to make email notifications of new puf.csv versions more systematic and the need to have a better private place to hold the puf.csv versions. It may well be that the #1119 mixup occurred because of shortcomings in these two areas.

But there seems to be another factor involved in the #1119 mixup. It would seem as if TaxBrain developers are not running the requires_pufcsv tests that are part of the Tax-Calculator test suite. The evidence for this is that at least one of those tests would have failed because they were mistakenly using an old, out-dated version of the puf.csv file. The evidence for this is presented below. The moral of the story is that even if we somehow improve the puf.csv distribution process, there is no substitute for running the full test suite described in the TESTING.md file.

Here is the evidence that suggests that the full Tax-Calculator test suite has not been executed by TaxBrain developers since sometime in November, which is when the puf.csv distribution mixup occurred. With TaxBrain 0.7.3 using the old puf.csv file, current-law-policy result for aggregate individual income tax revenue in 2022 is $1830.7 billion (see these TaxBrain results for details). But if the requires_pufcsv tests had been run by the TaxBrain developers, they would have realized they were using the wrong puf.csv file because the latest version produces aggregate individual income tax revenue for 2022 of $1849.6 billion (see the pufcsv_agg_expect.txt file. This is a difference of almost 19 billion dollars, which would have alerted them to the presence of a major problem.

@MattHJensen @talumbau @PeterDSteinberg @Amy-Xu @andersonfrailey @zrisher

PeterDSteinberg commented 7 years ago

I have created a private Anaconda package taxpuf that will help with Jenkins continuous integration testing. You all can try it out and see if you want to transition the distribution of puf.csv.gz as in the taxpuf private package I have created. I realized we can gradually transition to this means of puf.csv.gz distribution, using it first internally for developers and Jenkins, then later seeing how Tax-Calculator and other packages should interact with taxpuf exactly.

Get the Anaconda token file from me

I'll send you all an email with this token. Save it at ~/.ospc_anaconda_token

Conda install with token

$ export TOKEN=$(cat ~/.ospc_anaconda_token)
$ conda install -c https://conda.anaconda.org/t/${TOKEN}/opensourcepolicycenter taxpuf

Simple usage

Write the most recent puf.csv.gz to the local directory

write-latest-taxpuf && ls puf.csv.gz

Now run Tax-Calculator, deploy, etc repositories like usual.

Optional Usage - latest puf file as a dataframe

import taxpuf
PUF = taxpuf.get_puf() # the latest one
PUF.head() # is a dataframe loaded from gzip

More examples

# all PUF dataframes
all_pufs = {k: taxpuf.get_puf(k) for k in taxpuf.PUF_META}
# Get an old PUF dataframe
old_puf = taxpuf.get_puf('puf102716.csv.gz')
# write the latest puf as ./puf.csv.gz
taxpuf.write_puf()

cc @MattHJensen @talumbau @PeterDSteinberg @Amy-Xu @andersonfrailey @zrisher @martinholmer

martinholmer commented 7 years ago

@PeterDSteinberg said:

I'll send you all an email with this token. Save it at ~/.ospc_anaconda_token

I haven't received the email you promised.

PeterDSteinberg commented 7 years ago

Sent.

martinholmer commented 7 years ago

@PeterDSteinberg, Thanks for the taxpuf package and the secret token you sent via email. I was able to follow your "Conda install with token" steps and the "Simple usage" commands to unzip the latest puf.csv file. I have a few questions about the new taxpuf package.

(1) If we ever got a developer who was working on Windows, how would the install work (with export being a Unix command)? And would entering write-latest-taxpuf at the Windows command line work?

(2) How would a member of the development team, or anyone else with access to the OSPC-developed puf.csv file, be notified that there is a new version of the puf.csv file?

(3) How would a notified user update? Using the conda update taxpuf command?

The puf.csv distribution mechanism of the taxpuf Anaconda package seems fine to me, but the problems of unsystematic notification and less-than-comprehensive testing that caused the #1119 mixup are not solved by the taxpuf package. Do you have any thoughts on how those problems could be addressed?

@MattHJensen @talumbau @feenberg @Amy-Xu @andersonfrailey @zrisher @codykallen

PeterDSteinberg commented 7 years ago

@martinholmer Here are answers to each of your questions:

1) People who don't want to work with environment variables or hidden files, can just cut and paste the token into the channel URL: conda install -c https://conda.anaconda.org/t/ABCDEF/opensourcepolicycenter taxpuf, replacing ABCDEF in that example with the token. If you get tired of remembering the -c argument needed in any case, you can do a one-time conda config command to always allow that channel:

conda config --add channels https://conda.anaconda.org/t/ABCDEF/opensourcepolicycenter

(again replacing ABCDEF)

1b) Yes, Windows users can do the write-latest-taxpuf console entry point. It is a platform-independent alias to taxpuf.write_puf() function.

2) This taxpuf solution does not yet solve notification, but it could. taxpuf includes a script build_puf_package.py to build taxpuf for all Python's and architectures as a conda package. This build_puf_package.py script could also email notify the team of the new version of taxpuf released.

3) Updates would be with conda update taxpuf. It is also necessary to give the -c argument as in conda install, or to use the conda config --add channels idea mentioned above.

I think if Tax-Calculator and other repos eventually used from taxpuf import get_puf;puf=get_puf() to get the public use data, it would be less likely to have puf confusion. I say that because taxpuf as a conda dependency could be written into deploy scripts, even if not a hard requirement of Tax-Calculator, and devs are more likely to remember conda updates than updates of a single file (puf.csv.gz) that has to be maintained in a custom way with scp.

martinholmer commented 7 years ago

@PeterDSteinberg said:

Yes, Windows users can do the write-latest-taxpuf console entry point. It is a platform-independent alias to taxpuf.write_puf() function.

This taxpuf solution does not yet solve notification, but it could. taxpuf includes a script build_puf_package.py to build taxpuf for all Python's and architectures as a conda package. This build_puf_package.py script could also email notify the team of the new version of taxpuf released.

Updates would be with conda update taxpuf [command]. It is also necessary to give the -c argument as in conda install, or to use the conda config --add channels idea mentioned above.

Thanks for your answers to my questions. This is nice work: a clear improvement over the current puf.csv distribution method. Your answers got me thinking about other questions and a suggestion.

(1) Where is the source code (including the build_puf_package.py script) behind the taxpuf package?

(2) I think it would be a high priority to add the automatic email notification feature into the build_puf_package.py script as you suggest. @MattHJensen, what do you think about the priority of this task?

(3) Once users get the email notification, they would execute this command:

conda upgrade -c https://conda.anaconda.org/t/${TOKEN}/opensourcepolicycenter taxpuf

Is that correct?

@talumbau @Amy-Xu @andersonfrailey @zrisher @feenberg @GoFroggyRun @codykallen

PeterDSteinberg commented 7 years ago

@martinholmer Answers:

1) The build_puf_package.py and wrapper around taxpuf is not yet in version control. We can either A) make a separate private Anaconda package for it, or B) make a private github repo for it. Thoughts? In the meantime, I will email a zip that includes all the code we would be putting in A) or B).

2) We could add email notification to the build script (build_puf_package.py), or alternatively we could make the private github repo B) mentioned above and rely on its notification system for releases.

3) Correct. This is how you update the package once you know you have to (and there is no harm/slow-down in running it if no updates are available): conda upgrade -c https://conda.anaconda.org/t/${TOKEN}/opensourcepolicycenter taxpuf

martinholmer commented 7 years ago

@PeterDSteinberg, Thanks for your speedy and complete answers to my questions in issue #1119.

You should be asking @MattHJensen about this, but my initial thought is that putting the taxpuf source code in a private GitHub repository and using that repository's capabilities to notify users of a new puf.csv version would be the best approach (assuming GitHub repos can do that sort of thing). This approach is more consistent with the way OSPC in now working, so there would be more staff familiarity with this (B) approach.

@talumbau

MattHJensen commented 7 years ago

+1 for the private GH repo and email notifications. The repo should go in the opensourcepolicycenter GH organization rather than open-source-economics.
@PeterDSteinberg

MattHJensen commented 7 years ago

@talumbau, do you have an estimate for when you'll be able to update TaxBrain to the nov-21 puf.csv?

talumbau commented 7 years ago

I can do this by end of day today. I'll update this issue when it's done.

talumbau commented 7 years ago

The PUF has been updated on the worker nodes and the services restarted. This taxbrain job:

http://www.ospc.org/taxbrain/9428/

confirms that TaxBrain now gives the same results as taxcalc 0.7.3.

talumbau commented 7 years ago

@martinholmer said:

But there seems to be another factor involved in the #1119 mixup. It would seem as if TaxBrain developers are not running the requires_pufcsv tests that are part of the Tax-Calculator test suite. The evidence for this is that at least one of those tests would have failed because they were mistakenly using an old, out-dated version of the puf.csv file. The evidence for this is presented below. The moral of the story is that even if we somehow improve the puf.csv distribution process, there is no substitute for running the full test suite described in the TESTING.md file.

This is a good idea and indeed the requires_pufcsv tests are not run after install of the package. Software deployment should minimize human steps and have failsafes such that, if a human does not do the right thing, it is impossible to proceed. Because we now permit an old PUF to run with a newer version of Tax-Calculator, we need such a failsafe, so that if the human has made an error, the deployment process will error out. @PeterDSteinberg is re-working the deployment scripts for the worker nodes. Perhaps this step "run requires_pufcsv tests" should come just before "turn on all services". @PeterDSteinberg what do you think of this idea?

martinholmer commented 7 years ago

@talumbau said:

The PUF has been updated on the worker nodes and the services restarted.

Thanks. Your ideas for more comprehensive pre-deployment TaxBrain testing sound good.

martinholmer commented 7 years ago

Issue #1119 seems to be resolved with TaxBrain now using the correct version of the puf.csv file and plans for improving pre-deployment TaxBrain testing being developed. Closing #1119.

zrisher commented 7 years ago

@martinholmer @talumbau @PeterDSteinberg @MattHJensen It could be useful to document the aforementioned setup and update steps for the taxpuf dependency within this repo or the taxpuf repo.