PSLmodels / Tax-Calculator

USA Federal Individual Income and Payroll Tax Microsimulation Model
https://taxcalc.pslmodels.org

Review Tax-Calculator speed #2376

Open MattHJensen opened 4 years ago

MattHJensen commented 4 years ago

Some users have suggested that Tax-Calculator simulations take an inconveniently long time, particularly for data files with many records.

It would be helpful for someone to conduct an audit of Tax-Calculator's performance and attempt to identify concrete options for speed improvements.

jdebacker commented 4 years ago

@MattHJensen This is a great idea. When doing this, I wonder whether there are changes that could make the code clearer while staying at least as fast. For example, the "jitted" functions are a bit opaque, and DataFrame columns are repeatedly pulled out into arrays for different operations. Many of these operations seem like they could be done while keeping the columns in a DataFrame. If normal DataFrame operations aren't as fast as the current jitted operations, one could look at using Dask DataFrames to parallelize them easily, as sketched below.
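As a rough illustration of the Dask idea (a minimal sketch, not existing Tax-Calculator code; the toy calculation, column names, and partition count are chosen only for this example):

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Build a large DataFrame with two Tax-Calculator-style input columns
# (e00200 = wages, e00900 = self-employment income).
df = pd.DataFrame({
    "e00200": np.random.uniform(0, 200_000, 1_000_000),
    "e00900": np.random.uniform(0, 50_000, 1_000_000),
})

# Split the rows into partitions that Dask can work on in parallel.
ddf = dd.from_pandas(df, npartitions=8)

# An ordinary vectorized expression is applied partition by partition.
ddf["earned"] = ddf["e00200"] + 0.9235 * ddf["e00900"]

# Gather the result back into a single pandas DataFrame.
result = ddf.compute()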

MaxGhenis commented 4 years ago

Has there been a comparison of the numba/jit approach with just vectorizing everything and staying in pandas?

For example, the EI_PayrollTax function uses jit/numba, though it would be straightforward to vectorize in pandas: replace min and max with np.minimum and np.maximum, respectively.

At a gut level I'd expect vectorizing to be faster, since numpy and pandas are at their core about optimizing those kinds of operations. Are there any functions that this wouldn't work for?
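For concreteness, here is a minimal sketch of the comparison being suggested; the payroll-tax formula below is deliberately simplified and is not the real EI_PayrollTax logic:

import numpy as np
from numba import njit

@njit
def payroll_tax_jit(wages, cap, rate):
    # Scalar-style loop compiled by numba, using the builtin min().
    out = np.empty_like(wages)
    for i in range(wages.shape[0]):
        out[i] = rate * min(wages[i], cap)
    return out

def payroll_tax_vectorized(wages, cap, rate):
    # Same calculation expressed with np.minimum on whole arrays.
    return rate * np.minimum(wages, cap)

wages = np.random.uniform(0, 300_000, 1_000_000)
assert np.allclose(payroll_tax_jit(wages, 137_700.0, 0.124),
                   payroll_tax_vectorized(wages, 137_700.0, 0.124))

Timing the two versions (e.g. with %timeit) on realistic record counts would show whether the jit machinery actually buys anything over plain numpy for this kind of elementwise work.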

hdoupe commented 4 years ago

I think this is a good time to look into Tax-Calculator performance. Much of the performance-optimized code was written several years ago, and I'm sure that there have been improvements in scientific computing technology since then (e.g. now we have Dask).

One thing that might be helpful for people who are digging into this is the Jupyter notebook %prun command:

[screenshot: %prun output in a Jupyter notebook]

Using a profiler will give a better idea of where Tax-Calculator is spending its time before you start tinkering with different approaches/enhancements. Jake Vanderplas has a blog post on profiling that may be helpful.
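For anyone who prefers a plain script to a notebook, here is a rough equivalent using the standard-library profiler; the Policy/Records/Calculator setup below follows the usual Tax-Calculator pattern, so substitute your own Records constructor if you use a different input file:

import cProfile
import pstats

from taxcalc import Calculator, Policy, Records

# Build a Calculator from current-law policy and the bundled CPS data.
calc = Calculator(policy=Policy(), records=Records.cps_constructor())

# Profile one full run of the tax calculations.
profiler = cProfile.Profile()
profiler.enable()
calc.calc_all()
profiler.disable()

# Show the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)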

hdoupe commented 4 years ago

Also, Tax-Brain parallelizes Tax-Calculator computations by splitting them up by year. Perhaps users who need faster sims could try running Tax-Calculator through Tax-Brain. This may not give them the flexibility that they need, though. A rough sketch of the by-year idea is below.
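This is only an illustration of the idea, not Tax-Brain's actual implementation, and the years and output variable are chosen arbitrarily:

from concurrent.futures import ProcessPoolExecutor

from taxcalc import Calculator, Policy, Records

def itax_for_year(year):
    # Each worker process builds its own Calculator, advances it to the
    # requested year, and runs the full calculation for that year only.
    calc = Calculator(policy=Policy(), records=Records.cps_constructor())
    calc.advance_to_year(year)
    calc.calc_all()
    return year, calc.weighted_total("iitax")

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        for year, itax in pool.map(itax_for_year, range(2020, 2030)):
            print(year, itax)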

donboyd5 commented 4 years ago

Today I ran into a surprise: writing Tax-Calculator's full dump output to a csv file took about 80 seconds.

hdoupe commented 4 years ago

> Next, I read in the full dump csv file and wrote it as a second csv file, in R. The write operation took only 6 seconds. Thus, it appears that Tax-Calculator took about 13x as long to write its dump output as a csv file as R took to write the identical csv file (80 / 6 ~= 13).

@donboyd5 how long does it take to do a similar operation with this file in Python? i.e.

import time
import pandas as pd

df = pd.read_csv("yourfile.csv")

s = time.time()
df.to_csv("yourfile2.csv")
f = time.time()

print("elapsed time: ", f - s)

I'm curious whether Python and pandas are the bottleneck, or whether the bottleneck is in Tax-Calculator itself.

donboyd5 commented 4 years ago

It looks to me like it might be a little bit of both -- Pandas 42 secs, R 6 secs. I did some googling around yesterday and several threads suggested that Pandas to_csv is a little slow.

Here it is in Python:

[screenshot: pandas read_csv/to_csv timing, with the to_csv write taking about 42 seconds]

Now in R:

[screenshot: R csv write timing, about 6 seconds]
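If pandas' csv writer really is the slow part, one possible workaround is to route the write through pyarrow; this is just a sketch under the assumption that pyarrow is installed, and it is not something Tax-Calculator currently does:

import time

import pandas as pd
import pyarrow as pa
import pyarrow.csv as pacsv

df = pd.read_csv("yourfile.csv")

# Convert to an Arrow table and let pyarrow write the csv, which is
# typically much faster than DataFrame.to_csv for large files.
s = time.time()
pacsv.write_csv(pa.Table.from_pandas(df), "yourfile_arrow.csv")
print("pyarrow write:", time.time() - s, "seconds")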

jdebacker commented 2 years ago

@MattHJensen did PR #2570 accomplish this for you? If not, what else would you like to see regarding profiling of the code?