Open MattHJensen opened 4 years ago

Some users have suggested that Tax-Calculator simulations are inconveniently long, particularly for data files with many records. It would be helpful for someone to conduct an audit of Tax-Calculator's performance and attempt to identify concrete options for speed improvements.
@MattHJensen This is a great idea. While doing this, I wonder whether there are changes that could make the code clearer while staying at least as fast. For example, the "jitted" functions are a bit opaque, and there are places where columns of DataFrames are repeatedly pulled out into arrays for different operations. Many of these look like they could be done while keeping the columns in a DataFrame. If normal DataFrame operations aren't as fast as the current jitted operations, one could look at using dask DataFrames to parallelize the operations easily.
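A minimal sketch of both ideas, assuming made-up data (e00200 is Tax-Calculator's wage variable, used here only for flavor; the 50,000 cap is purely illustrative):

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Illustrative data: one million records with a single wage column.
df = pd.DataFrame({"e00200": np.random.default_rng(0).uniform(0, 200_000, 1_000_000)})

# Plain pandas: the operation stays in the DataFrame; no arrays are pulled out.
df["capped"] = np.minimum(df["e00200"], 50_000.0)

# Dask: the same column operation, computed in parallel across partitions.
ddf = dd.from_pandas(df, npartitions=8)
ddf["capped"] = ddf["e00200"].clip(upper=50_000.0)
result = ddf.compute()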
Has there been a comparison of the numba/jit approach with just vectorizing everything and staying in pandas? For example, the EI_PayrollTax function uses jit/numba, though it would be straightforward to vectorize in pandas: replace min and max with np.minimum and np.maximum, respectively.

At a gut level I'd expect vectorizing to be faster, since numpy and pandas are at their core about optimizing those kinds of operations. Are there any functions that this wouldn't work for?
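To make the proposed transformation concrete, here is a deliberately simplified sketch; it is not the real EI_PayrollTax logic, and the cap value is illustrative rather than a real parameter lookup:

import numpy as np
import pandas as pd

SS_EARNINGS_CAP = 137_700.0  # illustrative value only

# Per-record style, as in a jitted function applied row by row:
def taxable_earnings_scalar(wage, cap):
    return min(wage, cap)

# Vectorized style: one numpy call over the whole column, no Python-level loop.
def taxable_earnings_vectorized(wages: pd.Series, cap: float) -> pd.Series:
    return np.minimum(wages, cap)

wages = pd.Series([30_000.0, 150_000.0, 90_000.0])
print(taxable_earnings_vectorized(wages, SS_EARNINGS_CAP))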
I think this is a good time to look into Tax-Calculator performance. Much of the performance-optimized code was written several years ago, and I'm sure that there have been improvements in scientific computing technology since then (e.g. now we have Dask).
One thing that might be helpful for people digging into this is the Jupyter notebook %prun command:
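For example (calc below is a hypothetical taxcalc.Calculator instance, so this is a sketch rather than a recipe):

# In a notebook cell, %prun runs one statement under the cProfile profiler
# and reports where the time goes, function by function.
%prun calc.calc_all()

# Outside a notebook, the standard-library equivalent is roughly:
#   python -m cProfile -s cumtime your_script.py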
Using a profiler will give a better idea of where Tax-Calculator is spending its time before you start tinkering with different approaches/enhancements. Jake Vanderplas has a blog post on profiling that may be helpful.
Also, Tax-Brain parallelizes Tax-Calculator computations by splitting them up by year. Perhaps users who need faster sims could try running Tax-Calculator through Tax-Brain, though this may not give them all the flexibility they need.
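A rough sketch of the by-year idea using Tax-Calculator's public API (the real Tax-Brain implementation differs; the CPS data and the iitax total are just illustrative choices):

from concurrent.futures import ProcessPoolExecutor
import taxcalc as tc

def total_iitax_for_year(year):
    # Each worker builds its own Calculator; cps_constructor() loads the
    # CPS file bundled with Tax-Calculator (a real run might use other data).
    calc = tc.Calculator(policy=tc.Policy(), records=tc.Records.cps_constructor())
    calc.advance_to_year(year)
    calc.calc_all()
    return year, calc.weighted_total("iitax")

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        totals = dict(pool.map(total_iitax_for_year, range(2021, 2026)))
    print(totals)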
Today I ran into a surprise:
I've always been surprised at how long the Tax-Calculator CLI seems to take: sometimes 10 or more minutes for a single run. When I created make-believe calculators in another language that I thought would do far more work than Tax-Calculator (e.g., calling 200 mathematical functions in a row on 1 million records, in Julia or R), they usually took only a few tens of seconds.
I've always assumed that Tax-Calculator does a lot of start-up processing and input error-checking, and that perhaps this took a long time. Or perhaps the code is not vectorized (maybe it isn't). In R, that alone can make the difference between a few seconds and a few hours.
But I've always run Tax-Calculator with a full output dump.
Today, because I know that I will never need more than 5 specific Tax-Calculator output variables for a task I am looking at, I used the dvars option and only dumped those 5 plus FLPDYR and RECID (7 variables in total).
To my surprise, running tc CLI on a file with 163k records and dumping these output variables only took 36 seconds on my machine.
So I reran it dumping the full output: 206 variables. That took 118 seconds. (Often I run files with 10x as many records, and run time can be 10x as long.) This suggests that Tax-Calculator took at least 80 seconds just to write a csv file with 163k records and 206 variables (unless a full dump also requires a lot of additional non-I/O processing that a small dump does not).
Next, I read the full dump csv file into R and wrote it back out as a second csv file. The write operation took only 6 seconds. Thus, Tax-Calculator appears to take about 13x as long to write its dump file as R takes to write the identical csv file (80 / 6 ≈ 13).
I am going to guess that, at least for me, the dump-file write operation is the primary cause of the slowness I perceive. I don't know all that much about Python, but if R can write csv files much more quickly, I'd guess there are also Python csv-writing approaches that are much faster.
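One option worth testing (my assumption, not something Tax-Calculator currently uses) is pyarrow's multi-threaded CSV writer; file names below are hypothetical:

import pandas as pd
import pyarrow as pa
import pyarrow.csv as pacsv

df = pd.read_csv("dump.csv")

# Convert to an Arrow table and write with Arrow's C++ CSV writer,
# which is typically much faster than pandas' DataFrame.to_csv.
table = pa.Table.from_pandas(df, preserve_index=False)
pacsv.write_csv(table, "dump2.csv")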
@donboyd5 how long does it take to do a similar operation with this file in Python? i.e.
import time
import pandas as pd

# Read the full dump file, then time only the write step.
df = pd.read_csv("yourfile.csv")
s = time.time()
df.to_csv("yourfile2.csv")
f = time.time()
print("elapsed time: ", f - s)
I'm curious whether Python and pandas are the bottleneck, or whether the bottleneck is in Tax-Calculator.
It looks to me like it might be a little of both: pandas took 42 seconds and R took 6 seconds. I did some googling yesterday, and several threads suggested that pandas' to_csv is on the slow side.
Here it is in Python:

[screenshot of the Python timing output: ~42 seconds]

Now in R:

[screenshot of the R timing output: ~6 seconds]
@MattHJensen did PR #2570 accomplish this for you? If not, what else would you like to see regarding profiling of the code?