PSLmodels / tax-microdata-benchmarking

A project to develop a benchmarked general-purpose dataset for tax reform impact analysis.
https://pslmodels.github.io/tax-microdata-benchmarking/

Enhancement: Avoid rerunning vdf = all_taxcalc_variables() after the first area when running make_all.py #225

Closed donboyd5 closed 1 week ago

donboyd5 commented 1 week ago

@martinholmer, On my machine, when running python -m tmd.areas.make_all on xx, yy, and zz, the line vdf = all_taxcalc_variables() in create_area_weights.py takes about 14 seconds on the first area (xx) and about 3 seconds on each of the next 2 areas (yy and zz). (I am not sure why the time drops after area 1. I am guessing it is either related to overhead that does not have to be repeated, or to parallelism. If you can educate me on this I'd appreciate it.)

As I read it, vdf is the same for every area. If there is a way to cut the ~3 seconds on areas 2..n, perhaps by keeping vdf in memory after area 1 and not recreating it (or, alternatively, by saving vdf to a fast binary file on area 1 and reading it back in for areas 2..n), that could be a substantial time savings in production.

If we are creating 435 Congressional Districts and we can save 3 seconds on 434 of them, that's a savings of more than 21 minutes which would be a great benefit in production runs where every sliver of time saved will be valuable.

martinholmer commented 1 week ago

@donboyd5 said in issue #225:

I am not sure why the time drops after area 1. I am guessing it is either related to overhead that does not have to be repeated, or to parallelism. If you can educate me on this I'd appreciate it.

The tax calculations in Tax-Calculator are done using JIT-compiled numba code, so the first time includes the JIT compilation overhead.

donboyd5 commented 1 week ago

Thanks!
