Improving speed and flexibility of TaxData extrapolation

donboyd5 commented 5 years ago

My colleagues @marshallpku, Gang Chen, and I may have an opportunity to examine TaxData extrapolation for possible speed and flexibility improvements, in preparation to run a large synthetic data file through TaxData. I have 4 questions (plus a bonus question) - feedback on any of these before we start would be useful.

Am I correct in understanding that TaxData extrapolation takes a long time to run (e.g., hours)?
And that this is primarily attributable to the reweighting stage (Stage 2), which runs a linear program designed to hit targets (i.e., constraints) while minimizing a penalty function, for each year (16 each for the PUF and CPS, by my count)?
What solver(s) are used?
- TaxData uses the PuLP wrapper to linear program solvers. Am I correct that TaxData uses PuLP’s default LP solver (line 192 of puf stage 2 and line 194 of cps stage 2?
  - Am I correct that this means that PuLP uses CBC (COIN-OR Branch and Cut solver), a mixed-integer linear program solver? (As best I can tell this is the default PuLP solver.)
  - And that CBC, when called via PuLP, uses the COIN-OR CLP linear program solver.
Has anyone investigated:
- Optional parameters for the CBC solver?
- Other solvers callable from PuLP?
- Other solvers callable from within python (not via PuLP)?
- Scaling the objective function?
- Scaling the constraints?
- Other objective functions?
- Inequality constraints?

Bonus question: I welcome any comments on the further info here for those who have a chance to look. I am especially interested in knowing of anything wrong in that document.

andersonfrailey commented 5 years ago

@donboyd5

Am I correct in understanding that TaxData extrapolation takes a long time to run (e.g., hours)?

Yes. It will typically take around 6-8 hours to run, depending on the file and what other processes you have running.

And that this is primarily attributable to the reweighting stage (Stage 2), which runs a linear program designed to hit targets (i.e., constraints) while minimizing a penalty function, for each year (16 each for the PUF and CPS, by my count)?

Yes. The length of time it takes to run can be attributed almost entirely to this process. Everything else takes ten minutes tops.

What solver(s) are used?

TaxData uses the PuLP wrapper to linear program solvers. Am I correct that TaxData uses PuLP’s default LP solver (line 192 of puf stage 2 and line 194 of cps stage 2)?

Am I correct that this means that PuLP uses CBC (COIN-OR Branch and Cut solver), a mixed-integer linear program solver? (As best I can tell this is the default PuLP solver.)

And that CBC, when called via PuLP, uses the COIN-OR CLP linear program solver.

Yes, these statements are all correct.

Has anyone investigated:

Optional parameters for the CBC solver?

Not extensively

Other solvers callable from PuLP?

No

Other solvers callable from within python (not via PuLP)?

Yes. I've looking into the linear programming model module in scipy, but couldn't figure out how we could use it with out changing the linear programming problem that we use. I would be open to changing the problem that we're solving to work with the scipy module if we still get similar results. I'm also open to switching to a new solver in general if the results are there.

Scaling the objective function?

No

Scaling the constraints?

No

Other objective functions?

No

Inequality constraints?

No

I'll take a look at the document you've linked to and leave comments there as well.

donboyd5 commented 5 years ago

Thanks, @andersonfrailey, this is exactly the kind of information @marshallpku, Gang Chen, and I need.

A few follow-up questions. If you know off the top of your, head, would be great to have info on:

Do you know whether the solver ever is memory bound - using all memory and swapping out to hard drive? Or is the problem just extremely CPU-intensive?
Do you ever find that a feasible solution cannot be found and if so is your response to increase the tol parameter in the call to solve_lp_for_year?
Do you know where the solver spends the bulk of its time - e.g., initializing, finding a feasible solution, finding an optimal solution once a feasible solution is found, something else? Are there logfiles anywhere of solver progression (e.g., iterations and status at each iteration) that would be worth looking at?
Have you experimented with different problem sizes (10k records, 50k records, 100k records, 200k records, or 5 constraints, 10 constraints, ...27 constraints), and if so was there a point at which execution time increased significantly?

To the extent that we can find time for this in our state tax data project, we'll set up a small testing environment that allows us to call a minimally modified version of solve_lp_for_year so that we can test impact of problem size, CBC parameters, and alternative PuLP solvers. I doubt we can find time within our state tax data project to examine non-PuLP approaches or non-LP approaches. I am confident that the latter can speed things up dramatically but are outside scope. They might be a good topic for a screenshare demonstration in a month or two, if you have time, simply to motivate the topic. But first it makes sense to see whether CBC/CLP parameter options and PuLP solver options make a meaningful difference.

We'll report back after some effort (and, hopefully, progress).

andersonfrailey commented 5 years ago

Do you know whether the solver ever is memory bound - using all memory and swapping out to hard drive? Or is the problem just extremely CPU-intensive?

I believe that it's just a CPU-intensive process. Though it does use a lot of memory as well.

Do you ever find that a feasible solution cannot be found and if so is your response to increase the tol parameter in the call to solve_lp_for_year?

Yes and yes. We typically need a higher tol in later years, and the tol levels we're currently using have typically been high enough for solutions to found.

Do you know where the solver spends the bulk of its time - e.g., initializing, finding a feasible solution, finding an optimal solution once a feasible solution is found, something else? Are there logfiles anywhere of solver progression (e.g., iterations and status at each iteration) that would be worth looking at?

I'm not sure which take the most time, but a bulk of the time is spent initializing and finding an optimal solution. The solver doesn't save any log files, it just prints out to the terminal. I have never found the output particularly informative. It's basically just a string of numbers associated with each iteration and I haven't been able to find any documentation on what those numbers mean.

Have you experimented with different problem sizes (10k records, 50k records, 100k records, 200k records, or 5 constraints, 10 constraints, ...27 constraints), and if so was there a point at which execution time increased significantly?

I have not. But we use it with both the CPS and PUF files (400K+ records and 250K+ records, respectively) and there is about an hour or so difference in how long it takes to run.

donboyd5 commented 5 years ago

Perfect, @andersonfrailey, thank you.

@marshallpku, @gchen3, and I talked about this yesterday. We will create a small standalone environment in which we can call minimally modified versions of solve_lp_for_year so that we can examine performance with different:

CBC options
PuLP solvers
scaling
Possibly non-PuLP solvers if we can wrap them in a function that allows using the current problem setup so that it doesn't have to be written differently (several papers suggest that CBC/CLP are among the top open-source LP solvers (although relative performance is highly dependent upon problem structure) so this does not a priori seem promising)
problem sizes (this is for investigative purposes only, not for potential adjustments - obviously solve_lp_for_year needs to be able to solve problems at least as large as it is solving now and I don't think it makes sense to go to the work of dividing it into a series of subproblems)

This seems to define the space of minimal adjustments (only requiring changes to a line or two of code) that might - might - result in performance improvements. If any of 1-3, and maybe 4, above appears promising we will propose an adjustment, along with its rationale.

As noted previously, there are other approaches that can speed things up by 1-2 orders of magnitude but they don't fall into the minimal adjustment domain we want to look at here.

In any event, we plan to declare success or failure early (in about 2 weeks) as this is a detour within our state income tax data project and we'll need to be back on the main road by then.

donboyd5 commented 5 years ago

Success!

Of likely interest to @andersonfrailey, @MattHJensen, and @martinholmer.

Also of interest to Gang Chen (@gchen3), who has been overseeing the reweighting work at Rockefeller College.

Yimeng Yin (@marshallpku) has achieved an approximately 6-fold speed-up of reweighting for the CPS and the PUF. It involves minimal changes to the code, swapping out the PuLP/CBC linear program solver for the CVXOPT solver and a few lines of code to prepare for and call the CVXOPT solver. He has confirmed that the results (a) satisfy all constraints (targets for variables), (b) achieve the same approximate minimum for the objective function, and (c) result in approximately the same set of weight adjustments as the PuLP/CBC solution.

Yimeng would be pleased to give a brief web presentation on what he has done. After that Yimeng can make a pull request so that you can see the code and decide whether to accept it. We would prefer to do the presentation before the PR because none of us has ever done a PR, and you may have advice during the presentation about how best to do it. However, if you prefer to have the PR first, Yimeng can do that.

If you are interested in seeing a web presentation, could you please say so, either here or to our email addresses (yyin@albany.edu, gchen3@albany.edu, and donboyd5@gmail.com), and Yimeng will set it up. Many thanks.

(Yimeng, please feel free to add further comments if you want.)

MattHJensen commented 5 years ago

Great news.

I would like to attend the web presentation.

If you are interested in using the next Tax-Calculator developer call as a forum for the presentation (Nov. 7 at 3 pm Eastern), I am happy to set aside as much time as you think you'll need.

donboyd5 commented 5 years ago

Hi Matt,

That would be great. If Yimeng could have 20 mins that would be ideal.

Don

Sent by an android

On Tue, Oct 29, 2019, 2:53 PM Matt Jensen notifications@github.com wrote:

Great news.

I would like to attend the web presentation.

If you are interested in using the next Tax-Calculator developer call as a forum for the presentation (Nov. 7 at 3 pm Eastern https://calendar.google.com/calendar/r/week/2019/11/7?eid=MXF2N2U5YzJzMzBvOWMwazA4YmwyMDhqaHFfMjAxOTExMDdUMjAwMDAwWiBwc2xtb2RlbHMub3JnXzN1Z3E2NmJ1NWpmbTY0MmY5aWk0aWt0c3AwQGc&ctz=America/New_York&sf=true), I am happy to set aside as much time as you think you'll need.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/PSLmodels/taxdata/issues/326?email_source=notifications&email_token=ABR4JGGUMMDPQIOXDO6A733QRCBBNA5CNFSM4I5B7A5KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECRVXXA#issuecomment-547576796, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABR4JGAQN3XW4BBVJ7OAPILQRCBBNANCNFSM4I5B7A5A .

MattHJensen commented 5 years ago

If Yimeng could have 20 mins that would be ideal.

Certainly. If we end up needing more time for questions and discussion, that will be fine too. I'll schedule this first, so 3p Eastern on Nov 7.

donboyd5 commented 5 years ago

Great! Thanks. Don

On Thu, Oct 31, 2019 at 10:57 AM Matt Jensen notifications@github.com wrote:

If Yimeng could have 20 mins that would be ideal.

Certainly. If we end up needing more time for questions and discussion, that will be fine too. I'll schedule this first, so 3p Eastern on Nov 7.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/PSLmodels/taxdata/issues/326?email_source=notifications&email_token=ABR4JGCRCDELE5X45363KLTQRLW5VA5CNFSM4I5B7A5KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECYCVLA#issuecomment-548416172, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABR4JGB234LTNSRSTZANBH3QRLW5VANCNFSM4I5B7A5A .

andersonfrailey commented 5 years ago

This is great news, @donboyd5! I have class during the Nov. 7th call, but I'll be happy to review the PR with the changes when it's up.

donboyd5 commented 5 years ago

Great, Anderson! Yimeng will provide a link to the presentation when he has it, in case you have questions or want a separate discussion on it.

Don

On Fri, Nov 1, 2019 at 4:56 PM andersonfrailey notifications@github.com wrote:

This is great news, @donboyd5 https://github.com/donboyd5! I have class during the Nov. 7th call, but I'll be happy to review the PR with the changes when it's up.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/PSLmodels/taxdata/issues/326?email_source=notifications&email_token=ABR4JGC4BTIORBRMYLF7CC3QRSJZFA5CNFSM4I5B7A5KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEC4EQTY#issuecomment-548948047, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABR4JGF4CYJOQVSKMAHI3ODQRSJZFANCNFSM4I5B7A5A .

yimengyin16 commented 5 years ago

Hi all,

Here is the link to the slides for my presentation this afternoon.

Look forward to talking to you!

Yimeng

MattHJensen commented 5 years ago

@yimengyin16, I investigated the licensing issue further, and there is nothing to worry about.

The license of CVXOPT does not affect our licensing of TaxData so long as we require our users to install CVXOPT (by adding it to environment.yml to replace pulp). It would only be a matter of concern if we distributed the CVXOPT source code as part of TaxData.

Thanks again for the great presentation last week.

yimengyin16 commented 4 years ago

@MattHJensen, thank you very much for investigating the licensing issue!

andersonfrailey commented 4 years ago

This is great, @yimengyin16. Thanks for all of your work on this. I'm also curious to see how the Scipy LP solver performs, but don't feel like you need to hold off on opening a PR using CVXOPT. I'd love to try running it on my machine to see how everything goes.

PSLmodels / taxdata

Improving speed and flexibility of TaxData extrapolation #326