PSLmodels / PUF-State-Distribution

MIT License

Which NLP solver to use? #1

Open MattHJensen opened 6 years ago

MattHJensen commented 6 years ago

@donboyd5, @evtedeschi3 please feel free to rename as you see fit. You two are the admins.

donboyd5 commented 6 years ago

Are you wedded to IPOPT? What about other options in R like nleqslv?

I am not an expert on nonlinear solvers, but have worked with many and have developed satisficing preferences after extensive experimentation.

A) We need an NLP solver that, at a minimum:

1) Can impose bounds on variables (most can), sometimes called box constraints. In our case, our variables (the x[i] adjustment factors that are multiplied by record weights to arrive at adjusted weights) will need to be >= 0, and we may want to impose upper bounds, too.

2) Can handle linear equality and inequality constraints. In our case, an equality constraint might be that we want the number of returns, calculated with the new adjusted weights, to equal 12 million for the state of California. An inequality constraint might be that we want total capital gains income in the $75-100k agi range in California, calculated using the new adjusted weights, to fall between $42 billion and $44 billion. (These numbers are all made up.)

Any NLP solver that can do this, AND solve our problem (no matter how large or complicated it gets), will work.
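To make that concrete, here is a minimal sketch in R of how A1 and A2 might be expressed in the form most NLP interfaces (including ipoptr) expect: bounds on the variables plus a lower and upper limit on each constraint. Everything below - the data objects and the target numbers - is made up purely for illustration; it is not the real PUF or Historical Table 2 data.

```r
set.seed(1)
n  <- 1000                            # pretend these are the records we use for one state
wt <- runif(n, 50, 500)               # starting record weights
cg <- ifelse(runif(n) < 0.2, rexp(n, 1/5000), 0)   # capital gains, mostly zero

# A1) bounds ("box constraints") on the x[i] adjustment factors
lb <- rep(0, n)                       # x[i] >= 0, so adjusted weights can never go negative
ub <- rep(Inf, n)                     # or a finite cap if we decide we want one

# A2) linear constraints, written as a lower and upper limit on each g(x):
#     g1(x) = number of returns    = sum(x * wt)
#     g2(x) = total capital gains  = sum(x * wt * cg)
eval_g <- function(x) c(sum(x * wt), sum(x * wt * cg))
constraint_lb <- c(12e6, 42e9)        # equality target: lower limit == upper limit
constraint_ub <- c(12e6, 44e9)        # inequality target: a range
```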

B) However, some variants of our problem can get large, complicated, and hard to solve. In my experience the following attributes of an NLP solver are also valuable:

1) It can handle sparse constraint-coefficient matrices.

Suppose you have 2 constraints, where x[i] is the adjustment factor we are solving for, wt[i] is the set of weights we are starting with, and cg and pension are capital gains and pension income on the file, respectively:

a) The number of returns with positive capital gains must equal 1,000. LHS: sum over i of x[i] * wt[i] * (cg[i] > 0); RHS: 1,000.

b) The number of returns with positive pension income must equal 10,000. LHS: sum over i of x[i] * wt[i] * (pension[i] > 0); RHS: 10,000.

A constraint coefficient (cc) is the derivative of the LHS with respect to x[i] - how much will the constraint value on the adjusted file change if we change x[i]? In equation a, the cc is wt[i] for those records where cg[i] is > 0, and zero for all other records (probably most of them). (If we increase x[i] by 1, the value of the LHS of constraint a will change by wt[i] for returns that have capital gains.)

Similarly for equation b.

In our case, all of our constraints will be linear functions of x[i], like those above, with no interaction between x[i] and x[something else], and so all of our constraint coefficients will be constants (which is nice for us - they don't have to be recalculated with each new guessed-at set of x[i]'s).

A constraint coefficient matrix (ccm) is a matrix with a row for each i (150k rows, if we are looking at a single state in isolation) and a column for each constraint. In our case, it would be a 150k x 2 matrix. In column 1, corresponding to constraint a, the values would be wt[i] for those records with capital gains, and zero otherwise. Similarly for column 2.
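For concreteness, here is a toy version of that ccm in R, with tiny made-up data standing in for the 150k-record file:

```r
set.seed(2)
n       <- 10                                  # ~150k in practice; tiny here so it prints
wt      <- round(runif(n, 50, 500))
cg      <- ifelse(runif(n) < 0.2, rexp(n, 1/5000), 0)   # most records have no capital gains
pension <- ifelse(runif(n) < 0.3, rexp(n, 1/8000), 0)   # most records have no pension income

# Because the constraints are linear in x, d(LHS)/dx[i] is a constant:
# wt[i] where the record qualifies, zero everywhere else.
ccm <- cbind(a = wt * (cg > 0),                # column for constraint a (capital gains)
             b = wt * (pension > 0))           # column for constraint b (pensions)
ccm                                            # mostly zeros -- a sparse matrix
```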

Perhaps only 10% of the elements of the 300k element ccm will be nonzero, but we have reserved space for all 300k. A matrix with a lot of zeros is a sparse matrix.

Some nonlinear solvers take advantage of sparsity, storing only the nonzero values in a compact way (we'll get to that later) and also doing calculations with special methods, making them much more able to deal with very large problems. (Imagine that we got to a situation where we had 8 million records and 1,000 constraints, for an 8 billion element constraint coefficient matrix. Suppose further - and this is not unrealistic - that only 4% of the elements, or 320 million, were nonzero. It would be a good candidate for sparse matrix methods.)
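Continuing the toy example above, the compact storage looks roughly like this in the form the ipoptr interface expects - as I understand its convention, one vector of nonzero positions per constraint, plus a function returning the matching values in the same order:

```r
# positions of the nonzero coefficients, one integer vector per constraint
jac_structure <- list(which(ccm[, "a"] != 0),
                      which(ccm[, "b"] != 0))

# the nonzero values themselves, in the same order as jac_structure;
# they do not depend on x because our constraints are linear in x
eval_jac_g <- function(x) c(ccm[jac_structure[[1]], "a"],
                            ccm[jac_structure[[2]], "b"])
```

Passed to the solver as eval_jac_g and eval_jac_g_structure, the zero entries are never stored or evaluated.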

Anyway, Ipopt can use sparse matrix methods, making it well suited for large problems.

2) Ipopt is an interior point solver. Another common class of solver methods is based on generalized reduced gradient methods. Interior point methods seem to work very well on large convex problems like ours are likely to be. But I don't know much more about it than that.

3) Nonlinear solvers rely on linear solvers for part of their work. When problems get large, linear solvers that store the entire problem in memory (even on my 32 gb RAM machine) can choke and swap out memory, slowing them down. And some linear solvers simply work better than others. Ipopt allows you to decide which linear solvers to use, and in particular can use the Harwell Subroutine Library solvers, including one known as MA77, which does most of its work out of memory and is well-suited to huge problems.
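From R, picking the linear solver is just an Ipopt option. A sketch (linear_solver is a standard Ipopt option; ma77 and ma57 are only available if your Ipopt build includes the HSL routines):

```r
opts <- list(linear_solver = "ma77",   # out-of-memory HSL solver; "ma57" for an in-memory one
             max_iter      = 500,
             print_level   = 5)
# ...and later: ipoptr(..., opts = opts)
```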

So, Ipopt coupled with MA77 seems to be able to solve absolutely huge problems that have a structure similar to ours. I have had some NLP solver and linear solver combinations run for days and weeks, but I think this combination will generally solve our problems, robustly, in seconds or minutes.

That said, maybe what we are doing now won't get that large. Other NLP solvers may be perfectly appropriate. It's not really up to me to decide what we use. I am fine with experimentation. I have experimented with probably a dozen nonlinear solvers and a half-dozen linear solvers, and settled on this combination, but it is very hard to get it all working the very first time. After that, it is easy forever after.

So bottom line, whatever you find works for the problem, in acceptable time, should be fine.

As for nleqslv, I don't think that is an NLP solver - it solves systems of nonlinear equations, but I don't think it is an optimizer.

feenberg commented 6 years ago

On Fri, 22 Dec 2017, Don Boyd wrote:

Are you wedded to IPOPT? What about other options in R like nleqslv?

I believe SNOPT from Stanford also allows sparse matrices and might be suitable. I looked at it when I was frustrated by all the different libraries IPOPT required, but got IPOPT working before actually trying SNOPT out.

Can I ask for a description of the object of this exercise? I thought it was to impute state of residence to PUF returns, but the example given is to modify weights given a state of residence. So the purpose is to take state of residence from the 2008 PUF and modify weights till the totals add up to published aggregates? What about the >200K returns?

dan

donboyd5 commented 6 years ago

I just elaborated on the objective, at https://github.com/open-source-economics/PUF-State-Distribution/blob/master/README.md - please see if that makes sense.

donboyd5 commented 6 years ago

I don't think SNOPT is available through R. It could be accessed by other means.

feenberg commented 6 years ago

On Fri, 22 Dec 2017, Don Boyd wrote:

I don't think SNOPT is available through R. It could be accessed by other means.

Yes, if R is required.

I wonder about the desirability of an out-of-memory linear solver. The total memory taken by this problem is large for a desktop, but quite reasonable for a modest size server. Even 64GB would allow the solver to stay in memory (with sparse matrix support) and I would have to believe that would be a great time-saver.

We can certainly use the system here - we have R and IPOPT, although I have never combined them.

dan


donboyd5 commented 6 years ago

Thanks for offering use of the system you have. It is definitely good to know that is a possibility. I'm actually not sure yet how large or difficult this problem will become and I am very optimistic that it will be solvable on commodity equipment. A few comments:

feenberg commented 6 years ago

On Sat, 23 Dec 2017, Don Boyd wrote:

Thanks for offering use of the system you have. It is definitely good to know that is a possibility. I'm actually not sure yet how large or difficult this problem will become and I am very optimistic that it will be solvable on commodity equipment. A few comments:

  • The memory requirements of the approach you and I discussed early/mid 2017 were much greater than we have been discussing here. There we were talking about assigning 70k uncoded records to 50 states in a way that exhausted each record exactly. That required us to solve for 3.5 million proportions (50 proportions x 70k records) and to impose a constraint on every single record that the proportions, across states, add to 1 -- every record is used completely, whether assigned all to one state, 60% to one state and 40% to another, or 2% to each of 50 states, or something else. Thus, we had an NLP with 3.5 million variables, at least 70k constraints (plus all of our substantive targets), and it was relatively large and difficult. Still, it was solvable on commodity equipment in a matter of hours.

Yes, that is right.

  • The problem we have been discussing here, so far, is much easier. We have been talking about distributing some portion of each record on a (let's say) 160k record file, to each of 50 (approx) states, without worrying about using each record's weight exactly -- we can re-use records or not, as needed. On its face, this might seem big -- we need 8 million variables (the proportion of each of 160k records to be distributed to each of 50 states) and, maybe, 5k constraints (25 substantive targets for each of, let's say, 4 income ranges, for each of 50 states). But each state is separable from the others (since we are not constraining a record across states) and each income range is separable, so that we ...

I don't know what you mean by "each state is separable". If the aggregates by state are targets, doesn't that prevent each state from separating from the others in the same income range?

  ... can solve 200 separate problems: one state, one income range (50 x 4). The average problem size might entail about 40k records (the approx 40k records for a specific income range -- 160k / 4; of course some ...

I see, we give up the requirement that the probabilities sum to 1 for each state. Same number of variables but far fewer constraints. That makes some sense as a simplification. Does it change the result? Do you get negative weights? I worry that the randomization of weight sums per record might affect the accuracy of the federal calculation. Do we use the SOI weight for the federal aggregates and our random weights for the state level aggregates? I think that works but do the states add up to the federal?

Don't you get some negative weights? Doesn't that bother you?

  ... will be larger, some smaller) and the 25 constraints for that income range and state. Unless we find reasons to make the problem more complex (e.g., constraints across states or income ranges), this should be pretty easy to solve. (CAUTION: In my experience, problems almost always become more difficult than you think they are at the start.)

I am not doing too well at the moment. IPOPT claims no solution in the feasible region. Maybe I made a mistake somewhere, so I will work on it.

Oddly, IPOPT doesn't seem to offer a way to look at the values for X after each step, or any way to look at G when execution terminates. So I have to jury-rig something.

dan

  • IF the problem becomes large, that does not necessarily mean an in-memory solver on a large machine will be better than an out-of-memory solver on a small machine. It is an empirical question. I have done a LOT of tests and timings of this. For example, I have solved relatively large problems requiring, say, 24 GB of RAM all told (considering all requirements), on a machine that had 32 GB of RAM, both ways: in memory, and out of memory. To do this in memory, the operating system must allocate 24 GB of RAM, which is a VERY expensive and time-consuming operation. As a result, the out-of-memory solver, which does not need to do this, can actually be much faster than an in-memory solver even on a problem that fits entirely in memory. But the Ipopt/Harwell HSL combination allows a range of solvers that all are excellent, some of which are in memory (e.g., MA57) and one of which is out-of-memory (MA77).

  • But again, it is an empirical question, and if the problem becomes much larger than we have been discussing so far, and larger than what is practical on PCs, it is really good to know that we could do it on much larger and faster equipment.


donboyd5 commented 6 years ago

I see, we give up the requirement that the probabilities sum to 1 for each state. Same number of variables but far fewer constraints. That makes some sense as a simplification. Does it change the result? Do you get negative weights? I worry that the randomization of weight sums per record might affect the accuracy of the federal calculation. Do we use the SOI weight for the federal aggregates and our random weights for the state level aggregates? I think that works but do the states add up to the federal? Don't you get some negative weights? Doesn't that bother you?

If we do every state independently, we are letting the PUF data speak - what fraction of each record's weight should we use in order to hit the targets for the state? We would constrain that fraction so that it would be >= 0. Thus, we would never have negative weights, although zero weights might occur and might be desirable - some records drop out in some states. For example, suppose the file has a record with $50k agi and a $15k SALT deduction. If we knew where the records really came from (we don't), it might have been for some poor sucker in high-tax NY who is in the PUF sample. When we get around to targeting low-tax Mississippi, the method we use might use 0% of this record's weight, but it would never use a negative % of the weight - we wouldn't allow that.

The targets for each of the 50 states, by design, would be what we know (or believe) to be true for the 50 states. If we do a good job of hitting ALL of those targets, then by definition the sum of the values on our constructed file will equal the national totals, for all of the targeted variables, because the values for the states in https://www.irs.gov/statistics/soi-tax-stats-historic-table-2 add to the national total in that same source.

In other words, sums of targeted variables would equal known national totals. But that doesn't mean sums of calculated variables, such as federal tax under a different tax law, would equal the sum of same computed from the original file, although I suspect it would be close.

I think this would be an important check on our work and we would want, always, to investigate how close the totals are.
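As an illustration of the kind of check I mean (everything here is hypothetical toy data, not our real file): sum an untargeted, calculated variable two ways - once with the original SOI weights, and once with the new state weights summed across states - and look at the difference.

```r
set.seed(3)
n      <- 1000
soi_wt <- runif(n, 50, 500)                        # original national weights
x      <- matrix(runif(n * 50, 0, 2/50), n, 50)    # toy share of each record used in each state
state_wt <- soi_wt * x                             # adjusted weights, records by states
fedtax_newlaw <- rexp(n, 1/3000)                   # a calculated, untargeted variable

national      <- sum(soi_wt * fedtax_newlaw)
sum_of_states <- sum(rowSums(state_wt) * fedtax_newlaw)
c(national = national, sum_of_states = sum_of_states,
  pct_diff = 100 * (sum_of_states - national) / national)
```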

donboyd5 commented 6 years ago

IPOPT claims no solution in the feasible region. Maybe I made a mistake somewhere, so I will work on it. Oddly, IPOPT doesn't seem to offer a way to look at the values for X after each step, or any way to look at G when execution terminates. So I have to jury-rig something.

I don't know if you've asked the Ipopt list; there may be an easy way to retrieve the X values at each step when you are calling Ipopt from a C++ or Fortran program, thus returning them to your program. An alternative might be to obtain them as output. Ipopt has an option, file_print_level, that controls the amount of output written to a file. It can range from 0 to 12, with a default of 5. I don't think the documentation says what you get at the different levels; the Ipopt mailing list might, or you might just experiment. I have to believe writing the X's would be included in one of the higher levels.

I, too, have wanted to retrieve the X's. When calling from R, the Ipoptr interface to Ipopt does not let you do that. What I have done - and you could easily do, too - is put a statement in your objective function (often called eval_f) that writes the then-current X values to an external file so you can watch how they change from step to step. That only took about 10 minutes of programming time for me.

I don't know about G and am not sure what you mean -- do you mean the LHS of the constraints at various steps? (The function typically called eval_g evaluates those.) I am sure one of the approaches above would work. You can certainly retrieve them at the end. If you want them at individual steps you could put a similar write-to-file statement in eval_g as I did for the X's in eval_f. Or you could put a call to eval_g in eval_f, and write the results to a file.
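A sketch of that jury-rig in R, assuming the usual ipoptr convention that eval_f and eval_g are plain R functions receiving the current x (the toy ccm and the objective here are placeholders for whatever you are actually using):

```r
n   <- 100
wt  <- runif(n, 50, 500)
ccm <- cbind(wt * (runif(n) < 0.2), wt * (runif(n) < 0.3))   # toy constraint coefficients

eval_f <- function(x) {
  write(x, file = "x_trace.txt", ncolumns = length(x), append = TRUE)  # current X's, one row per call
  sum(x * log(x))                            # placeholder objective
}

eval_g <- function(x) {
  g <- colSums(x * ccm)                      # LHS of the linear constraints
  write(g, file = "g_trace.txt", ncolumns = length(g), append = TRUE)
  g
}
```

Keep in mind the solver typically calls eval_f and eval_g more than once per iteration (line searches and so on), so the trace files will have more rows than the iteration count.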

The standard output lists a summary measure of constraint violations at each iteration; again, higher levels of file output might write the details.

I really hope we get to the point where the full recent (C++) version of Ipopt is working in multiple environments. As we've discussed, I can provide my Windows version to people with proper licenses, but it would be great to have it on Linux and Macs.

donboyd5 commented 6 years ago

IPOPT claims no solution in the feasible region. Maybe I made a mistake somewhere, so I will work on it.

If you have not used the derivative checker, I strongly recommend using it if you have any doubt at all about whether your derivative calculations are correct. I have found mistakes that way - I sent Ipopt looking for solutions to a feasible problem in directions where a feasible solution could not exist.
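For reference, the checker is switched on through standard Ipopt options; from ipoptr that might look like the sketch below. It runs finite-difference checks against your eval_grad_f and eval_jac_g before the optimization starts.

```r
opts <- list(derivative_test              = "first-order",
             derivative_test_perturbation = 1e-8,
             derivative_test_print_all    = "yes")   # print every comparison, not just the failures
# ipoptr(..., opts = opts)
```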

Another way to check derivative calculations - and I admit this is embarrassing, and something that's probably entirely unnecessary with you - is with an online derivative calculator such as https://www.derivative-calculator.net/. I have used this, sheepishly, and found it to be useful.

feenberg commented 6 years ago

On Sun, 24 Dec 2017, Don Boyd wrote:

  I see, we give up the requirement that the probabilities sum to 1 for each state. Same number of variables but far fewer constraints. That makes some sense as a simplification. Does it change the result? Do you get negative weights? I worry that the randomization of weight sums per record might affect the accuracy of the federal calculation. Do we use the SOI weight for the federal aggregates and our random weights for the state level aggregates? I think that works but do the states add up to the federal? Don't you get some negative weights? Doesn't that bother you?

  If we do every state independently, we are letting the PUF data speak - what fraction of each record's weight should we use in order to hit the targets for the state? We would constrain that fraction so that it would be >= 0. Thus, we would never have negative weights, although zero weights might occur and might be desirable - some records drop out in some states. For example, suppose the file has a record with $50k agi and a $15k SALT deduction. If we knew where the records really came from (we don't), it might have been for some poor sucker in high-tax NY who is in the PUF sample. When we get around to targeting low-tax Mississippi, the method we use might use 0% of this record's weight, but it would never use a negative % of the weight - we wouldn't allow that.

  The targets for each of the 50 states, by design, would be what we know (or believe) to be true for the 50 states. If we do a good job of hitting ALL of those targets, then by definition the sum of the values on our constructed file will equal the national totals, for all of the targeted variables, because the values for the states in https://www.irs.gov/statistics/soi-tax-stats-historic-table-2 add to the national total in that same source.

  In other words, sums of targeted variables would equal known national totals. But that doesn't mean sums of calculated variables, such as federal tax under a different tax law, would equal the sum of same computed from the original file, although I suspect it would be close.

  I think this would be an important check on our work and we would want, always, to investigate how close the totals are.

If we have 150,000 records and 40 constraints, there are obviously an infinite number of solutions that meet the constraints. Your choice to "use the distribution of weights that is most similar to the national distribution that also meets the constraints" seems really plausible and worthwhile, but I still worry about negative weights. I have to guess that they are very rare, but even if that is true, I don't think it helps matters to constrain them to zero ex post. Won't that bias the result away from the known aggregates? Do you have any statistics on how often they occur? If they are rare, shouldn't we leave them in? If they are common, we should use another approach.

I am trying to puzzle out if the MaxEnt approach tends to make the states look like the national distribution, and I think it does. But as you point out there is considerable extra arithmetic with all those additional constraints.

dan


donboyd5 commented 6 years ago

Forcing the weights to be non-negative doesn't bother me intellectually. I am not sure what would happen if we didn't force it. I think it depends on (a) how we establish the initial weights (for example, whether they are simply scaled down national weights, or something else), and (b) the objective function, and how much it penalizes changes from those initial weights. If we use a maximum entropy objective function, you could not get negative weights because you would have to have a negative x[i] and you can't take a log of a negative number. For some of the other objective functions under consideration, a negative x[i] is possible to have in the function, but the penalty would be huge.

So I don't know, 100%, but I doubt it would be an issue even if we constrained the x[i]'s and therefore the weights to non-negative numbers.

In all of the runs I have done, I have always constrained the x[i]'s to be nonnegative so I don't have any statistics on what happens if you relax that requirement. With maximum entropy, the problem would simply blow up if we tried to have negative values because of the can't-take-a-log-of-a-negative-number issue. That objective function tends to spread the x[i]'s out in (0, 1), with clustering in the 0.2 to 0.3 range I think. I don't think the other objective functions would want x[i] to be negative, generally.
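To illustrate the log point with one common entropy-style penalty (just a sketch - the exact objective we end up using may differ): the function is simply undefined for x <= 0, so the solver can never be drawn toward negative adjustment factors.

```r
eval_f      <- function(x) sum(x * log(x) - x + 1)   # = 0 at x = 1, grows as x moves away from 1
eval_grad_f <- function(x) log(x)                    # gradient; also undefined for x <= 0

eval_f(c(0.25, 1, 2))    # finite for positive x
eval_f(c(-0.5, 1, 2))    # NaN (with a warning): log of a negative number
```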

As for bias caused by forcing the weights to be non-negative, I am not sure I understand the concern. The results will hit the known aggregates. If there are weights that the objective function wants to be negative, they will instead be zero - the associated, presumably weird, records will drop out. That might be exactly what we want - such as the example I gave somewhere about the very-high-SALT-deduction record in the state of Mississippi, which would be a weird record there since it is such a low-tax state. I think we'd rather have it have zero weight than have a negative weight, even if the objective function, without a zero lower bound on the x[i]'s, wanted it to be negative.

My reaction is that we should start down the road and begin examining actual results. During the week I'll propose some next steps.

feenberg commented 6 years ago

On Sun, 24 Dec 2017, Don Boyd wrote:

As for bias caused by forcing the weights to be non-negative, I am not sure I understand the concern. The results will hit the known aggregates. If there are weights that the objective function wants to be negative, they will instead be zero - the associated, presumably weird, records will drop out.

I am confused. Are you constraining the weights to be >0 during the maximization, or are you accepting negative weights during maximization but setting them to zero when simulations are done? If the weights are constrained to be >0, that adds a constraint for each taxpayer, which I thought you wanted to avoid. You would have the same number of constraints as MaxEnt, although they would be inequalities rather than equalities. Does that make a difference? I have no idea if it is an important difference - I have never done optimizations with inequality constraints.

I agree that there is no bias problem if the constraints are imposed during maximization.

dan

donboyd5 commented 6 years ago

During optimization. They are simply bounds on the variables. Not a computational difficulty at all.
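To spell that out in ipoptr terms (argument names per my reading of its documentation; the numbers are placeholders): the non-negativity requirement goes into lb, not into eval_g, so it adds nothing to the constraint Jacobian.

```r
n <- 1000
args_sketch <- list(
  x0            = rep(1, n),        # start from the unadjusted weights
  lb            = rep(0, n),        # x[i] >= 0 -- a bound on each variable, not a constraint row
  ub            = rep(Inf, n),
  constraint_lb = c(12e6, 42e9),    # only the substantive targets appear here
  constraint_ub = c(12e6, 44e9))
# do.call(ipoptr::ipoptr, c(args_sketch, list(eval_f = eval_f, eval_grad_f = eval_grad_f,
#                                             eval_g = eval_g, eval_jac_g = eval_jac_g,
#                                             eval_jac_g_structure = jac_structure, opts = opts)))
```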
