astamm / nloptr

nloptr provides an R interface to NLopt, a free/open-source library for nonlinear optimization providing a common interface to a number of different optimization routines which can handle nonlinear constraints and lower and upper bounds for the controls.
https://astamm.github.io/nloptr/

Slow Convergence Times For Local Gradient Based Algos #154

Closed. jjoba closed this issue 1 month ago.

jjoba commented 1 month ago

Hi! I've been using nloptr quite a bit recently and have noticed that convergence for the local gradient-based algorithms is slower than I would expect. For example, in the code below, NLOPT_LD_LBFGS takes 185 iterations to converge, whereas L-BFGS-B in base R's optim() takes only 27. I also tried NLOPT_LD_SLSQP and NLOPT_LD_MMA, and both seem to get "stuck" or approach the optimum very slowly (if they ever reach it). In addition, SLSQP will occasionally give radically different solutions on sequential runs (I haven't been able to reproduce this consistently). SLSQP and MMA are both appealing because they can handle nonlinear constraints.

Below is a reproducible example.

If there is a parameter I'm missing that would help or if you need additional information, please let me know. Thanks in advance for looking into this.

```r
# R version 4.2.2 64-bit on Windows 11

library(nloptr) # 2.0.3

set.seed(1234)
n <- 300 # arbitrarily large number for testing
x <- exp(rnorm(n))
b <- exp(rnorm(n)) # beta/coef

# Objective function to minimize
f_obj <- function(x, n, b){
  -sum(b * log(x) - x)
}

# Gradient of the objective function
f_grad <- function(x, n, b){
  n - ((b * n) / x)
}

# Benchmark using base R implementation; takes 27 iterations to converge
optim(rep(2, n), f_obj, gr = f_grad, method = "L-BFGS-B",
      n = n, b = b, lower = rep(1, n))

# Same as above using nloptr; takes 185 iterations to converge
nloptr(
  x0 = rep(2, n),
  eval_f = f_obj,
  eval_grad_f = f_grad,
  lb = rep(1, n),
  opts = list(
    "algorithm" = "NLOPT_LD_LBFGS",
    "xtol_rel" = 1e-08, # set to mimic optim's tolerance, roughly equivalent to factr = 1e7
    "maxeval" = 200),   # set to enable convergence
  n = n,
  b = b
)

# Optimal value of objective function = ~46.07

# MMA makes progress in the first few iterations, then gets stuck at 70
# and terminates at iteration 54
nloptr(
  x0 = rep(2, n),
  eval_f = f_obj,
  eval_grad_f = f_grad,
  lb = rep(1, n),
  opts = list(
    "algorithm" = "NLOPT_LD_MMA",
    "xtol_rel" = 1e-08,
    "maxeval" = 200,
    "print_level" = 3), # to show optimization path
  n = n,
  b = b
)

# SLSQP makes inefficient choices along the way -- each step is typically
# worse, not better, but it does eventually get close to the optimum
nloptr(
  x0 = rep(2, n),
  eval_f = f_obj,
  eval_grad_f = f_grad,
  lb = rep(1, n),
  opts = list(
    "algorithm" = "NLOPT_LD_SLSQP",
    "xtol_rel" = 1e-08,
    "maxeval" = 200, # at 10K it will get down to 47, seemingly by luck
    "print_level" = 3), # to show optimization path
  n = n,
  b = b
)
```

aadler commented 1 month ago

Are you certain your gradient is correct? Shouldn't it be $n - \frac{b}{\sum{x}}$?

jjoba commented 1 month ago

I just double-checked on Wolfram Alpha and got $n - \frac{bn}{x}$.

I think the reason for the difference is that the gradient supplied is w.r.t. each element of the vector $x$.

jjoba commented 1 month ago

In case it's helpful, I was just able to get MMA to converge by running it in a loop, passing the ending parameters of each run as the starting point for the next.

```r
# Restart MMA from the best solution found so far until it stops improving
best_solution <- rep(2, n)
best_obj <- Inf
i <- 1

while(TRUE){
  print(paste("Iteration:", i, "Obj:", best_obj))

  res <- nloptr(
    x0 = best_solution,
    eval_f = f_obj,
    eval_grad_f = f_grad,
    lb = rep(1, n),
    opts = list(
      "algorithm" = "NLOPT_LD_MMA",
      "ftol_rel" = 1.0e-16,
      "print_level" = 0,
      "maxeval" = 1000),
    n = n,
    b = b
  )

  # Keep looping while each restart improves on the previous objective
  if(res$objective < best_obj){
    best_obj <- res$objective
    best_solution <- res$solution
    i <- i + 1
  } else{
    break
  }
}
```

aadler commented 1 month ago

> I just double-checked on Wolfram Alpha and got $n - \frac{bn}{x}$.
>
> I think the reason for the difference is that the gradient supplied is w.r.t. each element of the vector $x$.

I think there may be a mistake: you're summing over $i$, but there is no $i$ on the $x$'s in your summand, so with respect to $i$, $x$ is constant, which is why you are getting the extra $n$ in the numerator.

Consider the function on an individual observation: $\beta\ln(x) - x$. The derivative of that is, if I remember my calculus from 30-odd years ago, $\frac{\beta}{x} - 1$. Sum that over all $x$ and you get $$\sum_{i = 1}^n{\frac{\beta}{x_i}} - n$$ which, when adjusted for the negative sign and pulling out constants, becomes $$n - \beta\sum_{i=1}^n{\frac{1}{x_i}}$$ does it not?
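If it helps to settle it numerically, nloptr ships a finite-difference checker. Below is a quick sketch reusing the f_obj, f_grad, n, and b from the example above; f_grad2 is just my guess at the elementwise form of the derivative above, one partial per $x_i$, and is not something from your post:

```r
# Sketch: compare the analytic gradient against finite differences
# using nloptr's built-in checker (reuses f_obj, f_grad, n, b from above).
library(nloptr)

# Gradient as originally posted
check.derivatives(
  .x = rep(2, n),
  func = f_obj,
  func_grad = f_grad,
  check_derivatives_print = "errors", # only print mismatching components
  n = n,
  b = b
)

# Hypothetical elementwise form of the derivative derived above:
# d/dx_i of -(b_i * log(x_i) - x_i) is 1 - b_i / x_i
f_grad2 <- function(x, n, b) 1 - b / x

check.derivatives(
  .x = rep(2, n),
  func = f_obj,
  func_grad = f_grad2,
  check_derivatives_print = "errors",
  n = n,
  b = b
)
```

If the posted gradient is off, I would expect the first call to flag most components and the second to come back clean.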

aadler commented 1 month ago

Furthermore, I think you're using the optimization incorrectly.

First, what precisely are you trying to optimize? Is it a loss distribution, or a cost function that is always non-negative? If you can describe your objective, that would help. Looking at the log-space form, the "normal space" form would need to be $$\frac{x^\beta}{e^{x}}$$ What precisely are you trying to solve?

What it looks like you are doing from your f_obj is trying to solve for a separate $\beta$ for each observation. If so, you are bound to have problems due to the polynomial interpolation theorem, which states that, given $n$ observations, one can always construct a degree $n - 1$ polynomial that goes through each and every point. So the "optimum" is that polynomial, which teaches you nothing about the distribution (the ultimate in bias-variance trade-off: no bias and infinite variance). Also, $n$ should not be a parameter of the distribution; it is merely the count of observations.

Perhaps you meant to find a SINGLE $\beta$ given the $x$. Well, the way you generated them, the mean of both $\beta$ and $x$ is $1$. So I expect the mean of $\frac{x^\beta}{e^x}$ to be $\frac{1}{e}$, so the mean of the log should be $-1$, which it will be, since $1 \cdot 0 - 1 = -1$. However, in this case, using calculus, we would need to consider the function as one of $\beta$ and not $x$! In that case, taking the derivative with respect to $\beta$ and setting it equal to 0, we get $$\frac{1}{e^x}x^\beta\ln(x) = 0$$ which is only possible when $x = 1$, having nothing to do with $\beta$.

So forgive me, but I'm struggling to understand just what is being minimized and if it even can be. I admit, although I use calculus every now and then, my last rigorous exposure was about 30 years ago, so I may be rusty.

jjoba commented 1 month ago

Thanks for the follow up!

Descriptively, here's what I'm trying to solve:

This is inspired by Media Mix Modelling (or, more generally, budget allocation). The goal is to find how much to "spend" ($x$) in each media channel ($i$) to maximize net revenue (return minus channel spend). Each channel has diminishing returns as spend increases (modeled by $\ln(x)$), but the slope/effectiveness of each channel is different ($b$; I'm using an exponentiated normal here just so that all of the coefficients are positive).

Ultimately, the goal is to maximize net revenue $\sum_i (\beta_i \ln(x_i) - x_i)$, sign-flipped to $-\sum_i (\beta_i \ln(x_i) - x_i)$ in order to run it as a minimization. It's also subject to the constraint $\sum_i x_i \leq \text{Budget}$ (not included in the code above; planning to add it later on). The key variable I'm trying to solve for is the optimal $x_i$.
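In case it's useful context, here's a rough sketch of how I'm planning to pass that budget constraint later (the budget value below is just a placeholder; nloptr wants inequality constraints expressed as g(x) <= 0):

```r
# Sketch of the budget constraint sum(x) <= budget for SLSQP/MMA.
# `budget` is a placeholder value chosen only for illustration.
budget <- 250

# nloptr expects inequality constraints written as g(x) <= 0
g_budget <- function(x, n, b) sum(x) - budget
g_budget_jac <- function(x, n, b) rep(1, n) # Jacobian of g: a row of ones

nloptr(
  x0 = rep(2, n),
  eval_f = f_obj,
  eval_grad_f = f_grad,
  lb = rep(1, n),
  eval_g_ineq = g_budget,
  eval_jac_g_ineq = g_budget_jac,
  opts = list(
    "algorithm" = "NLOPT_LD_SLSQP",
    "xtol_rel" = 1e-08,
    "maxeval" = 1000),
  n = n,
  b = b
)
```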

You're absolutely right about the derivative of each element. Interestingly, when I supply the gradient you derived to L-BFGS-B, I get an "abnormal termination in lnsrch" error.

aadler commented 1 month ago

So your spends are the $x_i$. What is the variable that represents the return? Are you assuming each and every return is a function of the logarithm of the spend, so that the revenue for the $i^{th}$ item is $\beta_i\ln(x_i) - x_i$? Then you have a nonlinear programming problem:

Maximize $$\sum_{i} (\beta_i\ln(x_i) - x_i)$$ subject to $$\sum_i x_i \leq x_T$$

$$x_i \geq 0$$ where $x_T$ is the total available budget. You also need to know all the $\beta_i$ beforehand, since they are parameters and not variables. Given that you tested a 300-dimensional problem, you have no way of really knowing whether the failure was in the algorithm or in data that simply does not allow for convergence.

My suggestion is, if you can, to start with very few variables. Even better, start with a restricted case you can solve analytically, if possible, and see whether the algorithm converges. Only then scale up.
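For the test problem in this thread that check is actually easy: with only the lower bound active, each term $\beta_i\ln(x_i) - x_i$ is concave with its maximum at $x_i = \beta_i$, so the optimum is just pmax(b, 1). A sketch along those lines (the elementwise gradient 1 - b/x is my assumption for what eval_grad_f should return, one partial per $x_i$):

```r
# Sketch: check the solver against the closed-form optimum of the test
# problem above (lower bound x_i >= 1, no budget constraint).
x_star <- pmax(b, 1)   # per-channel optimum of b_i*log(x_i) - x_i, clipped at 1
f_obj(x_star, n, b)    # objective value at the analytic optimum

res <- nloptr(
  x0 = rep(2, n),
  eval_f = f_obj,
  eval_grad_f = function(x, n, b) 1 - b / x, # elementwise gradient (assumed form)
  lb = rep(1, n),
  opts = list(
    "algorithm" = "NLOPT_LD_LBFGS",
    "xtol_rel" = 1e-08,
    "maxeval" = 1000),
  n = n,
  b = b
)

res$objective                    # should match f_obj(x_star, n, b)
max(abs(res$solution - x_star))  # should be close to zero if the solver converged
```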

Also, given that I understand the problem a bit better: if you are trying to minimize $$\sum_{i}(x_i - \beta_i\ln(x_i))$$ the gradient should be $$n - \sum_i \frac{\beta_i}{x_i}$$ Since each $\beta_i$ may be different, they cannot be pulled out of the sum.

jjoba commented 1 month ago

Thanks for all of the suggestions. I just had a new baby, so it will be quite a while before I'm able to follow up here on this again.

Based on where we're landing, this isn't truly an issue with nloptr but rather with my problem formulation, so I'm closing the issue so it doesn't remain open for too long. I'll follow back up if I'm still encountering problems after I get back to the project.