astamm / nloptr

nloptr provides an R interface to NLopt, a free/open-source library for nonlinear optimization providing a common interface to a number of different optimization routines which can handle nonlinear constraints and lower and upper bounds for the controls.
https://astamm.github.io/nloptr/

Python vs R Speed Differences #125

Closed KryeKuzhinieri closed 1 year ago

KryeKuzhinieri commented 1 year ago

Hi,

We have some old R code that we are trying to convert to Python; it uses nloptr. Although both implementations use the same input parameters, the converted Python code runs significantly slower than the R code. To demonstrate, let's compare the R code with the Python one.


# required packages: nloptr for the optimizer, numDeriv for grad()
library(nloptr)
library(numDeriv)

# EVAL_F
SSE_m2 <- function(p) {
  Vc <- p[1]
  A <- p[2]
  cruiseDistance <- (Vc^2)/(2*A)

  #evaluate m(d) and c(d) functions
  m_d <- ifelse(
    ZoneData$distance<=2*cruiseDistance,
    log(2*sqrt(ZoneData$distance/A)),
    log((Vc/A)+(ZoneData$distance/Vc))
  )

  e_sq <- (m_d-log(ZoneData$travel_sec))^2
  SSE <- sum(e_sq)

  return(SSE)
}

# EVAL_GRAD
GradSSE_m2 <- function(p_m2){
  gradtest <- grad(SSE_m2,p_m2)
  return(gradtest)
}

# local options for nloptr
local_opts <- list( "algorithm" = "NLOPT_LD_MMA",
                    "xtol_rel"  = 1.0e-7 )

#global optimization options for Nloptr
optionslist <- list( "algorithm" = "NLOPT_LD_LBFGS",
                     "xtol_rel"  = 1.0e-7,
                     "maxeval"   = 1000,
                     "local_opts" = local_opts )

lbound_m2 <- c(rep(1e-317,2))
ubound_m2 <- c(50,2)
p_m2_start <- c(5, 1e-5)

result <- nloptr(x0 = p_m2_start,
                 eval_f = SSE_m2,
                 eval_grad_f = GradSSE_m2,
                 lb = lbound_m2,
                 ub = ubound_m2,
                 opts = optionslist)

On the other hand, the python code is as follows:

# required packages: nlopt bindings, numpy, numdifftools for the numerical gradient
# (zone_data is a pandas DataFrame with 'distance' and 'travel_sec' columns)
import nlopt
import numpy as np
import numdifftools as nd

p_m2_start = [5.e+00, 1.e-05]
ubound_m2 = [50, 2]
lbound_m2 = [1e-317, 1e-317]

def formula(x):
    vc = x[0]
    a = x[1]
    cruise_distance = vc**2 / (2*a)
    # evaluate m(d) for each distance (piecewise, as in the R version)
    m_d = zone_data["distance"].apply(
        lambda d: np.log(2*np.sqrt(d/a))
        if d <= 2*cruise_distance
        else np.log((vc/a) + (d/vc))
    )
    e_sq = (m_d - np.log(zone_data["travel_sec"]))**2
    sse = sum(e_sq)
    return sse

# nlopt objective: fill grad in place when a gradient is requested, return the objective value
def objective_function(x, grad):
    if grad.size > 0:
        # numerical gradient via numdifftools
        gradients = nd.Gradient(formula)(x)
        grad[0] = gradients[0]
        grad[1] = gradients[1]
    sse = formula(x)
    return sse

opt = nlopt.opt(nlopt.LD_LBFGS, 2)
opt.set_min_objective(objective_function)
opt.set_lower_bounds(lbound_m2)
opt.set_upper_bounds(ubound_m2)
opt.set_xtol_rel(1.0e-7)
opt.set_maxeval(1000)
local_opt = nlopt.opt(nlopt.LD_MMA, 2)
local_opt.set_xtol_rel(1.0e-7)
opt.set_local_optimizer(local_opt)
y = opt.optimize(np.array(p_m2_start))

Now, both of these functions yield the same result in the end. However, there are two issues:

  1. The Python code takes more iterations to converge,
  2. The Python code takes more time to run.

On first inspection, I can see that the Python code calls the gradient function on every iteration, whereas the R code calls eval_grad_f only every 7 or so iterations.
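(For reference, one way to see this is to wrap the objective with simple call counters. The sketch below is only illustrative and reuses formula, nd, and opt from the snippet above; the counter names are mine.)

# count how often nlopt asks for the objective vs. the gradient
n_obj_calls = 0
n_grad_calls = 0

def counted_objective(x, grad):
    global n_obj_calls, n_grad_calls
    n_obj_calls += 1
    if grad.size > 0:
        n_grad_calls += 1
        gradients = nd.Gradient(formula)(x)
        grad[0] = gradients[0]
        grad[1] = gradients[1]
    return formula(x)

opt.set_min_objective(counted_objective)
y = opt.optimize(np.array(p_m2_start))
print(f"objective calls: {n_obj_calls}, gradient calls: {n_grad_calls}")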

I went ahead and compiled the nloptr package on my local machine, and I can see that more operations are going on under the hood, whereas the Python code just calls the standard nlopt package directly. Would it be possible to give me some pointers on the extra steps taken in the R code so that we can speed up our Python algorithm?

Thanks a lot!

Note: This is not an issue or a bug in nloptr; I am only asking for suggestions.

eddelbuettel commented 1 year ago

My first instinct would be to suspect that the Python package is built against a different (older) NLopt library, so I would try to ensure that the Python side is compiled locally the same way as the R side. The next thing to check would be the parameterization.

Also, if you run Python from pre-made wheels, they are built for maximum distribution and will have minimal optimisation. But that would likely not explain behavior like fewer calls to the gradient evaluation.
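For example, a quick way to check which NLopt version the Python bindings were built against (assuming the standard nlopt Python module; compare this against the NLopt version nloptr was built with):

import nlopt
# NLopt library version this Python module was linked against
print(f"NLopt {nlopt.version_major()}.{nlopt.version_minor()}.{nlopt.version_bugfix()}")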

KryeKuzhinieri commented 1 year ago

Hi @eddelbuettel. Your feedback was invaluable because it made me profile the entire function, which helped identify the bottleneck.

It appears that the issue was with gradients = nd.Gradient(formula)(x), which was taking too much time to compute the gradient of the function. Changing it to gradients = nd.Gradient(formula, method="complex", order=0)(x) reduced the time quite significantly. Moreover, this helped me realize that the code could be optimized further, and I changed the pd.apply call to np.where, which also increased the speed (see the sketch below). But even though the code was much faster, it was still slower than the R version.
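For reference, the np.where version of formula looks roughly like this (a sketch, not the exact final code; zone_data is the same DataFrame as above and the function name is illustrative):

# vectorized variant of formula(): np.where replaces the per-row pd.apply
def formula_vectorized(x):
    vc, a = x[0], x[1]
    cruise_distance = vc**2 / (2 * a)
    d = zone_data["distance"].to_numpy()
    m_d = np.where(
        d <= 2 * cruise_distance,
        np.log(2 * np.sqrt(d / a)),
        np.log((vc / a) + (d / vc)),
    )
    e_sq = (m_d - np.log(zone_data["travel_sec"].to_numpy()))**2
    return e_sq.sum()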

Eventually, I thought that using .backward() from the PyTorch library could help, and it did: the implementation is now much faster than the R version.

So, for people who might encounter similar problems, the steps I took were:

  1. Changed the pandas DataFrame to PyTorch tensors,
  2. Enabled requires_grad for the x variable,
  3. Ran (a fuller sketch follows after this list):
    if grad.size > 0:
        inputs = torch.tensor(x, requires_grad=True)
        gradients = formula(inputs, grad=True)
        gradients.backward()
        grad[0] = inputs.grad[0].item()
        grad[1] = inputs.grad[1].item()
    sse = formula(x)
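Putting the pieces together, a rough, self-contained sketch of the torch-based objective might look like this (the torch.where formulation, the function and variable names, and the precomputed tensors are illustrative, not the exact final code):

# sketch: autograd gradient for nlopt instead of numerical differentiation
import nlopt
import numpy as np
import torch

# precompute the data tensors once (zone_data as before)
distance = torch.tensor(zone_data["distance"].to_numpy(), dtype=torch.float64)
log_travel = torch.log(torch.tensor(zone_data["travel_sec"].to_numpy(), dtype=torch.float64))

def formula_torch(p):
    vc, a = p[0], p[1]
    cruise_distance = vc**2 / (2 * a)
    # torch.where keeps the piecewise computation differentiable
    m_d = torch.where(
        distance <= 2 * cruise_distance,
        torch.log(2 * torch.sqrt(distance / a)),
        torch.log((vc / a) + (distance / vc)),
    )
    return torch.sum((m_d - log_travel)**2)

def objective_torch(x, grad):
    inputs = torch.tensor(x, dtype=torch.float64, requires_grad=True)
    sse = formula_torch(inputs)
    if grad.size > 0:
        sse.backward()                     # autograd fills inputs.grad
        grad[:] = inputs.grad.numpy()
    return sse.item()

opt = nlopt.opt(nlopt.LD_LBFGS, 2)
opt.set_min_objective(objective_torch)
opt.set_lower_bounds(lbound_m2)
opt.set_upper_bounds(ubound_m2)
opt.set_xtol_rel(1.0e-7)
opt.set_maxeval(1000)
y = opt.optimize(np.array(p_m2_start))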

Thanks a lot for the suggestions!