optimize MLE - Githubissues

aappling-usgs commented 8 years ago

MLE models are much slower now than they could be with faster R implementation or, maybe better, Rcpp. There's also some funkiness with nlm+ode+lsoda that might possibly be solved with optim or nlminb (haven't tried)

[ ] if i'm going to optimize for time, i should probably start tracking runtime now
[ ] implement core loops in Rcpp a la https://github.com/GLEON/LakeMetabolizer/blob/master/src/mleLoop.c and http://adv-r.had.co.nz/Rcpp.html
[ ] consider nlminb (with constraints?) rather than nlm? http://stats.stackexchange.com/questions/9535/when-should-i-not-use-rs-nlm-function-for-mle, http://stackoverflow.com/questions/3757321/moving-beyond-rs-optim-function

aappling-usgs commented 8 years ago

Playing with implementing core loops in C for the call to ode(). It works! But would need a lot more to be integrated with the package. Here are the two files I wrote to experiment; I put the .c file in /src and ran the code from the .R file manually in the console.

src/m_np_oi_eu_plrckm.c

#include <R.h>

/* Import forcing data */
#define n_forcings 7
static double forcings[n_forcings];
#define DO_obs     forcings[0]
#define DO_sat     forcings[1]
#define depth      forcings[2]
#define temp_water forcings[3]
#define light      forcings[4]
#define KO2_conv      forcings[5]
#define err_proc      forcings[6]

void read_forcings (void (* odeforcs)(int *, double *))
{
  int N = n_forcings;
  odeforcs(&N, forcings);
}

// import params once for each set of ODE iterations
#define n_params 5
static double params[n_params];
#define timestep   params[0]
#define GPP_daily  params[1]
#define ER_daily   params[2]
#define K600_daily params[3]
#define mean_light params[4]

void read_params (void (* odeparms)(int *, double *))
{
  // odeparms is a function from deSolve that fills a double array with double 
  // precision values, to copies the parameter values into the global variable params
  int N=n_params;
  odeparms(&N, params);
}

// @param *neq  pointer to the number of equations
// @param *t    pointer to the value of the independent variable
// @param *y    pointer to a double precision array of length *neq that contains
//   the current value of the state variables
// @param *ydot pointer to an array that will contain the calculated derivatives
// @param *yout pointer to a double precision vector whose nout values are other
//   output variables (different from the state variables y), and the next 
//   values are double precision values as passed by parameter rpar when calling
//   the integrator. The key to the elements of *yout is set in *ip
// @param *ip   pointer to an integer vector whose length is at least 3; the
//   first element (ip[0]) contains the number of output values (which should be
//   equal or larger than nout), its second element contains the length of
//   *yout, and the third element contains the length of *ip; next are integer
//   values, as passed by parameter ipar when calling the integrator
#define dDOdt_mod ydot[0]
#define DO_mod    y[0]
#define GPP_inst  yout[0]
#define ER_inst   yout[1]
#define KO2_inst  yout[2]

void calc_dDOdt (int *neq, double *t, double *y, double *ydot, double *yout, int *ip)
{
  GPP_inst = GPP_daily * light / mean_light;
  ER_inst = ER_daily;
  KO2_inst = K600_daily * KO2_conv;

  dDOdt_mod = (
    (GPP_inst + ER_inst + err_proc) / depth +
      KO2_inst * (DO_sat - DO_mod)
  ) * timestep;
}

try_c_desolve.R

# In this file we use a compiled C function as the workhorse model in the ode() 
# call. this should be a lot faster than my very inefficient closure-based R
# implementation.

# typical data prep
library(streamMetabolizer)
library(dplyr)
data <- data_metab('1','30')[seq(1,48,by=2),]
dDOdt.obs <- diff(data$DO.obs)
preds.init <- as.list(dplyr::select(get_params(metab(specs(mm_name('mle', ode_method='euler')), data=data)), GPP.daily,ER.daily,K600.daily))
model_name <- "m_np_oi_eu_plrckm.nlm"

# data preparation specific to using a compiled C function in deSolve::ode()
params <- c(timestep=mm_get_timestep(data$solar.time), unlist(preds.init)) # it's a list above, but we need a vector w/ timestep first
if(isTRUE(mm_parse_name(model_name)$GPP_fun == 'linlight')) {
  # normalize light by the sum of light in the first 24 hours of the time window
  in_solar_day <- which(data$solar.time < data$solar.time[1] + as.difftime(1, units='days'))
  params['mean.light'] <- mean(data$light[in_solar_day])
}
dat <- data %>%
  mutate( # add time-dependent data & pre-calculated values
    t=1:n(), 
    KO2.conv=convert_k600_to_kGAS(k600=1, temperature=temp.water, gas='O2'), 
    err.proc=0) %>%
  select(t, DO.obs, DO.sat, depth, temp.water, light, KO2.conv, err.proc) %>%
  as.matrix() # need a double matrix
forcings <- lapply(2:ncol(dat), function(col) dat[,c(1,col)]) # need a list of 2-col matrices, 1 per variable

# compile and load the C function
# dyn.unload(paste("src/m_np_oi_eu_plrckm", .Platform$dynlib.ext, sep = "")) # only needed if experimenting with the C code
system("R CMD SHLIB src/m_np_oi_eu_plrckm.c")
dyn.load(paste("src/m_np_oi_eu_plrckm", .Platform$dynlib.ext, sep = ""))

# integrate from a dDOdt function to a timeseries of DO estimates
DO_mod <- ode(
  # inputs
  y = c(DO.mod=dat[[1,'DO.obs']]), 
  ynames = FALSE, # save a tiny bit of time by not passing names we won't use
  times = dat[,'t'], 
  parms = params, # time-independent inputs go here
  forcings = forcings, # time-dependent inputs go here. if needed, this can include err.proc
  fcontrol=list(method="linear", rule = 2, f = 0, ties = "ordered"), # specifies how forcings are interpolated

  # output description: number and names of variables to collect (in addition to time and DO.mod)
  nout = 3,
  outnames = c('GPP.inst','ER.inst','KO2.inst'),

  # function information. across model structures, everything here can be the
  # same except the dllname (and therefore the implementations of the 3 funcs)
  initforc = "read_forcings",
  initfunc = "read_params",
  func = "calc_dDOdt",
  dllname = "m_np_oi_eu_plrckm",

  # ODE integration method
  method = "rk4")

aappling-usgs commented 8 years ago

Things that would need to be done to implement compiled C functions for every MLE model:

Stop offering a manual euler / trapezoid integration and change deSolve from Suggests to Imports. Calling the C function directly would involve transforming the inputs from the format needed to call ode() into the format ode shares with its functions, and is probably more work than is useful.
Write a C-code generating function, much like mm_generate_mcmc_files. make sure it creates the files before they get compiled on package build
Configure the package so it builds the C files, which should be placed in the /src directory. RStudio will do a lot of this for us, but also see http://adv-r.had.co.nz/Rcpp.html#rcpp-package

Reduce the number of things that create_calc_dDOdt, create_calc_DO, and create_calc_NLL are used for:

calc_NLL is used by metab_mle to model NLL, which is fine, but also to extract parnames, aka metab.needs
calc_DO is used to make calc_NLL in metab_mle, and to model DO in calc_NLL and mm_predict_DO_1ply, all of which is fine
calc_dDOdt is used to create calc_DO and to model dDOdt in metab_mle and calc_DO, which is fine. But it's also used to extract metab.needs in specs, get_params, and calc_NLL. And it's used to extract teh ode_method in calc_DO. And it's used to extract DO.obs in calc_NLL. and it's used to extract ode_method and DO.obs.1 and t in calc_DO. And it's used to extract depth, t, GPP_inst(), ER_inst(), and D_inst() in mm_predict_metab_1ply. Most of those things, with the exception of GPP_inst() and ER_inst(), should happen elsewhere.

Put the extras that the old create_... functions offered somewhere else.

parnames/metab.needs could come out of specs, as it does for the Bayesian models.
ode_method should come from features(model_name)
DO.obs, DO.obs.1, t, and depth should come from data.

DOI-USGS / streamMetabolizer

optimize MLE #220