Feature request: MARS + ECM

crossxwill commented 3 years ago

I thought it would be fun to swap out lm with earth to get non-linear features into the equilibrium regression and transient terms. May cause some overfitting.

#'Build an error correction model with Multivariate Adaptive Regression Splines 
#'
#'Builds an earth object that represents an error correction model (ECM) by automatically differencing and
#'lagging predictor variables according to ECM methodology.
#'@param y The target variable
#'@param xeq The variables to be used in the equilibrium term of the error correction model
#'@param xtr The variables to be used in the transient term of the error correction model
#'@param includeIntercept Boolean whether the y-intercept should be included
#'@param weights Optional vector of weights to be passed to the fitting process
#'@param ... Additional arguments to be passed to the 'earth' function (careful in that these may need to be modified for ecm or may not be appropriate!)
#'@return an earth object representing an error correction model
#'@details
#'Similar to the original ecm function, but replaces the 'lm' estimator with the 'earth' estimator. 
#'Multivariate Adaptive Regression Splines (MARS) transforms each continuous variable with piecewise-linear hinge functions. This allows for non-linear features in both the transient and equilibrium terms.
#'In addition, MARS performs feature selection.
#''yLag1' is forced to be linear.
#'@seealso \code{earth}
#'@examples
#'##Not run
#'
#'library(ecm)
#'library(earth)
#'library(plotmo)
#'
#'#Use ecm to predict Wilshire 5000 index based on corporate profits, 
#'#Federal Reserve funds rate, and unemployment rate
#'data(Wilshire)
#'
#'#Use 2015-12-01 and earlier data to build models
#'trn <- Wilshire[Wilshire$date<='2015-12-01',]
#'
#'#Assume all predictors are needed in the equilibrium and transient terms of ecm
#'xeq <- xtr <- trn[c('CorpProfits', 'FedFundsRate', 'UnempRate')]
#'model1 <- ecm_earth(trn$Wilshire5000, xeq, xtr, includeIntercept=TRUE)
#'
#'plotmo(model1)
#'
#'#Use 5-fold cross-validation to perform variable selection
#'set.seed(2020)

#'model2 <- ecm_earth(trn$Wilshire5000, xeq, xtr, includeIntercept=TRUE, pmethod="cv", nfold=5)
#'
#'plotmo(model2)
#'
#'@export
#'@importFrom stats lm
ecm_earth <- function (y, xeq, xtr, lags = 1, includeIntercept = TRUE, weights = NULL, 
                       ...) 
{
  if (sum(grepl("^delta|Lag[0-9]$", names(xtr))) > 0 | 
      sum(grepl("^delta", names(xeq))) > 0) {
    warning("You have column name(s) in xeq or xtr that begin with 'delta' or end with 'Lag[0-9]'. It is strongly recommended that you change this, otherwise the function 'ecmpredict' may result in errors or incorrect predictions.")
  }
  if (!inherits(xtr, "data.frame") | !inherits(xeq, "data.frame")) {
    stop("xeq or xtr does not inherit class 'data.frame'. See details on how to input them as data frames.")
  }
  if (nrow(xeq) < (lags + 1)) {
    stop("Insufficient data for the lags specified.")
  }
  xeqnames <- names(xeq)
  xeqnames <- paste0(xeqnames, paste0("Lag", as.character(lags)))
  xeq <- data.frame(sapply(xeq, lagpad, lags))
  xtrnames <- names(xtr)
  xtrnames <- paste0("delta", xtrnames)
  xtr <- data.frame(apply(xtr, 2, diff, lags))
  if (class(y) == "data.frame") {
    if (ncol(y) > 1) {
      warning("You have more than one column in y, only the first will be used")
    }
    y <- y[, 1]
  }
  yLag <- y[1:(length(y) - lags)]
  x <- cbind(xtr, xeq[complete.cases(xeq), ])
  x <- cbind(x, yLag)
  names(x) <- c(xtrnames, xeqnames, paste0("yLag", as.character(lags)))
  x$dy <- diff(y, lags)
  if (includeIntercept){
    ecm <- earth::earth(dy ~ ., data = x, weights = weights, linpreds=c('yLag1'), ...)
  } else {
    ecm <- earth::earth(dy ~ . - 1, data = x, weights = weights, linpreds=c('yLag1'), ...)
  }
  return(ecm)
}

ecmpredict <- function (model, newdata, init) {
  if(class(model)=='earth'){
    coef_names = model$namesx
  } else{
    coef_names = names(model$coefficients)
  }

  lags <- as.integer(substring(tail(coef_names, 
                                    1), nchar(tail(coef_names, 1))))
  if (nrow(newdata) < (lags + 2)) {
    stop(paste0("Your input for 'newdata' has fewer rows than necessary to predict on a model with ", 
                lags, " lags."))
  }
  if (sum(grepl("^delta", coef_names)) >= 
      1) {
    form <- coef_names
    xtrnames <- form[grep("^delta", form)]
    xtrnames <- substr(xtrnames, 6, max(nchar(xtrnames)))
    xtr <- newdata[which(names(newdata) %in% xtrnames)]
    xtr <- data.frame(apply(xtr, 2, diff, lags))
    names(xtr) <- paste0("delta", names(xtr))
    xtrnames <- names(xtr)
  }
  if (sum(grepl("Lag[0-9]$", coef_names)) > 
      1) {
    form <- coef_names
    xeqnames <- form[grep("^(?!delta).*", form, perl = T)]
    if ("(Intercept)" %in% xeqnames) {
      xeqnames <- xeqnames[-c(1, length(xeqnames))]
    }
    xeqnames <- substr(xeqnames, 1, unlist(lapply(gregexpr("Lag", 
                                                           xeqnames), function(x) x[length(x)])) - 1)
    xeq <- newdata[which(names(newdata) %in% xeqnames)]
    names(xeq) <- paste0(names(xeq), "Lag", lags)
    xeqnames <- names(xeq)
  }
  if (exists("xeq")) {
    xeq <- data.frame(sapply(xeq, lagpad, lags))
  }
  if (exists("xeq") & exists("xtr")) {
    x <- cbind(xtr, xeq[complete.cases(xeq), ])
    x$yLag1 <- init
    names(x) <- c(xtrnames, xeqnames, paste0("yLag", 
                                             lags))
  }
  else if (!exists("xeq") & exists("xtr")) {
    x <- xtr
    x$yLag1 <- init
    names(x) <- c(xtrnames, paste0("yLag", lags))
  }
  else if (exists("xeq") & !exists("xtr")) {
    x <- as.data.frame(xeq[complete.cases(xeq), ])
    x$yLag1 <- init
    names(x) <- c(xeqnames, paste0("yLag", lags))
  }
  modelpred <- predict(model, x[1, ])
  for (i in 2:nrow(x)) {
    x$yLag1[i] <- x$yLag1[i - 1] + modelpred
    modelpred <- predict(model, x[i, ])
  }
  modelpred <- predict(model, x)
  modelpred <- cumsum(c(init, modelpred))
  return(modelpred)
}

gaurbans commented 3 years ago

I'm working on setting up 'ecm' and other functions to take in 'lm' or 'earth' as arguments for the linear fitter of choice. Is there a reason you've chosen to pass 'yLag1' as an argument for linpreds?

crossxwill commented 3 years ago

It's an attempt to preserve the "theory" of ECM (i.e., prior deviations from equilibrium are followed by a correction), but allowing for non-linear features to affect the equilibrium levels of 'y'. Suppose the equilibrium levels of 'y' is a parabolic function of 'x'.

Then the ECM equation would look like:

So yLag1 is still linear despite a non-linear transformation of 'x'. Generally speaking, we could allow for arbitrary non-linear transformations of the features. The 'earth' function would figure out the 'f' and 'g' functions.

Just my two cents. I haven't actually seen this done in literature because most economists assume 'f' and 'g' are linear.

gaurbans commented 3 years ago

Got it, thanks.

gaurbans commented 3 years ago

Sorry for the delay. The 'ecm' function can now fit models using the 'earth' algorithm, and 'ecmpredict' can predict on said models.

gaurbans / ecm

Feature request: MARS + ECM #4