ecpolley / SuperLearner

Current version of the SuperLearner R package
272 stars 72 forks source link

method.CC class fails with duplicated algorithms #99

Closed benkeser closed 6 years ago

benkeser commented 7 years ago

I just ran into a bug with the CC methods. They don't like it when the columns of Z are duplicated. solve.QP throws an error because D is not pd. While it could be user error if methods end up duplicated (see my example below), it can also pop up e.g., when SL.glm and SL.step are used and SL.step lands on the full model. Here's an example that reproduces:

# simulate simple data
set.seed(1234)
n <- 100
A <- rbinom(n,1,0.5)
W <- data.frame(W1=rnorm(n),W2=rnorm(n))
Y <- A + W$W1 + W$W2 + rnorm(n)
# silly, but gets the point across
fit <- SuperLearner(Y = Y, X = data.frame(A=A,W=W),
                    SL.library = c("SL.glm","SL.glm","SL.mean","SL.mean"),
                    method="method.CC_LS")

Here's my proposed fix -- get rid of duplicated columns in Z before calling compute, throw a warning, and add in 0 weight for algorithms that are duplicated.

method.CC_LS_mod <- function()
{
    computeCoef = function(Z, Y, libraryNames, verbose, obsWeights, 
        ...) {
        cvRisk <- apply(Z, 2, function(x) mean(obsWeights * (x - 
            Y)^2))
        names(cvRisk) <- libraryNames
        compute <- function(x, y, wt = rep(1, length(y))) {
            wX <- sqrt(wt) * x
            wY <- sqrt(wt) * y
            D <- crossprod(wX)
            d <- crossprod(wX, wY)
            A <- cbind(rep(1, ncol(wX)), diag(ncol(wX)))
            bvec <- c(1, rep(0, ncol(wX)))
            fit <- quadprog::solve.QP(Dmat = D, dvec = d, Amat = A, 
                bvec = bvec, meq = 1)
            invisible(fit)
        }
        colDup <- which(duplicated(Z, MARGIN = 2))
        modZ <- Z
        if(length(colDup) > 0){
            warning(paste0("Algorithm ", colDup, " is duplicated. Setting weight to 0."))
            modZ <- modZ[,-colDup]
        }
        fit <- compute(x = modZ, y = Y, wt = obsWeights)
        coef <- fit$solution
        if(length(colDup) > 0){
            ind <- c(seq_along(coef),colDup-0.5)
            coef <- c(coef,rep(0,length(colDup)))
            coef <- coef[order(ind)]
        }
        if (any(is.na(coef))) {
            warning("Some algorithms have weights of NA, setting to 0.")
            coef[is.na(coef)] = 0
        }
        coef[coef < 1e-04] <- 0
        coef <- coef/sum(coef)
        if (!sum(coef) > 0) 
            warning("All algorithms have zero weight", call. = FALSE)
        list(cvRisk = cvRisk, coef = coef, optimizer = fit)
    }
    computePred = function(predY, coef, ...) {
        predY %*% matrix(coef)
    }
    out <- list(require = "quadprog", computeCoef = computeCoef, 
        computePred = computePred)
    invisible(out)
}
# try again
fit2 <- SuperLearner(Y = Y, X = data.frame(A=A,W=W),
                    SL.library = c("SL.glm","SL.glm","SL.mean","SL.mean"),
                    method="method.CC_LS_mod")
# should have 0 for second SL.glm and SL.mean
fit2
ck37 commented 7 years ago

Thanks David, this seems fair to me. Also helpful to know that CC_* are being used.

Out of curiosity, is there an advantage to using the CC_ algorithms over NNLS? Maybe the fact that it's a true convex combination makes it theoretically preferable?

benkeser commented 7 years ago

Sure thing.

I haven't noticed an appreciable difference in performance comparing NNLS vs. CC (though I haven't studied it directly). I think the CC method is better motivated by the theory in that the convex weights minimize cross-validated risk directly. NNLS (as far as I can tell) minimizes cv-risk over non-negative weights and then re-scales by the sum. However, coef/sum(coef) isn't necessarily the minimizer of cv-risk over convex weights (I don't think). Obviously, simulations have shown that NNLS works out pretty darn well, so probably not a big deal.

benkeser commented 7 years ago

It occurs to me that if you're interested in deploying this, it might be worth benchmarking the line

colDup <- which(duplicated(Z, MARGIN = 2))

in situations with large Z. duplicated is pretty fast, but might be slow in large data sets.

A faster option is to check for duplicated values in cvRisk. Of course, pathological situations might occur where two distinct methods give identical cv risks. I'll leave it to you guys to determine what's preferable.

Thanks for the response!

ck37 commented 7 years ago

Good point, maybe the best approach is check for duplication in cvRisk, and if there is duplication use duplicated() on Z to find the duplicated algorithms. Could even restrict the duplicated() analysis to the subset of Z with duplicates in cvRisk.

benkeser commented 7 years ago

Yep, that's probably the smart way to do it. Just a bit of bookkeeping to add.