ecpolley / SuperLearner

Current version of the SuperLearner R package

Multi-class Classification #16

Open JackStat opened 9 years ago

JackStat commented 9 years ago

Is there any work on adding multi-class classification capabilities? Maybe we could start something with gbm.

ecpolley commented 9 years ago

I haven't been working on adding multi-class classification capabilities to the existing code. In practice, I often split the multi-class problem into a collection of binary classification problems. Say you have 3 classes (A, B, and C): you could fit binary classifiers for A vs. B or C, B vs. A or C, and C vs. A or B, then combine the results to make a classification based on the highest probability estimate. The probability estimates are not correct because they are not constrained to sum to 1, but the approach does allow flexibility in the classifiers for the different categories. Here is a quick example:

## multi-class classification
library(SuperLearner)
set.seed(843)
N <- 100

# outcome
Y <- sample(c("A", "B", "C"), size = N, replace = TRUE, prob = c(.1, .5, .4))

# variables
X1 <- rnorm(n = N, mean = (as.numeric(Y == "A") + .5*(as.numeric(Y == "C"))), sd = 1)
X2 <- rnorm(n = N, mean = (as.numeric(Y == "B")), sd = 1)
X3 <- rnorm(n = N, mean = (-1*as.numeric(Y == "B" | Y == "C")), sd = 1)
X4 <- rnorm(n = N, mean = X2, sd = 1)
X5 <- rnorm(n = N, mean = (X1*as.numeric(Y == "A") + as.numeric(Y == "A" | Y == "C")), sd = 1)

DAT <- data.frame(X1, X2, X3, X4, X5)

# test Data
# outcome
M <- 10000
Y_test <- sample(c("A", "B", "C"), size = M, replace = TRUE, prob = c(.1, .5, .4))

# variables
X1_test <- rnorm(n = M, mean = (as.numeric(Y_test == "A") + .5*(as.numeric(Y_test == "C"))), sd = 1)
X2_test <- rnorm(n = M, mean = (as.numeric(Y_test == "B")), sd = 1)
X3_test <- rnorm(n = M, mean = (-1*as.numeric(Y_test == "B" | Y_test == "C")), sd = 1)
X4_test <- rnorm(n = M, mean = X2_test, sd = 1)
X5_test <- rnorm(n = M, mean = (X1_test*as.numeric(Y_test == "A") + as.numeric(Y_test == "A" | Y_test == "C")), sd = 1)

DAT_test <- data.frame(X1 = X1_test, X2 = X2_test, X3 = X3_test, X4 = X4_test, X5 = X5_test)

# figure
# library(GGally)
# DAT2 <- data.frame(Y, DAT)
# ggpairs(DAT2, color = "Y")

# create the 3 binary variables
Y_A <- as.numeric(Y == "A")
Y_B <- as.numeric(Y == "B")
Y_C <- as.numeric(Y == "C")

# simple library, should include more classifiers
SL.library <- c("SL.gbm", "SL.glmnet", "SL.glm", "SL.knn", "SL.gam", "SL.mean")

# fit one binary super learner per class
# (method.NNLS = non-negative least squares meta-learner)
fit_A <- SuperLearner(Y = Y_A, X = DAT, newX = DAT_test, SL.library = SL.library,
                      verbose = FALSE, method = "method.NNLS", family = binomial(),
                      cvControl = list(stratifyCV = TRUE))
fit_B <- SuperLearner(Y = Y_B, X = DAT, newX = DAT_test, SL.library = SL.library,
                      verbose = FALSE, method = "method.NNLS", family = binomial(),
                      cvControl = list(stratifyCV = TRUE))
fit_C <- SuperLearner(Y = Y_C, X = DAT, newX = DAT_test, SL.library = SL.library,
                      verbose = FALSE, method = "method.NNLS", family = binomial(),
                      cvControl = list(stratifyCV = TRUE))

SL_pred <- data.frame(pred_A = fit_A$SL.predict[, 1], pred_B = fit_B$SL.predict[, 1], pred_C = fit_C$SL.predict[, 1])
Classify <- apply(SL_pred, 1, function(xx) c("A", "B", "C")[unname(which.max(xx))])
table(Classify, Y_test)
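
A quick follow-up on the caveat above: a simple heuristic for the fact that the one-vs-rest probabilities do not sum to 1 is to renormalize each row of SL_pred. This does not change which class is picked (row scaling preserves the argmax), but the rescaled values at least lie on the probability simplex:

# heuristic fix: rescale the one-vs-rest probabilities so each row sums to 1
SL_pred_norm <- SL_pred / rowSums(SL_pred)
head(SL_pred_norm)
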
ledell commented 9 years ago

Multi-class classification is something I have thought about adding. A reasonable way to implement it would be multi-response linear regression (MLR); details are in this paper: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume10/ting99a.pdf
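
As a rough sketch (not the paper's exact formulation), an MLR combiner could regress the class indicators on the stacked cross-validated class probabilities of the base learners. The names mlr_stack, Z, and Y_ind below are illustrative only:

# Rough sketch of a multi-response linear regression (MLR) combiner.
# Z: N x (K*J) matrix of cross-validated class probabilities from K learners
# Y_ind: N x J matrix of 0/1 class indicators
mlr_stack <- function(Z, Y_ind) {
  B <- qr.solve(Z, Y_ind)        # least-squares coefficients, one column per class
  function(Z_new) {
    P <- Z_new %*% B             # combined scores for new data
    P <- pmax(P, 0)              # truncate negative scores (heuristic)
    P / rowSums(P)               # renormalize rows onto the simplex
  }
}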

JackStat commented 9 years ago

You should be able to optimize the weights of the different models against the multi-class log-loss function, right?

ecpolley commented 9 years ago

Yes, if each base learner in the library outputs a vector of predicted probabilities for the classes, you could define a convex combination of the predicted probabilities by minimizing the V-fold cross-validated multi-class log loss estimate. Can you suggest some examples of base learners that return probability vectors?
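
A minimal sketch of that meta-learning step, assuming the V-fold cross-validated probability matrices are already computed. Here cv_probs (a list of N x J matrices, one per learner, with columns in factor-level order) and labels (a factor of observed classes) are illustrative names, not part of the package:

# find convex weights over base learners by minimizing the
# cross-validated multi-class log loss (the softmax parameterization
# keeps the weights non-negative and summing to 1)
mc_logloss_weights <- function(cv_probs, labels, eps = 1e-15) {
  Y_ind <- model.matrix(~ labels - 1)              # N x J class indicators
  obj <- function(par) {
    w <- exp(par) / sum(exp(par))                  # map to the simplex
    P <- Reduce(`+`, Map(`*`, cv_probs, w))        # weighted probability mixture
    -mean(rowSums(Y_ind * log(pmin(pmax(P, eps), 1 - eps))))
  }
  opt <- optim(rep(0, length(cv_probs)), obj, method = "BFGS")
  exp(opt$par) / sum(exp(opt$par))                 # the fitted convex weights
}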

JackStat commented 9 years ago

Sorry for the long delay. Here are a couple; randomForest is probably the easiest.

library(xgboost)

# iris has 3 classes; multi:softprob returns one probability per class
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = 3)

# xgboost expects 0-based integer labels
y <- as.numeric(iris[, "Species"]) - 1

x <- as.matrix(iris[, 1:4])
x <- matrix(as.numeric(x), nrow(x), ncol(x))

bstG <- xgboost(params = param, data = x, label = y, nrounds = 100)

# predictions come back as one long vector; reshape to an N x 3 matrix
xgG <- predict(bstG, x)
xgG <- t(matrix(xgG, 3, length(xgG) / 3))

####################

library(randomForest)

# out-of-bag predicted class probabilities, an N x 3 matrix
rr <- randomForest(Species ~ ., data = iris)
predict(rr, type = "prob")

ck37 commented 8 years ago

Polymars was also designed specifically for multi-class classification (see Section 6 on "polyclass" in http://projecteuclid.org/euclid.aos/1031594728).
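
For reference, polyclass is in the polspline package. A quick sketch along these lines, assuming polspline's documented polyclass/ppolyclass interface (worth double-checking against the package help pages):

library(polspline)

# polyclass takes integer class labels starting at 0
fit <- polyclass(as.numeric(iris$Species) - 1, as.matrix(iris[, 1:4]))

# fitted class probabilities for a covariate matrix: an N x 3 matrix
head(ppolyclass(cov = as.matrix(iris[, 1:4]), fit = fit))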

ck37 commented 8 years ago

Looks like the code for a bunch of wrappers already exists! We just need to integrate it:

ae-tate commented 3 years ago

Was this ever implemented? I keep running into errors when trying it out with SL.glmnet. The links ck37 posted are unfortunately down.

mrubinst757 commented 3 years ago

Same question; it would be great if this were an option.