RevolutionAnalytics / AzureML

An R interface to AzureML (https://studio.azureml.net/) experiments, datasets, and web services.

GBM example and possible problem #56

Closed bwlewis closed 8 years ago

bwlewis commented 8 years ago

@lixzhang reported trouble publishing an R GBM model as a web service. The following example illustrates how to do this using the example included in the gbm package:

library(gbm)

# Example from ?gbm:
set.seed(1)
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

  gbm1 <-
  gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
      data=data,                   # dataset
      var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                   # +1: monotone increase,
                                   #  0: no monotone restrictions
      distribution="gaussian",     # see the help for other choices
      n.trees=1000,                # number of trees
      shrinkage=0.05,              # shrinkage or learning rate,
                                   # 0.001 to 0.1 usually work
      interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
      train.fraction = 0.5,        # fraction of data for training,
                                   # first train.fraction*N used for training
      n.minobsinnode = 10,         # minimum total weight needed in each node
      cv.folds = 3,                # do 3-fold cross-validation
      keep.data=TRUE,              # keep a copy of the dataset with the object
      verbose=FALSE,               # don't print out progress
      n.cores=1)                   # use only a single core (detecting #cores is
                                   # error-prone, so avoided here)

  # check performance using 3-fold cross-validation
  best.iter <- gbm.perf(gbm1, method="cv", plot=FALSE)

# Record factor levels for posterity, see ?publishWebService...
data_levels <- data[1,]

# Publish this model as an AzureML web service
fun <- function(newdata)
{
  require(gbm)
  # code factor levels, see ?publishWebService
  newdata$X3 <- factor(newdata$X3, levels=data_levels$X3, ordered=TRUE)
  newdata$X4 <- factor(newdata$X4, levels=data_levels$X4)
  newdata$X5 <- factor(newdata$X5, levels=data_levels$X5)
  predict(gbm1, newdata, best.iter)
}
library(AzureML)
ws <- workspace()
ep <- publishWebService(ws, fun=fun, name="gbm", packages="gbm", inputSchema=data)

# Generate new data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE))
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)
data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# predict on the new data using "best" number of trees
# f.predict generally will be on the canonical scale (logit,log,etc.)
f.predict <- predict(gbm1,data2,best.iter)

# Use the AzureML web service to predict the same thing
a.predict <- consume(ep, data2)

# least squares error
print(crossprod(data2$Y-f.predict))
print(crossprod(data2$Y-a.predict$ans))

The model gets published and runs. However, I am concerned that, at least on my system, I get substantially different results between the local prediction and the one computed in AzureML.
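
An elementwise look at the two prediction vectors (just base R) makes the size of the discrepancy clearer than the aggregate errors above:

# Compare the local and remote predictions directly.
summary(f.predict - a.predict$ans)    # distribution of the differences
all.equal(f.predict, a.predict$ans)   # TRUE only if equal within tolerance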

Note that we don't see this inconsistency in the examples using glm or lmer in the help page for ?publishWebService.

The problem might be that I'm building the model locally with my version of R and a possibly different version of the gbm package than the one the R instance in AzureML uses. Perhaps a change in gbm between the quite old version available in AzureML and more recent versions is causing the problem.

Further investigation is warranted.
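
A first diagnostic (a sketch; the service name is arbitrary and this is untested): publish a trivial service that reports the remote R and gbm versions, and compare them with the local session.

library(AzureML)
ws <- workspace()

# Report the R and gbm versions of whatever session runs this function.
version_fun <- function(newdata)
{
  paste(R.version.string, "/ gbm", as.character(packageVersion("gbm")))
}
vep <- publishWebService(ws, fun=version_fun, name="version-check",
                         inputSchema=data.frame(x=1))

consume(vep, data.frame(x=1))$ans          # remote versions
paste(R.version.string, "/ gbm",
      as.character(packageVersion("gbm"))) # local versions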

bwlewis commented 8 years ago

Is it possible for an AzureML instance to mutate its state? If so, we could build the model once in AzureML the first time the service runs and predict from that model thereafter, avoiding any possibility of version inconsistency.

But so far I can't seem to find a way to get an instance running within AzureML to update itself...

bwlewis commented 8 years ago

@lixzhang points out that the results are in fact the same when the 'fun' function is used for the local prediction, so this is not really an AzureML issue; at most there is a bug in the 'fun' prediction function implementation.

Here is a better script:

library(gbm)

# Example from ?gbm:
set.seed(1)
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

  gbm1 <-
  gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
      data=data,                   # dataset
      var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                   # +1: monotone increase,
                                   #  0: no monotone restrictions
      distribution="gaussian",     # see the help for other choices
      n.trees=1000,                # number of trees
      shrinkage=0.05,              # shrinkage or learning rate,
                                   # 0.001 to 0.1 usually work
      interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
      train.fraction = 0.5,        # fraction of data for training,
                                   # first train.fraction*N used for training
      n.minobsinnode = 10,         # minimum total weight needed in each node
      cv.folds = 3,                # do 3-fold cross-validation
      keep.data=TRUE,              # keep a copy of the dataset with the object
      verbose=FALSE,               # don't print out progress
      n.cores=1)                   # use only a single core (detecting #cores is
                                   # error-prone, so avoided here)

  # check performance using 3-fold cross-validation
  best.iter <- gbm.perf(gbm1, method="cv", plot=FALSE)

# Record factor levels for posterity, see ?publishWebService...
data_levels <- data[1,]

# Publish this model as an AzureML web service
fun <- function(newdata)
{
  require(gbm)
  # code factor levels, see ?publishWebService
  newdata$X3 <- factor(newdata$X3, levels=data_levels$X3, ordered=TRUE)
  newdata$X4 <- factor(newdata$X4, levels=data_levels$X4)
  newdata$X5 <- factor(newdata$X5, levels=data_levels$X5)
  predict(gbm1, newdata, best.iter)
}
library(AzureML)
ws <- workspace()
ep <- publishWebService(ws, fun=fun, name="gbm", packages="gbm", inputSchema=data)

# Generate new data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE))
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)
data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# predict on the new data using "best" number of trees
# f.predict generally will be on the canonical scale (logit,log,etc.)
#f.predict <- predict(gbm1,data2,best.iter)
f.predict <- fun(data2)

# Use the AzureML web service to predict the same thing
a.predict <- consume(ep, data2)

# least squares error
print(crossprod(data2$Y-f.predict))
print(crossprod(data2$Y-a.predict$ans))

However, that does not mean that a version mismatch could not cause trouble in some packages, and users need to be aware of this. I think the concept of building a model in one version of R and then using it in another is flawed. It's a shame there isn't a way to build the model in AzureML and then publish it there...
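
One defensive pattern (a sketch, untested in AzureML): record the gbm version at training time in a variable that gets serialized with the function, and warn from inside the service when the remote version differs. Whether the warning actually surfaces through consume is an open question.

# Record the gbm version used for training; like gbm1 and best.iter,
# this value travels with the closure when the service is published.
trained_gbm <- as.character(packageVersion("gbm"))

fun_checked <- function(newdata)
{
  require(gbm)
  remote_gbm <- as.character(packageVersion("gbm"))
  if (!identical(remote_gbm, trained_gbm))
    warning("gbm version mismatch: trained with ", trained_gbm,
            ", predicting with ", remote_gbm)
  fun(newdata)  # delegate to the prediction function defined above
}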

bwlewis commented 8 years ago

OK, sorry folks. This issue was largely due to my buggy code: I simply forgot a call to levels(). See the revised, working code below.

However, the resulting caveats about package versioning are still important to consider.
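
To see why the original code misbehaved: data_levels$X3 is a length-one factor, so passing it directly as the levels argument supplies its single value rather than its full level set, and everything else is coerced to NA. A minimal illustration:

f <- ordered(sample(letters[1:4], 10, replace=TRUE), levels=letters[4:1])
one <- f[1]                     # a length-one factor, like data_levels$X3
factor(f, levels=one)           # wrong: one level survives, the rest are NA
factor(f, levels=levels(one))   # right: all four levels are preserved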

library(gbm)

# Example from ?gbm:
set.seed(1)
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

  gbm1 <-
  gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
      data=data,                   # dataset
      var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                   # +1: monotone increase,
                                   #  0: no monotone restrictions
      distribution="gaussian",     # see the help for other choices
      n.trees=1000,                # number of trees
      shrinkage=0.05,              # shrinkage or learning rate,
                                   # 0.001 to 0.1 usually work
      interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
      train.fraction = 0.5,        # fraction of data for training,
                                   # first train.fraction*N used for training
      n.minobsinnode = 10,         # minimum total weight needed in each node
      cv.folds = 3,                # do 3-fold cross-validation
      keep.data=TRUE,              # keep a copy of the dataset with the object
      verbose=FALSE,               # don't print out progress
      n.cores=1)                   # use only a single core (detecting #cores is
                                   # error-prone, so avoided here)

  # check performance using 3-fold cross-validation
  best.iter <- gbm.perf(gbm1, method="cv", plot=FALSE)

# Record factor levels for posterity, see ?publishWebService...
data_levels <- data[1,]

# Publish this model as an AzureML web service
fun <- function(newdata)
{
  require(gbm)
  # code factor levels, see ?publishWebService
  newdata$X3 <- factor(newdata$X3, levels=levels(data_levels$X3), ordered=TRUE)
  newdata$X4 <- factor(newdata$X4, levels=levels(data_levels$X4))
  newdata$X5 <- factor(newdata$X5, levels=levels(data_levels$X5))
  predict(gbm1, newdata, best.iter)
}
library(AzureML)
ws <- workspace()
ep <- publishWebService(ws, fun=fun, name="gbm", packages="gbm", inputSchema=data)

# Generate new data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE))
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)
data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# predict on the new data using "best" number of trees
# f.predict generally will be on the canonical scale (logit,log,etc.)
f.predict <- predict(gbm1,data2,best.iter)

# Use the AzureML web service to predict the same thing
a.predict <- consume(ep, data2)

# least squares error
print(crossprod(data2$Y-f.predict))
print(crossprod(data2$Y-a.predict$ans))

lixzhang commented 8 years ago

Nice catch about the missing levels() call.

Any insights about why my example code below does not run successfully at the consume() step? The error I got was: [screenshot of the error message]

#=========================================================
# set up 
#=========================================================
# Install devtools
if(!require("devtools")) install.packages("devtools")
devtools::install_github("RevolutionAnalytics/AzureML")
library(AzureML)

ws <- workspace()

library(MASS)

test <- Boston[1,]

#=========================================================
# gbm model
#=========================================================
# ensure results are repeatable
set.seed(123)
# load the library
library(gbm)
# fit the model
gbm1 <- gbm(medv ~ ., 
            distribution = "gaussian", 
            n.trees = 5000, 
            interaction.depth = 8, 
            n.minobsinnode = 1,
            shrinkage = 0.01, 
            cv.folds = 5,
            data = Boston)
print(gbm1) # print model
gbm.perf(gbm1) # check the optimal number of trees
summary(gbm1) # check variable importance

predict(gbm1, Boston[1,1:13])

# check performance using 5-fold cross-validation
best.iter <- gbm.perf(gbm1, method="cv", plot=FALSE)

#=========================================================
# web service
#=========================================================
# define function with gbm
mypredictnew <- function(crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat){
  newdata = data.frame(matrix(c(crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat), ncol = 13))
  names(newdata) = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat")
  res = predict(gbm1, newdata, best.iter)
  res
}

mypredictnew(0.00632, 18, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296, 15.3, 396.9, 4.98)

# Publish the service
ep <- publishWebService(ws = ws, fun = mypredictnew, name = "LixunWebService",
                        inputSchema = list(crim="numeric", zn="numeric", indus="numeric",
                                           chas="numeric", nox="numeric", rm="numeric", age="numeric", 
                                           dis="numeric", rad="numeric", tax="numeric", ptratio="numeric", 
                                           black="numeric", lstat="numeric"), 
                        outputSchema = list(z="numeric"),
                        packages="gbm")

# consume
results1 <- consume(ep, test)

bwlewis commented 8 years ago

A few things:

  1. mypredictnew is a function of 13 arguments, but test <- Boston[1,] has 14 columns. I think you mean test <- Boston[1, 1:13].
  2. You need a require(gbm) line inside the prediction function; just add it at the beginning of the function.

With those two changes it works; a minimally corrected version is sketched below. I understand that the documentation about item 2 above is not clear, and I am making some changes to make it more explicit.
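
For completeness, here is the minimal fix applied to the original example (a sketch that keeps the scalar-argument interface):

test <- Boston[1, 1:13]   # drop the 14th column (medv, the response)

mypredictnew <- function(crim, zn, indus, chas, nox, rm, age, dis,
                         rad, tax, ptratio, black, lstat){
  require(gbm)   # gbm must be loaded in the remote AzureML session
  newdata = data.frame(matrix(c(crim, zn, indus, chas, nox, rm, age, dis,
                                rad, tax, ptratio, black, lstat), ncol = 13))
  names(newdata) = c("crim", "zn", "indus", "chas", "nox", "rm", "age",
                     "dis", "rad", "tax", "ptratio", "black", "lstat")
  predict(gbm1, newdata, best.iter)
}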

The packages option is simply an explicit way to bundle packages with the service, along with their dependencies, using miniCRAN. Your function should still use require to load the packages it needs; for example, some packages are already available in AzureML and don't need to be bundled, but they should still be explicitly loaded by the function at the heart of the service.
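
For example, referring back to the gbm service earlier in this thread, bundling several packages is just a character vector (the second package here is purely illustrative):

# Bundle two packages (and their dependencies) with the service.
ep <- publishWebService(ws, fun=fun, name="gbm",
                        packages=c("gbm", "pls"),  # "pls" only for illustration
                        inputSchema=data)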

As it turns out, the gbm package and its dependencies are already available in AzureML, so we don't need to explicitly bundle it with miniCRAN. I am not sure whether there is a published list of exactly what is available in the AzureML R environment, but one of the publishWebService help examples lists all the packages that happen to be there.
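
The same trick used throughout this thread can recover that list directly (a sketch; the service name is arbitrary): publish a function whose only job is to report what is installed remotely.

# Return the names of all packages installed in the AzureML R environment.
pkg_fun <- function(newdata)
{
  paste(rownames(installed.packages()), collapse = ", ")
}
pep <- publishWebService(ws, fun = pkg_fun, name = "installed-packages",
                         inputSchema = data.frame(x = 1))
consume(pep, data.frame(x = 1))$ans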

Also, I think the data frame interface is more natural here than specifying each scalar variable, and it makes the schema specification much easier and less error-prone. Here is a modified example that works:

library(AzureML)
library(MASS)
library(gbm)

ws <- workspace()
test <- Boston[1, 1:13]

set.seed(123)
gbm1 <- gbm(medv ~ .,
            distribution = "gaussian",
            n.trees = 5000,
            interaction.depth = 8,
            n.minobsinnode = 1,
            shrinkage = 0.01,
            cv.folds = 5,
            data = Boston)
best.iter <- gbm.perf(gbm1, method="cv", plot=FALSE)

mypredictnew <- function(newdata)
{
  require(gbm)
  predict(gbm1, newdata, best.iter)
}
print(mypredictnew(test))

# Publish the service
ep <- publishWebService(ws = ws, fun = mypredictnew, name = "LixunWebService",
                        inputSchema = test)
# consume
results1 <- consume(ep, test)