GaryBAYLOR / R-code

A collection of algorithms I wrote for solving statistical problems

SVM and logistic regression in simple case #15

Open GaryBAYLOR opened 8 years ago

GaryBAYLOR commented 8 years ago

We compare the performance of SVM and logistic regression on samples generated with the mvrnorm function from the MASS package in R. The R code follows.

library(MASS)    # for mvrnorm()
library(e1071)   # for svm() and tune()

##  generate data
mu1 <- c(0, 0, 0)
mu2 <- c(2.5,2.5, 2.5)
Sigma1 <- matrix(c(2, 1, 1, 1, 2, 1, 1, 1, 3), nrow = 3)
Sigma2 <- matrix(c(3, 1, 1, 1, 3, 1, 1, 1, 2), nrow = 3)
data1 <- mvrnorm(1000, mu1, Sigma1)
data2 <- mvrnorm(1000, mu2, Sigma2)
y <- rep(c(1, 0), each = 1000)
data <- cbind(rbind(data1, data2),y)
data <- as.data.frame(data)
names(data) <- c("x1", "x2", "x3", "y")

##  glm model
model <- glm(as.factor(y) ~ x1 + x2 + x3, data = data, family = "binomial")
pred <- predict(model, newdata = data, type = "response")  # predict() takes newdata, not data
pred <- as.numeric(pred >= .5)
# ggplot(data = data, aes(x = x1, y = x2)) + geom_point(aes(color = as.factor(pred)))

## svm model
## note: epsilon only affects eps-regression; for classification,
## gamma and cost are the usual parameters to tune
tuneResult <- tune(svm, as.factor(y) ~ x1 + x2 + x3, data = data,
                   ranges = list(epsilon = seq(.1, .9, .2), cost = 1:5)
                   )
pred.svm <- predict(tuneResult$best.model, newdata = data)

##  compare two models
xtabs(~y + pred)
xtabs(~y + pred.svm)
error.glm <- sum(y != pred)/nrow(data)
error.svm <- sum(y != pred.svm)/nrow(data)
paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
paste("The error rate of svm is: %", error.svm * 100, sep = "")
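Note that mvrnorm draws fresh samples on every run, which is why the two result sets below differ. A set.seed call at the top of the script makes a run reproducible; a minimal sketch (the seed value 2016 is arbitrary):

```r
library(MASS)

set.seed(2016)  # fix the RNG state before sampling
a <- mvrnorm(5, c(0, 0, 0), diag(3))
set.seed(2016)  # reset to the same state
b <- mvrnorm(5, c(0, 0, 0), diag(3))
identical(a, b)  # TRUE: same seed, same draws
```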

The performance is as follows.

> xtabs(~y + pred)
   pred
y     0   1
  0 847 153
  1 146 854
> xtabs(~y + pred.svm)
   pred.svm
y     0   1
  0 865 135
  1 129 871
> paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
[1] "The error rate of logistic regression is: %14.95"
> paste("The error rate of svm is: %", error.svm * 100, sep = "")
[1] "The error rate of svm is: %13.2"
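The printed error rates can be checked directly against the confusion tables: the error rate is the sum of the off-diagonal counts divided by the total. A sketch using the glm table above:

```r
# Rebuild the glm confusion table above (matrix fills column-wise)
tab.glm <- matrix(c(847, 146, 153, 854), nrow = 2,
                  dimnames = list(y = c(0, 1), pred = c(0, 1)))
# misclassified = off-diagonal counts; total = 2000 observations
error.glm <- (tab.glm[1, 2] + tab.glm[2, 1]) / sum(tab.glm)
error.glm * 100  # 14.95, matching the printed rate
```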

Results from a second simulation run:

> xtabs(~y + pred)
   pred
y     0   1
  0 830 170
  1 170 830
> xtabs(~y + pred.svm)
   pred.svm
y     0   1
  0 816 184
  1 124 876
> paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
[1] "The error rate of logistic regression is: %17"
> paste("The error rate of svm is: %", error.svm * 100, sep = "")
[1] "The error rate of svm is: %15.4"

My findings are:

In this simple setting the two methods are almost equivalent. As the data complexity increases, SVM becomes slightly better than logistic regression in many cases, but SVM also takes much more time to train than logistic regression.
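The timing claim can be checked with base R's system.time; a minimal sketch on the same kind of simulated data (exact timings depend on the machine, but svm is typically the slower fit here):

```r
library(MASS)   # mvrnorm()
library(e1071)  # svm()

set.seed(1)
d1 <- mvrnorm(1000, c(0, 0, 0), diag(3))
d2 <- mvrnorm(1000, c(2.5, 2.5, 2.5), diag(3))
dat <- data.frame(rbind(d1, d2), y = factor(rep(c(1, 0), each = 1000)))
names(dat)[1:3] <- c("x1", "x2", "x3")

# elapsed time for a single fit of each model
t.glm <- system.time(glm(y ~ x1 + x2 + x3, data = dat, family = "binomial"))
t.svm <- system.time(svm(y ~ x1 + x2 + x3, data = dat))
t.glm["elapsed"]
t.svm["elapsed"]
```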