GaryBAYLOR / data_mining


Comparing Naive Bayes and SVM for iris dataset #1

Open GaryBAYLOR opened 7 years ago

GaryBAYLOR commented 7 years ago

Both naive Bayes and support vector machines (SVM) can be used for classification. We compare their performance on the classic iris data set.

Accuracy

We conduct 1000 simulations. For each simulation we divide the data into a training set and a testing set, fit the model on the training set, make predictions on the testing set, and record the classification error. After all 1000 simulations, the mean and standard deviation of the classification error are reported.
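The classification error for one split is simply the fraction of misclassified test cases, as computed in the appendix code. A tiny base-R illustration with made-up labels:

```r
## hypothetical predicted and true labels for four test cases
pred  <- c("setosa", "versicolor", "virginica", "setosa")
truth <- c("setosa", "virginica",  "virginica", "setosa")

## classification error = fraction of mismatches
mean(pred != truth)   # 0.25 (one of four is wrong)
```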

> nb_svm(iris, B = 1000)
     naiveBayes        svm
mean 0.04688000 0.04222000
std  0.02563831 0.02465927

From the simulation we find that svm produces a 4.2% classification error on average, slightly less than the 4.7% of naive Bayes.
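Whether this gap is meaningful can be gauged from the standard error of the simulated means. A rough base-R check using only the means and standard deviations reported above (a paired comparison of the per-split errors would be more precise, since both models are evaluated on the same splits):

```r
## standard error of each mean error rate over B = 1000 simulations
B <- 1000
se_nb  <- 0.02563831 / sqrt(B)
se_svm <- 0.02465927 / sqrt(B)

## gap between the two mean error rates
gap <- 0.04688 - 0.04222

## the gap is roughly four combined standard errors wide
gap / sqrt(se_nb^2 + se_svm^2)   # ≈ 4.1
```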

Computation time

library(microbenchmark)
> microbenchmark(naiveBayes(iris[,1:4], iris[, 5]), svm(iris[,1:4], iris[, 5]), times = 1000)
Unit: milliseconds
                                expr      min       lq     mean   median       uq      max neval
 naiveBayes(iris[, 1:4], iris[, 5]) 1.302364 1.432913 1.554624 1.492442 1.575168 16.33110  1000
        svm(iris[, 1:4], iris[, 5]) 3.105816 3.317584 3.557402 3.433982 3.572570 22.91197  1000

Across 1000 runs, naiveBayes needs only about 44% of the time needed by svm.
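The 44% figure follows directly from the mean fit times in the microbenchmark table above:

```r
## mean fit times in milliseconds, taken from the microbenchmark output
nb_mean  <- 1.554624
svm_mean <- 3.557402

## naive Bayes uses this fraction of the SVM fit time
nb_mean / svm_mean   # ≈ 0.437, i.e. about 44%
```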

Therefore, SVM is slightly more accurate than naive Bayes on this data, but needs more computation time. This result is obtained only for the iris data set, which contains 150 observations on 4 predictors. For larger data sets we expect naiveBayes to be even faster relative to SVM, since naive Bayes training scales roughly linearly in the number of observations while SVM training scales worse.

Appendix: R code

## 1. load the library
library(e1071)

## 2. divide the data into a training set and a testing set
n <- nrow(iris)
index <- sample(n, n * 2 / 3)
training <- iris[index, ]
testing <- iris[-index, ]

## 3. naive Bayes model
mod1 <- naiveBayes(training[, 1:4], training[, 5])
pred1 <- predict(mod1, testing[, 1:4])
err1 <- mean(pred1 != testing[, 5])

## 4. SVM
mod2 <- svm(training[, 1:4], training[, 5])
pred2 <- predict(mod2, testing[, 1:4])
err2 <- mean(pred2 != testing[, 5])

## 5. Monte Carlo simulation
nb_svm <- function(data, B = 100, training.p = 2/3) {
    ## assumes the predictors are in columns 1-4 and the class label in column 5
    n <- nrow(data)
    err1 <- err2 <- numeric(B)
    for(i in 1:B) {
        index <- sample(n, n * training.p)
        training <- data[index, ]
        testing <- data[-index, ]
        mod1 <- naiveBayes(training[, 1:4], training[, 5])
        pred1 <- predict(mod1, testing[, 1:4])
        err1[i] <- mean(pred1 != testing[, 5])
        mod2 <- svm(training[, 1:4], training[, 5])
        pred2 <- predict(mod2, testing[, 1:4])
        err2[i] <- mean(pred2 != testing[, 5])
    }
    plot(density(err1))
    lines(density(err2), col = "red")
    legend("topright", legend = c("naiveBayes", "svm"), lty = c(1, 1), col = c("black", "red"))
    res1 <- c(mean(err1), sd(err1))
    res2 <- c(mean(err2), sd(err2))
    res <- data.frame(naiveBayes = res1, svm = res2)
    rownames(res) <- c("mean", "std")
    res
}
GaryBAYLOR commented 7 years ago

Added a comparison with random forest.

> nb_svm_rf(iris, B = 1000)
     naiveBayes        svm         rf
mean 0.04672000 0.04174000 0.04910000
std  0.02628179 0.02412373 0.02582745

> microbenchmark(naiveBayes(iris[,1:4], iris[, 5]), svm(iris[,1:4], iris[, 5]), randomForest(iris[,1:4], iris[, 5]), times = 500)
Unit: milliseconds
                                 expr       min        lq      mean    median        uq      max neval
   naiveBayes(iris[, 1:4], iris[, 5])  1.158110  1.267659  1.512192  1.390931  1.535771 24.88096   500
          svm(iris[, 1:4], iris[, 5])  2.789305  3.001171  3.470282  3.239149  3.517028 26.12542   500
 randomForest(iris[, 1:4], iris[, 5]) 24.747243 28.424066 31.075315 29.865686 31.300854 64.02756   500

We see that the error rate of random forest is slightly worse than that of naive Bayes. Worse still, randomForest is about 9 times slower than svm and about 20 times slower than naiveBayes.
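The slowdown factors quoted above follow from the mean fit times in the microbenchmark table:

```r
## mean fit times in milliseconds, taken from the microbenchmark output
nb  <- 1.512192
svm <- 3.470282
rf  <- 31.075315

## random forest relative to the other two models
rf / svm   # ≈ 9.0
rf / nb    # ≈ 20.5
```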

library(randomForest)
nb_svm_rf <- function(data, B = 100, training.p = 2/3) {
    ## assumes the predictors are in columns 1-4 and the class label in column 5
    n <- nrow(data)
    err1 <- err2 <- err3 <- numeric(B)
    for(i in 1:B) {
        index <- sample(n, n * training.p)
        training <- data[index, ]
        testing <- data[-index, ]
        mod1 <- naiveBayes(training[, 1:4], training[, 5])
        pred1 <- predict(mod1, testing[, 1:4])
        err1[i] <- mean(pred1 != testing[, 5])
        mod2 <- svm(training[, 1:4], training[, 5])
        pred2 <- predict(mod2, testing[, 1:4])
        err2[i] <- mean(pred2 != testing[, 5])
        mod3 <- randomForest(training[, 1:4], training[, 5])
        pred3 <- predict(mod3, testing[, 1:4])
        err3[i] <- mean(pred3 != testing[, 5])
    }
    plot(density(err1))
    lines(density(err2), col = "red")
    lines(density(err3), col = "blue")
    legend("topright", legend = c("naiveBayes", "svm", "random Forest"), lty = c(1, 1, 1), col = c("black", "red", "blue"))
    res1 <- c(mean(err1), sd(err1))
    res2 <- c(mean(err2), sd(err2))
    res3 <- c(mean(err3), sd(err3))
    res <- data.frame(naiveBayes = res1, svm = res2, rf = res3)
    rownames(res) <- c("mean", "std")
    res
}