GaryBAYLOR opened 7 years ago
Adding the comparison to random forest:
> nb_svm_rf(iris, B = 1000)
     naiveBayes        swm         rf
mean 0.04672000 0.04174000 0.04910000
std  0.02628179 0.02412373 0.02582745

> microbenchmark(naiveBayes(iris[, 1:4], iris[, 5]), svm(iris[, 1:4], iris[, 5]), randomForest(iris[, 1:4], iris[, 5]), times = 500)
Unit: milliseconds
                                 expr       min        lq      mean    median        uq      max neval
   naiveBayes(iris[, 1:4], iris[, 5])  1.158110  1.267659  1.512192  1.390931  1.535771 24.88096   500
          svm(iris[, 1:4], iris[, 5])  2.789305  3.001171  3.470282  3.239149  3.517028 26.12542   500
 randomForest(iris[, 1:4], iris[, 5]) 24.747243 28.424066 31.075315 29.865686 31.300854 64.02756   500
We see that the error rate of random forest is slightly worse than that of naive Bayes. Worse still, random forest is about 9 times slower than SVM and about 20 times slower than naive Bayes.
library(e1071)          # provides naiveBayes() and svm()
library(randomForest)

nb_svm_rf <- function(data, B = 100, training.p = 2/3) {
  n <- nrow(data)
  err1 <- err2 <- err3 <- numeric(B)
  for (i in 1:B) {
    # Random training/testing split
    index <- sample(n, floor(n * training.p))
    training <- data[index, ]
    testing  <- data[-index, ]
    # Naive Bayes
    mod1 <- naiveBayes(training[, 1:4], training[, 5])
    pred1 <- predict(mod1, testing[, 1:4])
    err1[i] <- mean(pred1 != testing[, 5])
    # SVM
    mod2 <- svm(training[, 1:4], training[, 5])
    pred2 <- predict(mod2, testing[, 1:4])
    err2[i] <- mean(pred2 != testing[, 5])
    # Random forest
    mod3 <- randomForest(training[, 1:4], training[, 5])
    pred3 <- predict(mod3, testing[, 1:4])
    err3[i] <- mean(pred3 != testing[, 5])
  }
  # Density plot of the three error distributions
  plot(density(err1))
  lines(density(err2), col = "red")
  lines(density(err3), col = "blue")
  legend("topright", legend = c("naiveBayes", "svm", "randomForest"),
         lty = 1, col = c("black", "red", "blue"))
  res <- data.frame(naiveBayes = c(mean(err1), sd(err1)),
                    svm        = c(mean(err2), sd(err2)),
                    rf         = c(mean(err3), sd(err3)))
  rownames(res) <- c("mean", "std")
  res
}
Both naive Bayes and the support vector machine (SVM) can be used for classification. We compare their performance on the classical iris data set.

Accuracy
We conduct 1000 simulations. For each simulation we divide the data into a training set and a testing set, fit the model on the training set, and make predictions for the testing set, from which the classification error is calculated. After all 1000 simulations, the mean and standard deviation of the classification error are reported.
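A single iteration of this procedure can be sketched as follows (a minimal sketch using the e1071 package and naive Bayes only; the 2/3 split proportion matches the appendix code, and the seed is arbitrary):

```r
library(e1071)  # provides naiveBayes()

set.seed(1)
n <- nrow(iris)
index <- sample(n, floor(n * 2/3))  # random 2/3 training split
training <- iris[index, ]
testing  <- iris[-index, ]

# Fit on the training set, predict on the held-out testing set
mod  <- naiveBayes(training[, 1:4], training[, 5])
pred <- predict(mod, testing[, 1:4])

# Classification error = fraction of misclassified test cases
err <- mean(pred != testing[, 5])
err
```

Repeating this B times and averaging `err` gives the reported mean error.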
From the simulations we find that SVM produces a 4.2% classification error on average, slightly lower than the 4.7% of naive Bayes.
Computation time
Across 1000 runs, we find that naiveBayes needs only about 44% of the time required by SVM.
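The timing comparison can be reproduced with the microbenchmark package (a sketch; absolute numbers depend on hardware, so only the ratio between the two rows is meaningful):

```r
library(e1071)           # naiveBayes(), svm()
library(microbenchmark)  # timing harness

# Time both model fits on the full iris data, 500 repetitions each
bm <- microbenchmark(
  naiveBayes(iris[, 1:4], iris[, 5]),
  svm(iris[, 1:4], iris[, 5]),
  times = 500
)
summary(bm)  # compare the mean/median columns of the two rows
```

The ratio of the two median times gives the relative cost of naive Bayes versus SVM.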
Therefore, SVM is more accurate than naive Bayes on this data, but needs more computation time. This result was obtained only for the iris data set, which contains 150 observations on 4 predictors. For larger data sets, we expect naiveBayes to be even faster relative to SVM.

Appendix: R code