GaryBAYLOR / R-code

A collection of algorithms written by myself for solving statistical problems

Good Bootstrap, Bad Bootstrap #7

Open GaryBAYLOR opened 9 years ago

GaryBAYLOR commented 9 years ago

Bootstrapping is a resampling method for estimating the sampling distribution of a statistic. It is useful when the distribution of a complicated statistic is hard to obtain in other ways. There are two flavours of the bootstrap: parametric and non-parametric (a minimal sketch contrasting the two is given after the list below). For an i.i.d. sample X1, X2, ..., Xn, how well the bootstrap works is influenced by:

(1) the bootstrapping method used: non-parametric or parametric;
(2) the distribution of the sample (kurtosis, variance, etc.);
(3) the distribution of the hypothesised population (kurtosis, variance, etc.).
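
To make the distinction concrete, here is a minimal sketch of the two resampling schemes for the sample mean (the names X, B, boot.np and boot.p are only illustrative and are not used in the function below):

set.seed(1)
X <- rnorm(30)        # observed i.i.d. sample
B <- 1000             # number of bootstrap replicates

# Non-parametric: resample the observed data with replacement
boot.np <- replicate(B, mean(sample(X, replace = TRUE)))

# Parametric: fit a model (here a normal, by MLE), then simulate from the fit
muhat  <- mean(X)
sdhat  <- sqrt(var(X) * (length(X) - 1) / length(X))
boot.p <- replicate(B, mean(rnorm(length(X), muhat, sdhat)))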

The following is an R program I wrote to compare the parametric and non-parametric bootstrap in recovering the distributions of the mean and variance of a sample from the standard normal distribution. An example run is

> bootstrap(n = 50, seed = 2015)

[screenshot of the four bootstrap panels omitted]

While some examples seem to show that the method works quite well, others are not so lucky:

> bootstrap(n = 50, seed = 1)

[screenshot of the four bootstrap panels omitted]

The original R code is as follows.

bootstrap <- function(B = 1000, n = 20, mu = 0, sigma = 1,
                      seed = round(runif(1, -1e4, 1e4))) {

    ##  initial sample
    set.seed(seed)
    X <- rnorm(n, mu, sigma)

    ##  parametric bootstrap
    #   use MLE to estimate the parameters of the population
    meanX <- mean(X)
    varX  <- var(X) * (n - 1) / n

    #   generate bootstrap samples (note: rnorm() expects a standard
    #   deviation, so take the square root of the estimated variance)
    meanx <- varx <- numeric(B)
    for (i in 1:B) {
        x <- rnorm(n, meanX, sqrt(varX))
        meanx[i] <- mean(x)
        varx[i]  <- var(x)
    }

    #   plot true and simulated distribution for the mean
    par(mfrow = c(2, 2))
    xlower <- mu - 3.5 * sigma / sqrt(n)
    xupper <- mu + 3.5 * sigma / sqrt(n)
    xi <- seq(xlower, xupper, length = 200)
    yi <- dnorm(xi, mu, sigma / sqrt(n))
    plot(xi, yi, type = "l", ylim = c(0, max(yi, density(meanx)$y)),
         xlab = "mean", ylab = "density", main = "Parametric bootstrap for \n mean")
    lines(density(meanx), col = "red")

    #   plot true and simulated distribution for the variance
    #   (for normal data the sample variance var(x) follows a
    #    Gamma((n-1)/2, rate = (n-1)/(2*sigma^2)) distribution)
    vlower <- qgamma(0.025, shape = (n - 1) / 2, rate = (n - 1) / (2 * sigma^2))
    vupper <- qgamma(0.975, shape = (n - 1) / 2, rate = (n - 1) / (2 * sigma^2))
    vi <- seq(min(quantile(varx, .025), vlower),
              max(vupper, quantile(varx, .975)), length = 200)
    wi <- dgamma(vi, shape = (n - 1) / 2, rate = (n - 1) / (2 * sigma^2))
    plot(vi, wi, type = "l", ylim = c(0, max(wi, density(varx)$y)),
         xlab = "variance", ylab = "density", main = "Parametric bootstrap for \n variance")
    lines(density(varx), col = "blue")

    ##  non-parametric bootstrap: resample the observed data with replacement
    meanx.n <- varx.n <- numeric(B)
    for (i in 1:B) {
        x <- sample(X, replace = TRUE)
        meanx.n[i] <- mean(x)
        varx.n[i]  <- var(x)
    }
    plot(xi, yi, type = "l", ylim = c(0, max(yi, density(meanx.n)$y)),
         xlab = "mean", ylab = "density", main = "Non-parametric bootstrap for \n mean")
    lines(density(meanx.n), col = "red")
    plot(vi, wi, type = "l", ylim = c(0, max(wi, density(varx.n)$y)),
         xlab = "variance", ylab = "density", main = "Non-parametric bootstrap for \n variance")
    lines(density(varx.n), col = "blue")

}
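
For reference, the two runs shown in the screenshots above correspond to calls such as:

> bootstrap(n = 50, seed = 2015)   # bootstrap densities track the true curves closely
> bootstrap(n = 50, seed = 1)      # a less favourable draw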

From the many examples I have tried, the parametric and non-parametric methods seem to give almost the same bootstrap distributions for the mean and variance of normal data. However, this is only for the normal distribution, which is symmetric; skewed distributions may give different results.
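
As a rough way to explore this, one could repeat the comparison on skewed data. The sketch below is my own illustration (not part of the function above), assuming an Exponential(1) sample and an exponential MLE fit for the parametric bootstrap; it compares the two bootstrap distributions of the sample variance:

set.seed(2015)
n <- 50; B <- 1000
X <- rexp(n, rate = 1)                                   # skewed data

ratehat <- 1 / mean(X)                                   # MLE of the exponential rate
var.p  <- replicate(B, var(rexp(n, rate = ratehat)))     # parametric bootstrap
var.np <- replicate(B, var(sample(X, replace = TRUE)))   # non-parametric bootstrap

# compare the two bootstrap densities of the sample variance
d.p  <- density(var.p)
d.np <- density(var.np)
plot(d.p, col = "blue", ylim = c(0, max(d.p$y, d.np$y)),
     xlab = "variance", main = "Bootstrap distributions of var(X), \n exponential data")
lines(d.np, col = "red")
legend("topright", c("parametric", "non-parametric"), col = c("blue", "red"), lty = 1)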