GaryBAYLOR / R-code

A collection of algorithms written by myself for solving statistical problems

Good Bootstrap, Bad Bootstrap #7

Open GaryBAYLOR opened 9 years ago

GaryBAYLOR commented 9 years ago

Bootstrapping is a resampling method for estimating the sampling distribution of a statistic. It is useful when the distribution of a complicated statistic is hard to obtain in other ways. There are two flavours of the bootstrap: parametric and non-parametric (a minimal sketch contrasting the two is given after the list below). For an i.i.d. sample X1, X2, ..., Xn, how well the bootstrap works is influenced by:

(1) the bootstrapping method used: non-parametric or parametric;
(2) the distribution of the sample (kurtosis, variance, etc.);
(3) the distribution of the hypothesised population (kurtosis, variance, etc.).
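
To make the distinction concrete, here is a minimal sketch of the two resampling schemes for the sample mean (the names X, B, boot.np and boot.p are only illustrative and are not used in the function below):

set.seed(1)
X <- rnorm(30)        # observed i.i.d. sample
B <- 1000             # number of bootstrap replicates

# Non-parametric: resample the observed data with replacement
boot.np <- replicate(B, mean(sample(X, replace = TRUE)))

# Parametric: fit a model (here a normal, by MLE), then simulate from the fit
muhat  <- mean(X)
sdhat  <- sqrt(var(X) * (length(X) - 1) / length(X))
boot.p <- replicate(B, mean(rnorm(length(X), muhat, sdhat)))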

The following is an R program I wrote to compare the parametric and non-parametric bootstrap in recovering the distributions of the mean and variance of a sample from the standard normal distribution. An example run is

> bootstrap(n = 50, seed = 2015)

[screenshot of the four bootstrap panels omitted]

While some examples seem to show that the method works quite well, others are not so lucky:

> bootstrap(n = 50, seed = 1)

[screenshot of the four bootstrap panels omitted]

The original R code is as follows.

bootstrap <- function(B = 1000, n = 20, mu = 0, sigma = 1,
                      seed = round(runif(1, -1e4, 1e4))) {

    ##  initial sample
    set.seed(seed)
    X <- rnorm(n, mu, sigma)

    ##  parametric bootstrap
    #   use MLE to estimate the parameters of the population
    meanX <- mean(X)
    varX  <- var(X) * (n - 1) / n

    #   generate bootstrap samples (note: rnorm() expects a standard
    #   deviation, so take the square root of the estimated variance)
    meanx <- varx <- numeric(B)
    for (i in 1:B) {
        x <- rnorm(n, meanX, sqrt(varX))
        meanx[i] <- mean(x)
        varx[i]  <- var(x)
    }

    #   plot true and simulated distribution for the mean
    par(mfrow = c(2, 2))
    xlower <- mu - 3.5 * sigma / sqrt(n)
    xupper <- mu + 3.5 * sigma / sqrt(n)
    xi <- seq(xlower, xupper, length = 200)
    yi <- dnorm(xi, mu, sigma / sqrt(n))
    plot(xi, yi, type = "l", ylim = c(0, max(yi, density(meanx)$y)),
         xlab = "mean", ylab = "density", main = "Parametric bootstrap for \n mean")
    lines(density(meanx), col = "red")

    #   plot true and simulated distribution for the variance
    #   (for normal data the sample variance var(x) follows a
    #    Gamma((n-1)/2, rate = (n-1)/(2*sigma^2)) distribution)
    vlower <- qgamma(0.025, shape = (n - 1) / 2, rate = (n - 1) / (2 * sigma^2))
    vupper <- qgamma(0.975, shape = (n - 1) / 2, rate = (n - 1) / (2 * sigma^2))
    vi <- seq(min(quantile(varx, .025), vlower),
              max(vupper, quantile(varx, .975)), length = 200)
    wi <- dgamma(vi, shape = (n - 1) / 2, rate = (n - 1) / (2 * sigma^2))
    plot(vi, wi, type = "l", ylim = c(0, max(wi, density(varx)$y)),
         xlab = "variance", ylab = "density", main = "Parametric bootstrap for \n variance")
    lines(density(varx), col = "blue")

    ##  non-parametric bootstrap: resample the observed data with replacement
    meanx.n <- varx.n <- numeric(B)
    for (i in 1:B) {
        x <- sample(X, replace = TRUE)
        meanx.n[i] <- mean(x)
        varx.n[i]  <- var(x)
    }
    plot(xi, yi, type = "l", ylim = c(0, max(yi, density(meanx.n)$y)),
         xlab = "mean", ylab = "density", main = "Non-parametric bootstrap for \n mean")
    lines(density(meanx.n), col = "red")
    plot(vi, wi, type = "l", ylim = c(0, max(wi, density(varx.n)$y)),
         xlab = "variance", ylab = "density", main = "Non-parametric bootstrap for \n variance")
    lines(density(varx.n), col = "blue")

}
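
For reference, the two runs shown in the screenshots above correspond to calls such as:

> bootstrap(n = 50, seed = 2015)   # bootstrap densities track the true curves closely
> bootstrap(n = 50, seed = 1)      # a less favourable draw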

From the many examples I have tried, the parametric and non-parametric methods seem to give almost the same bootstrap distributions for the mean and variance of normal data. However, this is only for the normal distribution, which is symmetric; skewed distributions may give different results.
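
As a rough way to explore this, one could repeat the comparison on skewed data. The sketch below is my own illustration (not part of the function above), assuming an Exponential(1) sample and an exponential MLE fit for the parametric bootstrap; it compares the two bootstrap distributions of the sample variance:

set.seed(2015)
n <- 50; B <- 1000
X <- rexp(n, rate = 1)                                   # skewed data

ratehat <- 1 / mean(X)                                   # MLE of the exponential rate
var.p  <- replicate(B, var(rexp(n, rate = ratehat)))     # parametric bootstrap
var.np <- replicate(B, var(sample(X, replace = TRUE)))   # non-parametric bootstrap

# compare the two bootstrap densities of the sample variance
d.p  <- density(var.p)
d.np <- density(var.np)
plot(d.p, col = "blue", ylim = c(0, max(d.p$y, d.np$y)),
     xlab = "variance", main = "Bootstrap distributions of var(X), \n exponential data")
lines(d.np, col = "red")
legend("topright", c("parametric", "non-parametric"), col = c("blue", "red"), lty = 1)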