Fix bootstrapping code - Githubissues

The bootstrapping code isn't quite working as I expected.

It is not appropriate to simply bootstrap the data frame. To see why, the row in the data frame:

count coverage        mutation
    5      292 aa:orf1a:L1559F
    2      414 aa:orf1a:M3655I

actually comes from something like:

mutation         present
aa:orf1a:L1559F     TRUE # 5 TRUEs
aa:orf1a:L1559F     TRUE
aa:orf1a:L1559F     TRUE
aa:orf1a:L1559F     TRUE
aa:orf1a:L1559F     TRUE
aa:orf1a:L1559F    FALSE # 287 FALSEs
aa:orf1a:L1559F    FALSE
aa:orf1a:L1559F    FALSE
aa:orf1a:L1559F    FALSE
            ...      ...
aa:orf1a:M3655I     TRUE # 4 TRUEs
aa:orf1a:M3655I     TRUE
aa:orf1a:M3655I     TRUE
aa:orf1a:M3655I     TRUE
aa:orf1a:M3655I    FALSE # 410 FALSEs
            ...      ...

Where there were 5 observations of the mutation and 292-5=287 observations without the mutation. This is the version of the data we want to bootstrap, but it's computationally dumb to convert each row into something this big.

In the current code, it assumes that the coverage is fixed and that the count should have a probability of 5/292. This is a binomial distribution, so it randomly samples from a binomial distribution. The coverage is randomly sampled from a Poisson distribution centred at the observed coverage.

We want to randomly sample the count and coverage as if we're taking random rows from the second data frame. In the example above, we have a df with 292+414=706 rows. Randomly re-sampling from these data would always result in 706 rows, but the current implementation does not.

The correct bootstrapping method should either (A) sample randomly from the rows efficiently or (B) apply a probability distribution with the same statistical properties as (A).

Please comment on this issue with some suggestions so we can plan this out together.

Here's some code to demonstrate how bootstrapping should work:

sample_df = data.frame(count = c(2, 3, 100),
    coverage = c(4, 5, 1000), 
    mut = LETTERS[1:3])
sample_df

long_df <- lapply(X = unique(sample_df$mut), 
    FUN = function(x) {
        counts <- sample_df$count[sample_df$mut == x]
        covers <- sample_df$coverage[sample_df$mut == x]
        long_df <- data.frame(mut = x, 
            present = c(rep("yes", counts), 
                rep("no", covers - counts)
            )
        )
        return(long_df)
    }) |> 
    do.call(what = rbind, args = _)
head(long_df, 12)

counts_coverage <- function(df) {
    lapply(unique(df$mut), function(x) {
        count <- sum(df$present[df$mut == x] == "yes")
        coverage <- sum(df$mut == x)
        data.frame(count = count, coverage = coverage, mut = x)
    }) |> do.call(what = rbind, args = _)
}

# Sanity check
all.equal(sample_df, counts_coverage(long_df))

# Bootstrapping properties
boot_samples <- lapply(X = 1:1000, FUN = function(x) {
    boot <- long_df[sample(1:nrow(long_df), nrow(long_df), TRUE),]
    boot <- counts_coverage(boot)
    boot$iter <- x
    boot
}) |> do.call(what = rbind, args = _)
head(boot_samples, 15)

It would be nice if we could, say, sample everything from the Poisson distribution, but there are two issues:

The count has to be less than the coverage.
- Sampling coverage from Poisson and counts from Binomial(count, frequency) fixes this problem.
We need to make sure that the Poisson distribution actually works!!!

For the example above, the Poisson does not work:

par(mfrow = c(2, 3))
for (col in c("count", "coverage")) {
    for (mut in sort(unique(boot_samples$mut))) {
        colmut <- boot_samples[boot_samples$mut == mut, col]
        x_vals <- sort(unique(colmut))
        hist(colmut,
            breaks = seq(min(x_vals) - 0.5, max(x_vals) + 0.5, 1),
            main = paste(col, mut, collapse = ", "))

        true_mean = sample_df[sample_df$mut == mut, col]
        pois_est <- dpois(x_vals, lambda = true_mean) * 1000
        lines(x_vals - 0.5, pois_est, type = "s", col = "red")
        lines(x_vals - 0.5, pois_est, type = "h", col = "red")
    }
}

In particular, a Poisson distribution does not describe "Coverage C", the coverage for mutation C. From testing, the reason for this is that the Poisson's variance is much too large (recall that the variance is equal to the mean for the Poisson distribution, which in this example is 1000).

I think the correct approach is to use a multinomial distribution to determine the coverages, then the binomial for each mutation (I checked others' code, and this is indeed what they do.

As a sanity check (using the same data frame as above):


head(sample_df)
get_bootstrap_df <- function(df, n_resamples = 100) {
    new_coverage <- rmultinom(n_resamples, 
            size = sum(df$coverage),
            prob = df$coverage / sum(df$coverage)) |>
        as.numeric()
    new_counts <- rbinom(length(new_coverage),
            size = new_coverage,
            prob = rep(df$count/df$coverage, n_resamples))
    return(data.frame(count = new_counts,
        coverage = new_coverage,
        mut = rep(df$mu, n_resamples),
        iteration = rep(1:n_resamples, each = nrow(df))))
}

bs1 <- get_bootstrap_df(sample_df, 1000)

par(mfrow = c(2, 3))
for (col in c("count", "coverage")) {
    for (mut in sort(unique(boot_samples$mut))) {
        colmut <- boot_samples[boot_samples$mut == mut, col]
        x_vals <- sort(unique(colmut))
        hist(colmut,
            breaks = seq(min(x_vals) - 0.5, max(x_vals) + 0.5, 1),
            col = rgb(0, 0, 0, 0.2),
            main = paste(col, mut, collapse = ", "))

        hist(bs1[bs1$mut == mut, col],
            breaks = seq(min(unique(bs1$coverage)) - 0.5, 
                max(unique(bs1$coverage)) + 0.5, 
                1),
            col = rgb(1, 0, 0, 0.2),
            add = TRUE)
    }
}

And yes, they match!!!

So the next step is to implement this bootstrapping approach and make it work with theprovoc function and return a bootstrap CI. I set it up so that the iteration is included, so we can use the by argument in @danerkestey's update to the function.

DASL-Lab / provoc

Fix bootstrapping code #11