Bootstrap draws of each iteration in `invertci`

conroylau commented 4 years ago

In the current design of invertci, it is calling the testing procedure in each iteration by do.call. Although the user will set the seed before calling invertci, I think each iteration in the bisection method is using different bootstrap data to compute the p-value. Hence, the result obtained might be inaccurate. My solution to the problem is as follows:

Check if there is any components being a function inside lpmodel.
If yes, create R bootstrap replications for the stochastic component that is a function.
Save these bootstrap replications as a list and replace the corresponding function in the lpmodel object by the new list.
If the component in the lpmodel object is a list or is deterministic, then there is nothing to do.
In each iteration of the bisection method, we will be using the same bootstrap components of the lpmodel object via the new list.

A potential problem that I can think of is:

if I am evaluating the stochastic component(s) of the lpmodel in advance, then I do not know whether there will be some problematic bootstrap draws.
For instance, there could be some bootstrap replications of beta.obs that is returning an error in the optimization problems.

Hence, I am thinking instead of evaluating all beta.obs before running the procedure, I can return the non-problematic bootstrap beta.obs draws from the testing procedures. In this way, I can collect them after running the procedure for the first time and create a list of bootstrap beta.obs. The list of beta.obs returned should also be non-problematic for the subsequent iterations in the invertci procedure because the only thing that changes is beta.tgt. As long as the beta.tgt quantity is within the logical bounds, there should not be any problems.

I am also thinking of adding an option like return.stochastic.lpm in the procedures, where setting the option as TRUE means returning the list of non-problematic bootstrap draws for the component in the lpmodel object that is a function. This is because the list returned can be large and use up a lot of memory, so I will set it as FALSE as default.

a-torgovitsky commented 4 years ago

This is an interesting point. Your solution makes sense, but I think it's too elaborate and will cause difficulty down the line. In particular, I am not sure this is true!

The list of beta.obs returned should also be non-problematic for the subsequent iterations in the invertci procedure because the only thing that changes is beta.tgt. As long as the beta.tgt quantity is within the logical bounds, there should not be any problems.

In testing the fsst procedure for example I have found that sometimes a draw that is problematic under one value of beta.tgt is not problematic under another. This is not unexpected since the linear program is all one piece. Who knows what Gurobi is doing to it under the hood.

Generally we want to avoid setting the seed ourselves. But maybe this is one case where it is necessary?

A simple solution to the problem would be to add an optional parameter seed = NULL to invertci.

If it's NULL then we don't do anything and just live with the problem that the CI might be strange due to randomness in the bootstrap draws. If it's an integer, then we call set.seed(seed) before each iteration of do.call. This will ensure the same bootstrap draws are used each iteration.

What do you think?

conroylau commented 4 years ago

I see. I thought that the problem always come from the other constraints (such as the ones corresponding to beta.obs). Hence, I have made that statement in the post. So I agree it is actually not a good idea to evaluate all the stochastic components in advance because it might create another source of error if some beta.tgt are causing troubles.

Actually, I find a better solution to solve the problem:

Get the RNG state set by the user by .Random.seed in the beginning of the invertci procedure.
Before each iteration of do.call, set the seed as the RNG state.
With the same RNG state, we will be using the same bootstrap data in each iteration.

This is illustrated by the following:

set.seed(1)
a <- .Random.seed
set.seed(a)
runif(5)
#> [1]  0.7330424 0.8172787 0.8308336 0.6205377 0.1086965
set.seed(a)
runif(5)
#> [1]. 0.7330424 0.8172787 0.8308336 0.6205377 0.1086965

It seems to also work well when future_lapply is used:

set.seed(a)
unlist(future.apply::future_lapply(1:2, FUN = runif, future.seed = TRUE))
#> [1] 0.03632143 0.70247464 0.72824333

set.seed(a)
unlist(future.apply::future_lapply(1:2, FUN = runif, future.seed = TRUE))
#> [1] 0.03632143 0.70247464 0.72824333

In this way, we can avoid asking the seed from the user as in the other procedures. Do you think this is fine? Thanks!

a-torgovitsky commented 4 years ago

I think that's a good solution as long as it is safe and doesn't have any unintended consequences.

So if the user does

runif(1)
invertci(...)
runif(1) # this should be different than the first one

and

set.seed(1)
runif(1)
invertci(...)
set.seed(1)
runif(1) # this should be the same as the first one

We had a bad experience with seeds in ivmte https://github.com/jkcshea/ivmte/issues/187 https://github.com/jkcshea/ivmte/pull/188

So I want to make sure we don't have something similar going on here

conroylau commented 4 years ago

I see. In the following sample code, the behavior is aligned with what you expect above. In the invertci_toy function below, I include the procedure of extracting the seed and setting the seed before each iteration, just like what I intend to do in the invertci function.

invertci_toy <- function(iter) {
  # Obtain the RNG state in the beginning
  s <- .Random.seed
  x <- NULL
  # Iterate (like the bisection method)
  for (i in 1:iter) {
    # Set seed before each iteration
    assign(x = ".Random.seed", value = s, envir = .GlobalEnv)
    x <- c(x, runif(1))
  }
  print(x)
}

Then, the following code:

print(runif(1))
invertci_toy(5)
print(runif(1))

gives the following output, where

the first line is different from the third line,
every random numbers on the second line are the same (representing the same bootstrap data is used in invertci).
```
[1] 0.3721239
[1] 0.5728534 0.5728534 0.5728534 0.5728534 0.5728534
[1] 0.9082078
```

On the other hand, the following code:

set.seed(1)
print(runif(1))
invertci_toy(5)
set.seed(1)
print(runif(1))

gives the following output, where

the first line and the third line are the same,

every random numbers on the second line are the same too.

[1] 0.2655087
[1] 0.3721239 0.3721239 0.3721239 0.3721239 0.3721239
[1] 0.2655087

In addition, by setting seed before invertci_toy, we can get reproducible results. For instance, the following code

set.seed(12345)
invertci_toy(5)

runif(5) # Run something else

set.seed(12345)
invertci_toy(5)

gives:

[1] 0.7209039 0.7209039 0.7209039 0.7209039 0.7209039
[1] 0.7209039 0.7209039 0.7209039 0.7209039 0.7209039

So it is not messing up the seed as in #82 because I am not changing the RNGkind.

a-torgovitsky commented 4 years ago

Ok, I'm convinced! Looks like a great solution.

conroylau commented 4 years ago

Great! I will go ahead and implement this method and see it works well as expected. Thanks!

conroylau commented 4 years ago

Done! Just updated the invertci procedure.

I have ran some tests as follows to make sure it is not messing up with the seed. The set-up of the test is as follows (basically same as in #85).

devtools::load_all()
source("../lpinfer/example/dgp_mixedlogit.R")

dgp <- mixedlogit_dgp()
set.seed(1)
data <- mixedlogit_draw(dgp, n = 2000)
lpm <- lpmodel(A.obs = mixedlogit_Aobs(dgp),
               beta.obs = function(d) mixedlogit_betaobs(d, dgp),
               A.shp = rep(1, nrow(dgp$vdist)),
               beta.shp = 1,
               A.tgt = mixedlogit_Atgt_dfelast(dgp, w2eval = 1, eeval = -1))
invertci.arg <- list(f = fsst,
                     farg = list(lpm = lpm,
                                 data = data,
                                 R = 200),
                     lb0 = 0,
                     ub0 = 1,
                     progress = FALSE)

The result is reproducible:

set.seed(1)
a <- do.call(invertci, invertci.arg)
print(a)
#> Confidence interval: [0.07632, 0.83835]
set.seed(1)
b <- do.call(invertci, invertci.arg)
print(b)
#> Confidence interval: [0.07632, 0.83835]

The behavior of the random number generated looks like it is not messing up with the seed.
- p and r are the same with set.seed(1) when they are generated before and after invertci.
```
set.seed(1)
p <- runif(1)
print(p)
#> [1] 0.2655087
q <- do.call(invertci, invertci.arg)
set.seed(1)
r <- runif(1)
print(r)
#> [1] 0.2655087
```

x and z are different without setting the seed when they are generated before and after invertci.

x <- runif(1)
print(x)
#> [1] 0.3721239
y <- do.call(invertci, invertci.arg)
z <- runif(1)
print(z)
#> [1] 0.754191

The p-value for a particular test point is the same as if running it outside the invertci procedure with the same seed. For instance, based on the details extracted from the summary of a, the p-value when beta.tgt = .09375 is 0.035. (I am skipping many rows as the table is having too many rows).

Iteration       Lower bound   Upper bound   Test point   p-value   Reject?
Left end pt.        0.00000            NA      0.00000   0.00000      TRUE
Right end pt.            NA       1.00000      1.00000   0.00000      TRUE
...
5                   0.06250       0.12500      0.09375   0.03500     FALSE
...

Running the fsst procedure directly yields the same result:

set.seed(1)
fsst(data, lpm, beta.tgt = 0.09375, R = 200)
#> p-value (by data-driven 'lambda'): 0.035

a-torgovitsky commented 4 years ago

Great work!

conroylau / lpinfer

Bootstrap draws of each iteration in `invertci` #88