conroylau / lpinfer

lpinfer: An R Package for Inference in Linear Programs
GNU General Public License v3.0
3 stars 5 forks source link

Something is wrong with the seed #82

Closed a-torgovitsky closed 4 years ago

a-torgovitsky commented 4 years ago

Something inside fsst, perhaps other routines, is messing with the seed in a fundamental way.

rm(list = ls())

devtools::load_all()
source("../lpinfer/example/dgp_mixedlogit.R")

dgp <- mixedlogit_dgp()
lpm <- lpmodel(A.obs = mixedlogit_Aobs(dgp),
               beta.obs = function(d) mixedlogit_betaobs(d, dgp),
               A.shp = rep(1, nrow(dgp$vdist)),
               beta.shp = 1,
               A.tgt = mixedlogit_Atgt_dfelast(dgp, w2eval = 1, eeval = -1))

set.seed(1)
data <- mixedlogit_draw(dgp, n = 2000)
print(estbounds(data = data, lpm))

set.seed(1)
data <- mixedlogit_draw(dgp, n = 2000)
print(estbounds(data = data, lpm)) # same

print(fsst(data = data, lpm, .5)) # the seed gets messed up in here?

set.seed(1)
data <- mixedlogit_draw(dgp, n = 2000)
print(estbounds(data = data, lpm)) # should be the same, but isn't

Gives this

Loading lpinfer
Estimated bounds: [0.38761, 0.54898] 
Estimated bounds: [0.38761, 0.54898] 
p-value (by data-driven 'lambda'): 0.88                                    
Estimated bounds: [0.47243, 0.47504]

I think we don't want anything in lpinfer to mess with the seed. The user should just set that on their own. I know we talked about this before, but can't remember what we decided. Let me know if I am forgetting an important case.

conroylau commented 4 years ago

I cannot reproduce the exact bounds as above, but I can reproduce the problem where the estimated bounds after running fsst are different from the original ones.

The reason for this is due to the step of extracting .Random.seed. In this step, I first set RNGkind("L'Ecuyer-CMRG") and then extract .Random.seed. The reason for doing so is that the future.seed argument in the future_lapply function only accepts a logical, or an integer (of length 1 or 7), or a list of length(X) with pre-generated random seeds. The default of RNGkind by the user may give a .Random.seed with length that is not equal to either 1 or 7, which would lead to an error if I pass it directly to the future.seed argument of future_lapply. For instance, if the user uses RNGkind("Wichmann-Hill"), then the length of .Random.seed is 4:

RNGkind("Wichmann-Hill")
set.seed(2)
.Random.seed
#> [1] 10400 21758  7530 10264

Thus, I propose the following method in extracting the seed instead:

In this method, I do not need to change RNGkind while making sure that the object passed to the future.seed argument of future_lapply has length either 1 or 7. May I know do you think this is okay? Thanks!

a-torgovitsky commented 4 years ago

I guess this is ok, but it seems like a hack, which means it is probably prone to causing more unforeseen problems, especially since neither of us is an expert on the specifics of parallel programming.

If we want to go this route, wouldn't it be easier just to generate a set of random seeds in serial first? These should be the same each draw if the user has set the seed. Then these can be passed to future_lapply with set.seed. That seems much easier than trying to pull things from the internal .Random.seed.


As I said in #72, I really think it would be a good idea to dig into the discussion about this issue in the future package. This is a complicated issue, but one that comes up in every parallel implementation, since it is essential for reproducibility. There is almost certainly a good solution that does not require messing around with internals in this way.

I looked at their GitHub issues and found many posts (both in their repo and elsewhere) that seem like they might provide clues:

So reading some of these would show us the right way to do things.

conroylau commented 4 years ago

I see. For generating a set of random seeds in serial, do you mean first generating a sequence of numbers in the beginning (say via runif) and then pass each of them to future_lapply as seeds when needed?

I will also read the discussions in the future and related packages to see if there is another way to do it. Thanks!

a-torgovitsky commented 4 years ago

I see. For generating a set of random seeds in serial, do you mean first generating a sequence of numbers in the beginning (say via runif) and then pass each of them to future_lapply as seeds when needed?

Yes exactly

a-torgovitsky commented 4 years ago

Although do they even need to be random numbers? That would be something to check. Maybe they can be deterministic and the user will still get a different result if they change set.seed before running a procedure such as fsst. That would also be ok and much easier.

conroylau commented 4 years ago

Just fixed the problem! After reading various posts, I find that the problem can be solved in a much straightforward way by simply setting future.seed = TRUE without altering the RNGkind as in my previous approach.

Using the updated code, the following result is obtained by running the above code:

Estimated bounds: [0.39067, 0.55121] 
Estimated bounds: [0.39067, 0.55121] 
p-value (by data-driven 'lambda'): 0.88                                                   
Estimated bounds: [0.39067, 0.55121] 

The three set of bounds will still be the same if I set other RNG types. For instance, if I set RNGkind("Wichmann-Hill") before running the code, the following result can be obtained:

Estimated bounds: [0.4576, 0.45883] 
Estimated bounds: [0.4576, 0.45883] 
p-value (by data-driven 'lambda'): 0.55                                                   
Estimated bounds: [0.4576, 0.45883] 

Thanks!

a-torgovitsky commented 4 years ago

That's great! Is there a particular post that you read where this was clear? I would like to check it out to make sure I understand.

conroylau commented 4 years ago

I find the third link that you shared above is one of the most relevant and updated posts. It also mentions about the speed issue, which is quite interesting too.