Closed a-torgovitsky closed 4 years ago
I cannot reproduce the exact bounds as above, but I can reproduce the problem where the estimated bounds after running fsst
are different from the original ones.
The reason for this is due to the step of extracting .Random.seed
.
In this step, I first set RNGkind("L'Ecuyer-CMRG")
and then extract .Random.seed
.
The reason for doing so is that the future.seed
argument in the future_lapply
function only accepts a logical
, or an integer
(of length 1 or 7), or a list
of length(X) with pre-generated random seeds.
The default of RNGkind
by the user may give a .Random.seed
with length that is not equal to either 1 or 7, which would lead to an error if I pass it directly to the future.seed
argument of future_lapply
.
For instance, if the user uses RNGkind("Wichmann-Hill")
, then the length of .Random.seed
is 4:
RNGkind("Wichmann-Hill")
set.seed(2)
.Random.seed
#> [1] 10400 21758 7530 10264
Thus, I propose the following method in extracting the seed instead:
.Random.seed
is 1 or 7, directly pass it to the future.seed
argument of future_lapply
..Random.seed
vector to the future.seed
argument of future_lapply
.In this method, I do not need to change RNGkind
while making sure that the object passed to the future.seed
argument of future_lapply
has length either 1 or 7.
May I know do you think this is okay? Thanks!
I guess this is ok, but it seems like a hack, which means it is probably prone to causing more unforeseen problems, especially since neither of us is an expert on the specifics of parallel programming.
If we want to go this route, wouldn't it be easier just to generate a set of random seeds in serial first? These should be the same each draw if the user has set the seed. Then these can be passed to future_lapply
with set.seed
. That seems much easier than trying to pull things from the internal .Random.seed
.
As I said in #72, I really think it would be a good idea to dig into the discussion about this issue in the future
package. This is a complicated issue, but one that comes up in every parallel implementation, since it is essential for reproducibility. There is almost certainly a good solution that does not require messing around with internals in this way.
I looked at their GitHub issues and found many posts (both in their repo and elsewhere) that seem like they might provide clues:
So reading some of these would show us the right way to do things.
I see. For generating a set of random seeds in serial, do you mean first generating a sequence of numbers in the beginning (say via runif
) and then pass each of them to future_lapply
as seeds when needed?
I will also read the discussions in the future
and related packages to see if there is another way to do it. Thanks!
I see. For generating a set of random seeds in serial, do you mean first generating a sequence of numbers in the beginning (say via runif) and then pass each of them to future_lapply as seeds when needed?
Yes exactly
Although do they even need to be random numbers?
That would be something to check. Maybe they can be deterministic and the user will still get a different result if they change set.seed
before running a procedure such as fsst
. That would also be ok and much easier.
Just fixed the problem! After reading various posts, I find that the problem can be solved in a much straightforward way by simply setting future.seed = TRUE
without altering the RNGkind
as in my previous approach.
Using the updated code, the following result is obtained by running the above code:
Estimated bounds: [0.39067, 0.55121]
Estimated bounds: [0.39067, 0.55121]
p-value (by data-driven 'lambda'): 0.88
Estimated bounds: [0.39067, 0.55121]
The three set of bounds will still be the same if I set other RNG types. For instance, if I set RNGkind("Wichmann-Hill")
before running the code, the following result can be obtained:
Estimated bounds: [0.4576, 0.45883]
Estimated bounds: [0.4576, 0.45883]
p-value (by data-driven 'lambda'): 0.55
Estimated bounds: [0.4576, 0.45883]
Thanks!
That's great! Is there a particular post that you read where this was clear? I would like to check it out to make sure I understand.
I find the third link that you shared above is one of the most relevant and updated posts. It also mentions about the speed issue, which is quite interesting too.
Something inside
fsst
, perhaps other routines, is messing with the seed in a fundamental way.Gives this
I think we don't want anything in
lpinfer
to mess with the seed. The user should just set that on their own. I know we talked about this before, but can't remember what we decided. Let me know if I am forgetting an important case.