conroylau / lpinfer

lpinfer: An R Package for Inference in Linear Programs
GNU General Public License v3.0

Probably we don't (and should not) set the seed within any of our functions #16

Closed a-torgovitsky closed 4 years ago

a-torgovitsky commented 4 years ago
set.seed(3)
dkqs(...)

seems like a more sensible way to do things, and then our functions are better specialized and have fewer inputs.

Please confirm that setting the seed before running something like dkqs gives the same results as the current approach, where the seed is set inside dkqs.

conroylau commented 4 years ago

Done! bs_seed is now removed from the module.

I have made a few changes beyond removing the argument bs_seed from the function. Here are the details:

Originally, I set the seed inside the for-loop that does the bootstrap, with the seed depending on the loop index. I made it depend on the loop index because, in parallel programming, the following sample code is not reproducible:

library(foreach)
library(doParallel)
registerDoParallel(2)  # a parallel backend must be registered for %dopar%
set.seed(1)
x = foreach(i = 1:10) %dopar% {
  x = runif(1)
}

while the following code is reproducible:

x = foreach(i = 1:10) %dopar% {
  set.seed(i)
  x = runif(1)
}

The seed has to depend on the loop index (or otherwise vary), because if I fix it to a single number, all draws will be the same:

x = foreach(i = 1:10) %dopar% {
  set.seed(1)
  x = runif(1)
}

To make our functions better specialized with fewer inputs, I found a useful package, doRNG, which generates reproducible results in parallel programming with set.seed() placed outside the for-loop. Thus, the first change I made is replacing %dopar% with %dorng%.
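The pattern can be sketched as follows (a minimal example assuming the doRNG and doParallel packages; runif(1) stands in for the bootstrap draw):

```r
library(foreach)
library(doParallel)
library(doRNG)

registerDoParallel(2)  # register a parallel backend with 2 workers

# With %dorng%, a single set.seed() call outside the loop makes the
# parallel draws reproducible.
set.seed(1)
x1 <- foreach(i = 1:10) %dorng% runif(1)

set.seed(1)
x2 <- foreach(i = 1:10) %dorng% runif(1)

identical(x1, x2)  # TRUE: the two runs produce the same draws
```

Under the hood, %dorng% assigns each iteration its own independent RNG stream derived from the outer seed, so reproducibility does not depend on how iterations are scheduled across workers.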

Then I can set the seed outside the bootstrap for-loop in both the parallel and serial cases. This way, I can also compare whether setting the seed at the beginning of the function or outside the function gives the same results.

I have checked that they give the same results in both the parallel and serial cases. Here is an example comparing the p-values from the dkqs function for the first 100 seeds, which shows that the results are identical if we drop bs_seed and ask the user to set the seed outside the function.
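The equivalence being checked can be illustrated with a toy bootstrap (a hedged sketch; boot_mean and boot_mean_seeded are hypothetical stand-ins for a function like dkqs without and with a seed argument):

```r
# Toy bootstrap of the sample mean; the seed is set by the caller.
boot_mean <- function(x, B = 100) {
  replicate(B, mean(sample(x, replace = TRUE)))
}

# Same bootstrap, but with the seed set inside (the old bs_seed style).
boot_mean_seeded <- function(x, seed, B = 100) {
  set.seed(seed)
  replicate(B, mean(sample(x, replace = TRUE)))
}

set.seed(42)
a <- boot_mean(1:10)             # seed set outside the function
b <- boot_mean_seeded(1:10, 42)  # seed set inside the function

identical(a, b)  # TRUE: both approaches leave the RNG in the same
                 # state when the draws begin, so the draws agree
```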

a-torgovitsky commented 4 years ago

Great!

Could you also check that %dorng% is roughly as fast as %dopar%?

conroylau commented 4 years ago

I just tested the computational time using the dkqs procedure. I did 1,000 simulations and have 1,000 bootstrap replications for each simulation. I used 8 cores.

The %dorng% version takes 35.87693 minutes and the %dopar% version takes 33.83579 minutes. Their performance looks similar, but %dorng% is slightly slower.

a-torgovitsky commented 4 years ago

Ok, here's a better way to benchmark though. Try a faster computation (reduce the number of simulations and/or bootstraps). Then repeat it many times and time how long each repetition takes for each method. You will see that there is a distribution.

By just doing it once you are getting only a single draw from the distribution.
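The suggested procedure can be sketched like this (the helper bench is hypothetical, and sum(rnorm(1e5)) stands in for a fast dkqs run):

```r
# Time a short computation n times and return the elapsed times,
# so the variability across repetitions is visible.
bench <- function(f, n = 30) {
  vapply(seq_len(n), function(i) system.time(f())[["elapsed"]],
         numeric(1))
}

t_elapsed <- bench(function() sum(rnorm(1e5)))
summary(t_elapsed)  # a distribution of run times, not a single draw
```

Comparing summary statistics (or boxplots) of the two timing vectors is more informative than a single long run, where one draw from the tail can dominate the comparison.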

conroylau commented 4 years ago

I see. Do you mean that I should try maybe 50 simulations and 50 bootstraps (which takes around 10 seconds to complete), record the time, repeat this procedure N times, and look at the distribution of the time taken? Thanks!

a-torgovitsky commented 4 years ago

Yes exactly. This is because there is often a surprising amount of variability in how long computations take (due to other processes running, etc.)

conroylau commented 4 years ago

I see, that makes sense: I am running other applications (such as Stata) on my computer at the same time, so the timing difference I posted earlier might reflect those other applications rather than a genuine difference between the two commands.

I will do the simulation accordingly to observe the distribution.

conroylau commented 4 years ago

Just completed the simulation.

For both %dorng% and %dopar%, I conducted 200 replications, where each replication consists of 50 simulations with 50 bootstrap replications using the dkqs function. I used 8 cores on my machine.

The result is as follows: it looks like using %dorng% with the seed set outside dkqs performs slightly better than using %dopar% with the seed set inside dkqs. There is one %dorng% outlier that takes about 50 seconds, but I guess that long run is unrelated to %dorng%.
[image: distribution of run times for %dorng% and %dopar%]

a-torgovitsky commented 4 years ago

That looks great! So there's no reason not to prefer %dorng% -- perfect!

a-torgovitsky commented 4 years ago

Is this complete?

conroylau commented 4 years ago

Yes, it's done!