Closed a-torgovitsky closed 4 years ago
Sorry for the delay in getting back to you on this as I am trying to understand how to use the future
package properly.
From the research that I have done so far, I think using future
is a good idea.
In addition to the two advantages that you have mentioned, it seems to be saving time if the users are running the tests in the lpinfer
package for a few times in parallel because they only need to register the number of cores in the beginning, whereas in the current version, the number of cores are registered inside the function each time when they call the test.
A common problem of running the loops in parallel in future
is similar to the current method of parallelization.
The problem is that I need to first compute the bootstrap beta.obs
outside the parallel loop.
This is illustrated by the following code where the two loops give different results although I set the seed as 1 in both cases.
> plan(multisession, workers = 1)
> set.seed(1)
> r <- listenv::listenv()
> for (i in 1:10) {
+ r[[i]] %<-% {
+ runif(1)
+ }
+ }
> unlist(as.list(r))
[1] 0.26550866 0.37212390 0.57285336 0.90820779 0.20168193 0.89838968 0.94467527 0.66079779 0.62911404 0.06178627
>
> plan(multisession, workers = 8)
> set.seed(1)
> r <- listenv::listenv()
> for (i in 1:10) {
+ r[[i]] %<-% {
+ runif(1)
+ }
+ }
> unlist(as.list(r))
[1] 0.23093935 0.77319639 0.28942275 0.65772145 0.83080366 0.55169474 0.08890901 0.80429786 0.06902059 0.11184654
I am using multisession
instead of multicore
because multicore
is not available in Windows or RStudio.
Another problem is that there are fewer options of using progress bars when I use the future
package.
To test the future
package, I have revised the subsample
function by using the future
package for the loops.
The code can be found here.
On the other hand, I am investigating what is the most appropriate way to set up the loops.
From the examples that I read about the future
package, they suggest the for-loops should be done in the following way (where R
refers to the number of replications):
r <- listenv::listenv()
for (i in 1:R) {
r[[i]] %<-% {
"Some computation here"
}
}
However, I find that if the computation in each bootstrap replication is taking a very short time (say evaluating the solution to a LP using Gurobi), then the overall time used will be longer than not using any parallelization by future
.
Therefore, the way I have done it in the subsample
code is to break the R
bootstrap replications into J
tasks where each of the J
tasks contains I
replications (the last task may contain more than I
replications to make the total number of tasks equals to J
).
This is illustrated in the following pseudo code:
r <- listenv::listenv()
for (j in 1:J) {
r[[j]] %<-% {
rr <- list()
for (i in iseq[[j]]) {
rr[[i]] <- "Some computation here"
}
rr
}
}
In the sample code for the revised subsample
test, I set I
as 400.
On the other hand, I also tried setting up the loop as the followings:
r <- listenv::listenv()
r[[1]] %<-% {
rr <- list()
for (i in 1:R) {
rr[[i]] <- "Some computation here"
}
rr
}
It seems that the above will not be run as parallel because the distribution of the computational time by the above loop for 1 core and 8 cores are nearly the same.
Below is the distribution of the time used when I run the subsampling test 100 times with 8 cores and 2,000 replications using future
and running it 100 times without parallelization.
I am using your data and code in issue #50 as the input for lpmodel
because I want to test if the new version can capture the errors.
It can be seen that the time from using the future
package is less than the current version except for the first time I run it after registering the number of workers to be used (which corresponds to the outlier) and verifies the additional advantage that I mentioned in the beginning.
However, the time saved is not that significant, so I am trying to play with the I
parameter to see how it affects the time used.
In addition, they are done by the multisession
option instead of the multicore
option because I was using RStudio.
Therefore, I think it is a good idea to use the future
package for parallelization.
However, I still need to check what is the most efficient way to write the loops.
I also need to check if there is another way to implement a progress bar when the code consists of several parts (like fsst
) because the current progress bar is rather restrictive.
The problem is that I need to first compute the bootstrap
beta.obs
outside the parallel loop. This is illustrated by the following code where the two loops give different results although I set the seed as 1 in both cases.> plan(multisession, workers = 1) > set.seed(1) > r <- listenv::listenv() > for (i in 1:10) { + r[[i]] %<-% { + runif(1) + } + } > unlist(as.list(r)) [1] 0.26550866 0.37212390 0.57285336 0.90820779 0.20168193 0.89838968 0.94467527 0.66079779 0.62911404 0.06178627 > > plan(multisession, workers = 8) > set.seed(1) > r <- listenv::listenv() > for (i in 1:10) { + r[[i]] %<-% { + runif(1) + } + } > unlist(as.list(r)) [1] 0.23093935 0.77319639 0.28942275 0.65772145 0.83080366 0.55169474 0.08890901 0.80429786 0.06902059 0.11184654
I usually do something like this:
plan(multisession, workers = 8)
base_seed <- 1
r <- listenv::listenv()
for (i in 1:10) {
r[[i]] %<-% {
set.seed(base_seed + i)
runif(1)
}
}
unlist(as.list(r))
Would that help solve this problem?
Another problem is that there are fewer options of using progress bars when I use the
future
package.The progress bars are nice, but it's not essential. Functionality is more important I think.
To test the
future
package, I have revised thesubsample
function by using thefuture
package for the loops. The code can be found here.On the other hand, I am investigating what is the most appropriate way to set up the loops. From the examples that I read about the
future
package, they suggest the for-loops should be done in the following way (whereR
refers to the number of replications):r <- listenv::listenv() for (i in 1:R) { r[[i]] %<-% { "Some computation here" } }
Have you looked at this? https://cran.r-project.org/web/packages/future.apply/index.html
However, I find that if the computation in each bootstrap replication is taking a very short time (say evaluating the solution to a LP using Gurobi), then the overall time used will be longer than not using any parallelization by
future
.
I am very surprised by that. The overhead from parallelization should be tiny compared to solving an LP through Gurobi.
I wonder are you spawning new Gurobi objects for each worker?
I think something like that needs to be done.
You should be seeing something close to linear speedups (using parallel
, future
, or anything else) when parallelizing the bootstrap. That means 4 cores = almost 4x faster than 1 core.
Therefore, the way I have done it in the
subsample
code is to break theR
bootstrap replications intoJ
tasks where each of theJ
tasks containsI
replications (the last task may contain more thanI
replications to make the total number of tasks equals toJ
). This is illustrated in the following pseudo code:r <- listenv::listenv() for (j in 1:J) { r[[j]] %<-% { rr <- list() for (i in iseq[[j]]) { rr[[i]] <- "Some computation here" } rr } }
In the sample code for the revised
subsample
test, I setI
as 400.
This seems error-prone. I would recommend against it.
On the other hand, I also tried setting up the loop as the followings:
r <- listenv::listenv() r[[1]] %<-% { rr <- list() for (i in 1:R) { rr[[i]] <- "Some computation here" } rr }
It seems that the above will not be run as parallel because the distribution of the computational time by the above loop for 1 core and 8 cores are nearly the same.
Again, I don't think that makes sense. Parallelization is used effectively for much less costly computations than solving a linear program. You may want to try profiling the code to check as well.
Below is the distribution of the time used when I run the subsampling test 100 times with 8 cores and 2,000 replications using
future
and running it 100 times without parallelization. I am using your data and code in issue #50 as the input forlpmodel
because I want to test if the new version can capture the errors.It can be seen that the time from using the
future
package is less than the current version except for the first time I run it after registering the number of workers to be used (which corresponds to the outlier) and verifies the additional advantage that I mentioned in the beginning. However, the time saved is not that significant, so I am trying to play with theI
parameter to see how it affects the time used.In addition, they are done by the
multisession
option instead of themulticore
option because I was using RStudio.Therefore, I think it is a good idea to use the
future
package for parallelization. However, I still need to check what is the most efficient way to write the loops. I also need to check if there is another way to implement a progress bar when the code consists of several parts (likefsst
) because the current progress bar is rather restrictive.
I will look into the package you mentioned and look into the code where I apply the parallelization.
Regarding the seed issue, I think that helps to solve the problem.
I originally wrote the tests in a form like that, but I thought we want the seed to be set outside the functions for the tests in the package, so that we can drop the input like base_seed
in our package.
Do you think I should incorporate the base_seed
option back to the package so that it can solve the issue regarding the random numbers? Thanks!
Yes that's a good point, I suppose this conflicts with that. Is it possible for the user to easily work around this by setting the seed on each worker before running our code?
Hm I am not sure about this. I will look into that to see if it is possible.
So far, I cannot find a way of incorporating the seed into the future
package so that the random numbers generated via future_lapply
are the same as the random numbers obtained from a for-loop without parallelization.
However, I have think of a workaround to the problem.
My idea is as follows:
.Random.seed
.future_lapply
function to conduct the bootstrap replications (for each bootstrap replication, use the index to obtain the bootstrap beta.obs
from the bootstrap data and then solve the linear/quadratic programs).Below is a toy example for the above idea:
toy.example <- function(R, data) {
# Step 1: Get the current state of the RNG
seed <- .Random.seed
# Some other computations or functions here, e.g. checking input
# or calculating the parameters here
# Step 2: Generate the list of indices and set the state to make sure it is the same
# as the beginning state provided by the user
id <- list()
.Random.seed <- seed
for (i in 1:R) {
id[[i]] <- sample(data, replace = TRUE)
}
# Step 3: Define the function for resampling and calculating the mean and pass to `future_lapply`
resample_mean <- function(id, data) {
data.bs <- data[id]
return(mean(data.bs))
}
# Compute the mean by future_lapply
means <- future.apply::future_lapply(id, FUN = resample_mean, data = data)
return(unlist(means))
}
If the data is just a vector of integers from 1 to 1000:
data <- 1:1000
Then, the following code
set.seed(1)
mus1 <- NULL
for (i in 1:10) {
data.bs <- sample(data, replace = TRUE)
mus1 <- c(mus1, mean(data.bs))
}
and the following code
library(future)
plan(multisession, workers = 3)
set.seed(1)
mus2 <- toy.example(10, data)
yield the same result.
I am adding step 1 in the procedure just to make sure that the state of the RNG when I draw the indices is the same as the state after the user set the seed.
In addition, running step 2 without parallelization does not take a very long time either. For instance, drawing the resampled indices of a data with 10,000 observations for 1,000 times take less than 0.5 seconds on average without parallelization.
Therefore, may I know do you think the above workaround makes sense?
On the other hand, there is an option of future.seed
in the future_lapply
function.
But setting it as future.seed = k
does not give the same result as putting set.seed(k)
before the code.
In addition, using this method requires an additional input from the user, so I think using the future.seed
option here may not be the best option.
Thank you!
I think it's a good idea, but I don't understand why step 1 is necessary?
One easy workaround is to draw a matrix of indices first (number of observations x number of bootstraps) before going into the parallel routine.
Then pass a column to each worker.
This way there is nothing random going on inside the parallel part.
The downside is it uses more memory, since we have to store the indices all at once.
But if the matrix is declared to be an integer type (I think you use L
in R to do this), then maybe it shouldn't be so bad?
Before doing a workaround like this, I think it's important to explore the seed options that the future
and future.apply
packages have.
Just looking at the manual for future.apply
, they seem to have invested a lot of energy into thinking about this issue.
For example see pg. 10--11 here.
We should use their expertise instead of trying to do it ourselves.
Just from briefly reading the manual, it seems future_lapply
allows one to include a seed for each bootstrap by passing a list of integers to future.seed
.
It is not important that these integers seed the RNG in the same way as one would using set.seed
.
The fact that they are different is probably just because they are using a different RNG algorithm than set.seed
is.
The important thing is that the user gets the same result every time if they use e.g. set.seed(5)
before calling the function.
It looks like future_lapply
has this functionality built into its design.
Yes, I think step 1 of my original approach is not so necessary because the code should not be affecting the seed.
Regarding the seed from future_lapply
, I thought it is important to make sure the information we get from future_lapply
are the same as one would get from set.seed
, and thus, I have suggested the other method.
To use their option, I think passing the state of the RNG to the future.seed
option can achieve the same result every time for a given seed set by set.seed
.
For instance, the following code will give the same list of uniformly distributed numbers every time.
set.seed(5)
seed <- .Random.seed
a <- future.apply::future_lapply(1:2, runif, future.seed = seed)
Hence, the revised workaround to the problem using the future.seed
option for the toy example again is:
toy.example2 <- function(R, data) {
# Step 1: Get the current state of the RNG
seed <- .Random.seed
# Step 2: Define the function for resampling and calculating the mean and pass to `future_lapply`
resample_mean <- function(x, data) {
data.bs <- data[sample(1:length(data), replace = TRUE)]
return(mean(data.bs))
}
# Compute the mean by future_lapply
means <- future.apply::future_lapply(1:R, FUN = resample_mean, data = data, future.seed = seed)
return(unlist(means))
}
Then, doing the following
set.seed(1)
x <- toy.example2(10, data = 1:1000)
give the same results every time.
Do you think this method would make more sense? Thanks!
I think that's a great solution!
Just to be safe, I would also do a quick check that the randomness is being preserved with this method. An easy way to do so is draw many random uniform [0,1] variables in parallel. Then compute the empirical quantile function and compare it to the quantile for a uniform [0,1]. It should look very close if there are many draws and they are indeed randomly generated.
Then I would also check the serial correlation in the draws (i.e. corr of element i with i-1). That should also be close to 0 if everything is working.
Sure, I have generated 1,000 sequences of random numbers using future_lapply
where each sequence has 10,000 numbers generated from runif
.
future::plan(multisession, workers = 7)
r <- list()
R <- 1000
for (i in 1:R) {
set.seed(i)
seed <- .Random.seed
r[[i]] <- unlist(future.apply::future_lapply(rep(100,100), runif, future.seed = seed))
}
The serial correlations of these 1,000 sequences are obtained by:
sercor <- NULL
for (i in 1:R) {
sercor <- c(sercor, arima(r[[i]], order = c(1,0,0))$coef[1])
}
The distribution plot is as follows. They are centered at 0 and are mostly close to 0.
On the other hand, I also looked at their empirical quantile distributions and they are pretty close to the true quantile. I am not sure what is the best way to visualize the results, but I have picked the nine quantiles (0.1 to 0.9) and plot of the deviation of the empirical quantile from the true quantile. I think the deviation are quite small as well.
Ok great, this all looks good.
One way to visualize the quantiles is to construct a QQ-plot: The empirical quantile against the 45 degree line (restricted to [0,1]) as a function of the quantile. You could put a simulation standard error band around it as well.
I see, please find the plot below:
I have added the standard error band around the QQ-plot.
The red dotted line is the 45-degree line.
I think this also verifies that the randomness is being preserved in the future_lapply
method.
I am going to go ahead to modify the tests to incorporate the future_lapply
method. Thanks!
That certainly looks uniform! Great, I will mark this closed.
Oops, I realize now I closed this too soon since the actual implementation of future
still needs to be done.
I have just updated the codes by implementing parallelization with the future_lapply
function.
The README
, functions, tests and example codes have also been updated.
I am still exploring the progressr
package to see if it can be implemented when the test consists of several parts that uses parallelization (e.g. fsst
). I will add back the progress bar if it can be done.
Thanks!
I noticed an issue when I am further testing the functions.
I think the issue is related to the future package and global variables.
The code and data that replicates the issue can be found here.
(This is actually the same as the data set in issue #44 except I changed the test from estbounds
to dkqs
).
When I run the code without using declaring the number of workers and multisession via the plan
function, there is no error.
However, when I run the code with multisession or with more than one worker, all bootstrap replications failed.
When I go to check the error messages via r$df.error
, the error messages are all could not find function "betaobs"
.
Thus, I think it is because it cannot find the betaobs
function that is defined inside the b
function.
If I change the code slightly by dropping the dgp
argument from the beta.obs
function and directly passing it to the beta.obs
component in the lpmodel
, it returns an error of object 'dgp' not found
.
I am trying to see how to solve this issue. I saw there is a common issues page for the future
package that mentions some similar problems, but it is not directly related to the future.apply
package and the problem. I will try to see what will be a workaround to this issue.
How is this coming along?
Sorry that I am still working on it.
For the progress bar, I cannot find one that works with the future package and allow me to change the starting index of the progress bar. So, I am unable to integrate a progress bar in the FSST procedure where it consists of several future_lapply
function to compute the objects (e.g. beta.star, bootstrap cone test statistics).
On the other hand, I am still trying to find a proper workaround for the issue that is related to the global variables.
It's ok not to have a progress bar.
The global problem you mention seems like it may have to do with whether the namespace is correctly defined on the workers? Have you tried a different example?
One solution that I can think of is to use the future.globals
option in the future_lapply
function.
If the user-defined functions and objects that are used inside the lpmodel
object (in this case, they are betaobs
and dgp
) are passed as a list to future.globals
, then the inference procedures can be run without error.
So, I am thinking if an appropriate solution is to ask the user to pass them to the inference procedure directly if they want to use more than one worker.
Take the dkqs
procedure as an example, I can add ellipsis to the function arguments so the user-defined functions and objects can be passed to the function.
The data from issue 44 and the revised dkqs
code can be found here.
For instance, since betaobs
and dgp
are the user-defined function and object that are used inside the beta.obs
component of lpmodel
, users can pass them to the revised dkqs
function as follows if they use more than one workers:
plan(multisession, workers = 2)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1, betaobs = betaobs, dgp = dgp)
If they use only one worker, then they do not need to specify them and the following code can be run properly:
plan(multisession, workers = 1)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1)
Do you think this solution makes sense? Thanks!
Re: progress bar, I think I misread what you wrote before:
So, I am unable to integrate a progress bar in the FSST procedure where it consists of several future_lapply function to compute the objects (e.g. beta.star, bootstrap cone test statistics).
Does this mean that you could have different progress bars for each section of the procedure? e.g. one for beta.star, one for bootstrap cone test, etc.? That would be preferable to no status bar.
Regarding the other problem, that doesn't seem like such a great solution to me. Have you tried another example besides the one from issue #44? Maybe this is something to do with how I defined the functions there.
Regarding the progress bar, yes that is what I meant.
There can be one bar for each of the procedures that uses the future_lapply
function and each of them has to start from 0%.
Regarding the other issue, I have tried other examples.
In the following example, since the betaobs
function does not contain other user-defined functions or objects that are defined outside betaobs
, then there is no error.
rm(list = ls())
library(future)
library(lpinfer)
N <- nrow(sampledata)
J1 <- length(unique(sampledata[,"Y"]))
yp <- seq(0, 1, .1)
# Construct objects in `lpmodel`
Aobs.full <- cbind(matrix(rep(0, J1 * J1), nrow = J1), diag(1, J1))
Atgt <- matrix(c(yp, yp), nrow = 1)
Ashp <- matrix(rep(1, ncol(Aobs.full)), nrow = 1)
# Construct beta.obs (a function)
betaobs <- function(data) {
beta <- NULL
yp <- seq(0, 1, .1)
n <- nrow(data)
for (i in seq_along(yp)) {
beta.i <- sum((data[,"Y"] == yp[i]) * (data[,"D"] == 1))/n
beta <- c(beta, c(beta.i))
}
beta <- as.matrix(beta)
return(list(beta = beta,
var = diag(length(yp))))
}
# Define the lpmodel object
lpm.full <- lpmodel(A.obs = Aobs.full,
A.tgt = Atgt,
A.shp = Ashp,
beta.obs = betaobs,
beta.shp = c(1))
# No error if more than one worker is used
plan(multisession, workers = 2)
r <- dkqs(sampledata, lpmodel = lpm.full, beta.tgt = .36, tau = .1)
Thanks!
I'm a bit confused. Your debug44-72.R
script looks like this:
rm(list = ls())
library("tidyverse")
library("lpinfer")
library("future")
load("issue44.RData")
betaobs <- function(data, dgp) {
w <- dplyr::as_tibble(dgp$wdist[,-1])
cn <- colnames(w)
w$bin <- seq_len(nrow(w))
d <- dplyr::right_join(w, data, by = cn) # merge the bin indicator into data
r <- lm(data = d, y ~ factor(bin) + 0)
vc <- sandwich::vcovHC(r, type = "HC1")
if (length(r$coefficients) != nrow(w)) {
print(r$coefficients)
stop("Sample probability is not defined, need to quit.")
}
return(list(betaobs = r$coefficients, betaobs_vc = vc))
}
b <- function(d) betaobs(d, dgp)
lpm <- lpinfer::lpmodel(A.obs = Aobs, beta.obs = b,
A.shp = Ashp, beta.shp = 1,
A.tgt = Atgt)
plan(multisession, workers = 1)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1) # ok
plan(multisession, workers = 2)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1) # not ok
plan(multisession, workers = 2)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1,
betaobs = betaobs, dgp = dgp) # ok
How is that different from the one you just posted? It doesn't look like this betaobs
function depends on any user-created objects, unless I am missing one.
In the debug44-72.R
script, the function that is passed to the beta.obs
component of lpmodel
is the function b
.
The function b
depends on the user-defined function betaobs
and object dgp
respectively.
Hence, if I simply pass the object lpm
to future_lapply
without specifying the argument future.globals = list(betaobs = betaobs, dgp = dgp)
, the function will return an error and say they are unable to find the object.
On the other hand, in the last comment, betaobs
is the function that is passed to the beta.obs
component of lpmodel
.
Everything inside the betaobs
function that I have posted in the last comment are hard coded except data
, so there will be no error.
Got it, thanks for the clarification What if you specify
future.globals = list(betaobs = betaobs, dgp = dgp)
before calling dkqs
?
i.e., we leave it up to the user to do this outside of our routines.
Does that work?
Alternatively, have you seen this?
https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html
under "Missing globals"
That looks like this problem with lpm
playing the role of x
, doesn't it?
Or perhaps with lpm$beta.obs
playing the role of x
.
Does the simple workaround provided there work here?
Also, regarding the progress bar, we should just have one for each sub-procedure then. That is better than not having anything.
future.globals
is an argument in the future_lapply
function, so it cannot be specified before calling our procedures.
I think the workarounds there does not work because x
cannot be inserted at the front of future_lapply
, as in the future
command. On the other hand, it seems that the method of specifying the globals in the link is similar to specifying future.globals
in future_lapply
.
I have just think of another workaround to the issue.
It is to pass the lpmodel
object to the X
argument directly in future_lapply
, where X
is a list of vector-like object.
I have tried this in a toy example and it works.
The toy example is as follows:
# Function and object defined outside the `c` function
a <- function(x) x^2
b <- 1
# The function below is like the `b` function in the debug44-72.R script
c <- function(x) {
a(x) + b
}
# The following is like a function that represents a bootstrap replication in the procedure
bs <- function(x, f) f(x)^2
# Example 1: The following does not work
future.apply::future_lapply(1:2,
FUN = bs,
f = c)
# Example 2: The following works
future.apply::future_lapply(1:2,
FUN = bs,
f = c,
future.globals = list(a = a, b = b))
# Example 3: Alternatively, if I append the function bs to the each element of 1:3, I do not need to specify future.globals
x.list <- list(list(1, c), list(2, c))
bs.new <- function(x) (x[[2]](x[[1]]))^2
future.apply::future_lapply(x.list,
FUN = bs.new)
In example 1, the error message is Error in a(x) : could not find function "a"
.
In example 2 and 3, the output are:
[[1]]
[1] 4
[[2]]
[1] 25
Do you think the workaround of example 3 is fine?
Re the progress bar, sure I will add them back for each sub-procedure. Thanks!
Re the progress bar, I am implementing a bar for each sub-progress using the progressr
package.
Since the bar will start from 0% again when it is evaluating another part of the procedure, or when it is evaluating extra bootstrap replications, I plan to indicate a short phrase to describe the procedure it is running on the RHS of the bar to avoid confusion. The phrase could be "Computing xxx bootstrap replications" or "Computing xxx cone test statistics", where xxx is the total number of replications.
If there are some errors in the first R bootstrap replications so that extra replications are required, it will print "Computing xxx extra bootstrap replications".
A sample output is shown in the following gif when the dkqs
procedure is used:
Do you think this would be okay? Thanks!
Example 2 looks like a great solution, very clean and like it uses future_lapply
in the way it was intended. Example 3 looks a little more like a hack. Is there a drawback to Example 2?
Yes I think your progress bar looks great!
For example 2, I think the drawback is that future.globals
has to be something that is specified by the user.
This is because we will not know in advance what are the names of the functions or objects the user will use inside the lpmodel
object (this corresponds to the objects a
and b
in the example).
Hence, I think we need to add another optional argument in the testing procedure for the users to provide the names and pass them to future.globals
.
I see. That's not great. In Example 3 you do not need to do this?
Yes, I do not need to specify future.globals
in example 3.
I have just tried to implement the method in example 3 in the dkqs
code and it works without specifying future.globals
too.
But in that case is this code something the user will need to specify?
x.list <- list(list(1, c), list(2, c))
bs.new <- function(x) (x[[2]](x[[1]]))^2
Or is that something you can do programmatically within DKQS?
The overarching concern here is we don't want to make life difficult for the user.
This is something that can be done within dkqs
.
Sorry for being unclear on which are the objects being created by the user in the example.
In the toy example:
c
is referring to the lpmodel
object, so this is something created by the user.bs
function is referring to a function inside the dkqs
procedure that computes one bootstrap replication.In example 3, I am rewriting bs
as bs.new
(equivalently, I am rewriting a function inside the dkqs
procedure).
The object x.list
is appending the function c
(equivalently, the lpmodel
object in dkqs
) to the list of indices 1:2 in the example. This step is also something that is done inside the dkqs
code.
So I think this change can be done programmatically within dkqs
.
Ok, thanks -- I think I get it now. I guess one downside is that if the lpmodel
object is large, then we are creating one copy of it per BS draw. Is that right?
Yes, I think so.
On the other hand, I am thinking the memory issue of large lpmodel
object can be eliminated if I do not create x.list
explicitly, i.e. in the example above, instead of creating the x.list
object and then pass it to future_lapply
, I create this object in the function argument as follows:
bs.new <- function(x) (x[[2]](x[[1]]))^2
future.apply::future_lapply(lapply(as.list(1:2), FUN = function(x) append(x, c)),
FUN = bs.new)
Do you think this would help?
I'm not sure, I think you'd want to check the memory usage formally with some tool to see.
More generally, it seems like this is a common design issue that other people using the future
packages must be running into. Maybe you should ask for help on StackExchange? I often get good advice there if I formulate a clear question and minimal working example.
I tried to use profiling to check the memory allocated and reallocated in the two programming methods via the profvis
package.
Below, I refer the method as method A if I first create an additional object that combines the list of indices with lpmodel
(i.e. the object x.list
in the above example) and then pass it to future_lapply
.
I refer the method as method B if I use a lapply
function inside the future_lapply
function to combine the indices with lpmodel
without create a new object explicitly (i.e. the method that I mentioned in my last comment).
According to the output of the profiling (please see here), I find that memory allocated and deallocated are slightly lower in method B if I compare the entire dkqs
function.
On the other hand, if I only focus on the part that uses future_lapply
function, the memory allocated in both methods are similar, while the memory deallocated is smaller in method B.
I also run a Monte Carlo simulation to check the time used for the two methods.
I repeat the dkqs
procedure for 100 times, set R = 500
and use two workers in future
.
The plot is as follows.
It looks like that these two methods are taking roughly a similar time.
Ok, this is good, but I think we also want to check how many copies R is making in the background.
For example, see here http://adv-r.had.co.nz/memory.html
Skimming that, it looks like we might be in luck in that R is smart enough not to create multiple copies of an lpmodel
.
But we want to be sure before implementing.
These lpmodel
s could in principle be very, very large in some applications, so we don't want bsreps = 500
of them hanging around.
However, I really think you should check through StackExchange or even via a post on the future
github repo to make sure this is the right way to go. We might be missing something important.
I see. Sure, I am formulating a minimal working example to demonstrate this issue when future_lapply
function is used inside a package. I will post it on the future
GitHub repo.
I also checked on StackExchange, and some suggested adding the names of the global variables to plan(...)
. However, it does not work in my examples, so I will include them in the post too.
Ok, great. I expect you will get a very high-quality answer if you have a clear MWE.
The StackExchange solution is not great (even if it did work) since it requires substantial thinking from the user. We want to make it as easy for them as possible.
I have just added back the progress bar to all the procedures in a format that is similar to the sample gif that I have posted earlier.
What was the status of this? Did we arrive at a good solution and was it implemented?
I have posted an issue on the future.apply
GitHub repository, but the author has not replied yet.
Ok. I would like to make the repository public in a few days.
Can you implement the solution that you proposed above?
Then we can change it later if we find there is a better way to use future.apply
.
Sure, I will implement the solution above in the testing procedures and check them against the examples that had problems before.
Done! I have just implemented the temporary solution.
I have tested that the solution works in the four testing procedures that uses future_lapply
.
They return are now able to return p-values under the above test case when the number of workers is more than 1.
Thanks!
Excellent, great work. Let's revisit this in case you hear back from the developer of future
with a better suggestion.
Sure, no problem!
Could you do some research on it?
In my understanding,
future
works as follows:Thus, instead of having a
cores
option, the user would do something like this:The advantage is that we do not need separate code for multicore vs. distributed computing on a cluster. That seems kind of nice.
Are there disadvantages?