conroylau / lpinfer

lpinfer: An R Package for Inference in Linear Programs
GNU General Public License v3.0

Would it be better to use the `future` package for parallelization? #72

Closed a-torgovitsky closed 4 years ago

a-torgovitsky commented 4 years ago

Could you do some research on it?

In my understanding, future works as follows:

  1. We define areas of the code that can be run in parallel.
  2. The user defines whether and how they want to use resources.

Thus, instead of having a cores option, the user would do something like this:

library("future")
plan(multicore, workers = 8)
### then run a function from lpinfer

The advantage is that we do not need separate code for multicore vs. distributed computing on a cluster. That seems kind of nice.
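For example (a rough sketch only; the second plan simply stands in for a cluster setup, with localhost workers as placeholders for remote hostnames), the same lpinfer call could be distributed just by changing the plan:

library("future")
# same user-facing code; only the plan changes (two localhost workers stand in
# for remote machines, whose hostnames could be substituted)
plan(cluster, workers = c("localhost", "localhost"))
### then run a function from lpinfer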

Are there disadvantages?

conroylau commented 4 years ago

Sorry for the delay in getting back to you on this; I have been trying to understand how to use the future package properly.

From the research that I have done so far, I think using future is a good idea. In addition to the two advantages that you mentioned, it also saves time when users run the tests in the lpinfer package several times in parallel, because they only need to register the number of cores once at the beginning, whereas in the current version the cores are registered inside the function each time a test is called.

One problem with running the loops in parallel via future is the same as in the current parallelization method: I need to compute the bootstrap beta.obs outside the parallel loop first. This is illustrated by the following code, where the two loops give different results even though I set the seed to 1 in both cases.

> plan(multisession, workers = 1)
> set.seed(1)
> r <- listenv::listenv()
> for (i in 1:10) {
+   r[[i]] %<-% {
+       runif(1)
+   }  
+ }
> unlist(as.list(r))
 [1] 0.26550866 0.37212390 0.57285336 0.90820779 0.20168193 0.89838968 0.94467527 0.66079779 0.62911404 0.06178627
> 
> plan(multisession, workers = 8)
> set.seed(1)
> r <- listenv::listenv()
> for (i in 1:10) {
+   r[[i]] %<-% {
+       runif(1)
+   }  
+ }
> unlist(as.list(r))
 [1] 0.23093935 0.77319639 0.28942275 0.65772145 0.83080366 0.55169474 0.08890901 0.80429786 0.06902059 0.11184654

I am using multisession instead of multicore because multicore is not available on Windows or in RStudio.

Another problem is that there are fewer options for progress bars when I use the future package.

To test the future package, I have revised the subsample function by using the future package for the loops. The code can be found here.

On the other hand, I am investigating the most appropriate way to set up the loops. The examples that I have read about the future package suggest that the for-loops should be written in the following way (where R refers to the number of replications):

r <- listenv::listenv()
for (i in 1:R) {
  r[[i]] %<-% {
    "Some computation here"
  }
}

However, I find that if the computation in each bootstrap replication takes a very short time (say, evaluating the solution to an LP using Gurobi), then the overall time used is longer than without any parallelization by future.

Therefore, the way I have done it in the subsample code is to break the R bootstrap replications into J tasks, where each task contains I replications (the last task may contain more than I replications so that the total number of tasks equals J). This is illustrated in the following pseudo code:

r <- listenv::listenv() 
for (j in 1:J) {
  r[[j]] %<-% {
    rr <- list()
    for (i in iseq[[j]]) {
      rr[[i]] <- "Some computation here"
    }
    rr
  }
}

In the sample code for the revised subsample test, I set I as 400.
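For concreteness, here is one way the index sets iseq could be constructed (a sketch only; the values of R and I are illustrative):

R <- 2000
I <- 400
J <- max(1, floor(R / I))
# first J - 1 tasks get I replications each; the last task absorbs the remainder
iseq <- split(seq_len(R), pmin(ceiling(seq_len(R) / I), J))
lengths(iseq)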

On the other hand, I also tried setting up the loop as follows:

r <- listenv::listenv() 
r[[1]] %<-% {
  rr <- list()
  for (i in 1:R) {
    rr[[i]] <- "Some computation here"
  }
  rr
}

It seems that the above does not run in parallel, because the distribution of the computation time for this loop is nearly the same with 1 core and with 8 cores.

Below is the distribution of the time used when I run the subsampling test 100 times with 8 cores and 2,000 replications using future, and when I run it 100 times without parallelization. I am using your data and code from issue #50 as the input for lpmodel because I want to test whether the new version can capture the errors.

(image: distribution of run times with and without future)

It can be seen that the time with the future package is less than with the current version, except for the first run after registering the number of workers (which corresponds to the outlier); this verifies the additional advantage that I mentioned at the beginning. However, the time saved is not that significant, so I am playing with the I parameter to see how it affects the time used.

In addition, these runs use the multisession option instead of the multicore option because I was using RStudio.

Therefore, I think it is a good idea to use the future package for parallelization. However, I still need to check the most efficient way to write the loops. I also need to check whether there is another way to implement a progress bar when the code consists of several parts (like fsst), because the current progress bar is rather restrictive.

a-torgovitsky commented 4 years ago

I need to compute the bootstrap beta.obs outside the parallel loop first. This is illustrated by the following code, where the two loops give different results even though I set the seed to 1 in both cases.

> plan(multisession, workers = 1)
> set.seed(1)
> r <- listenv::listenv()
> for (i in 1:10) {
+   r[[i]] %<-% {
+       runif(1)
+   }  
+ }
> unlist(as.list(r))
 [1] 0.26550866 0.37212390 0.57285336 0.90820779 0.20168193 0.89838968 0.94467527 0.66079779 0.62911404 0.06178627
> 
> plan(multisession, workers = 8)
> set.seed(1)
> r <- listenv::listenv()
> for (i in 1:10) {
+   r[[i]] %<-% {
+       runif(1)
+   }  
+ }
> unlist(as.list(r))
 [1] 0.23093935 0.77319639 0.28942275 0.65772145 0.83080366 0.55169474 0.08890901 0.80429786 0.06902059 0.11184654

I usually do something like this:

plan(multisession, workers = 8)
base_seed <- 1
r <- listenv::listenv()
for (i in 1:10) {
  r[[i]] %<-% {
    set.seed(base_seed + i)
    runif(1)
  }
}
unlist(as.list(r))

Would that help solve this problem?

Another problem is that there are fewer options for progress bars when I use the future package.

The progress bars are nice, but they're not essential. Functionality is more important, I think.

To test the future package, I have revised the subsample function by using the future package for the loops. The code can be found here.

On the other hand, I am investigating the most appropriate way to set up the loops. The examples that I have read about the future package suggest that the for-loops should be written in the following way (where R refers to the number of replications):

r <- listenv::listenv()
for (i in 1:R) {
  r[[i]] %<-% {
    "Some computation here"
  }
}

Have you looked at this? https://cran.r-project.org/web/packages/future.apply/index.html

However, I find that if the computation in each bootstrap replication takes a very short time (say, evaluating the solution to an LP using Gurobi), then the overall time used is longer than without any parallelization by future.

I am very surprised by that. The overhead from parallelization should be tiny compared to solving an LP through Gurobi. I wonder: are you spawning new Gurobi objects for each worker? I think something like that needs to be done. You should be seeing something close to linear speedups (using parallel, future, or anything else) when parallelizing the bootstrap. That means 4 cores = almost 4x faster than 1 core.
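As a rough way to check this (a sketch only; the sleep call is a stand-in for one bootstrap replication, and the timings are illustrative):

library(future)
slow_task <- function(i) { Sys.sleep(0.1); i }  # stand-in for solving one LP
plan(multisession, workers = 1)
t1 <- system.time(future.apply::future_lapply(1:40, slow_task))["elapsed"]
plan(multisession, workers = 4)
t4 <- system.time(future.apply::future_lapply(1:40, slow_task))["elapsed"]
t1 / t4  # should be close to 4 if the per-task overhead is small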

Therefore, the way I have done it in the subsample code is to break the R bootstrap replications into J tasks, where each task contains I replications (the last task may contain more than I replications so that the total number of tasks equals J). This is illustrated in the following pseudo code:

r <- listenv::listenv() 
for (j in 1:J) {
  r[[j]] %<-% {
    rr <- list()
    for (i in iseq[[j]]) {
      rr[[i]] <- "Some computation here"
    }
    rr
  }
}

In the sample code for the revised subsample test, I set I as 400.

This seems error-prone. I would recommend against it.

On the other hand, I also tried setting up the loop as follows:

r <- listenv::listenv() 
r[[1]] %<-% {
  rr <- list()
  for (i in 1:R) {
    rr[[i]] <- "Some computation here"
  }
  rr
}

It seems that the above does not run in parallel, because the distribution of the computation time for this loop is nearly the same with 1 core and with 8 cores.

Again, I don't think that makes sense. Parallelization is used effectively for much less costly computations than solving a linear program. You may want to try profiling the code to check as well.

Below is the distribution of the time used when I run the subsampling test 100 times with 8 cores and 2,000 replications using future, and when I run it 100 times without parallelization. I am using your data and code from issue #50 as the input for lpmodel because I want to test whether the new version can capture the errors.

(image: distribution of run times with and without future)

It can be seen that the time with the future package is less than with the current version, except for the first run after registering the number of workers (which corresponds to the outlier); this verifies the additional advantage that I mentioned at the beginning. However, the time saved is not that significant, so I am playing with the I parameter to see how it affects the time used.

In addition, these runs use the multisession option instead of the multicore option because I was using RStudio.

Therefore, I think it is a good idea to use the future package for parallelization. However, I still need to check the most efficient way to write the loops. I also need to check whether there is another way to implement a progress bar when the code consists of several parts (like fsst), because the current progress bar is rather restrictive.

conroylau commented 4 years ago

I will look into the package you mentioned and look into the code where I apply the parallelization.

Regarding the seed issue, I think that helps to solve the problem. I originally wrote the tests in a form like that, but I thought we wanted the seed to be set outside the functions in the package, so that we could drop inputs like base_seed. Do you think I should incorporate the base_seed option back into the package so that it solves the issue with the random numbers? Thanks!

a-torgovitsky commented 4 years ago

Yes that's a good point, I suppose this conflicts with that. Is it possible for the user to easily work around this by setting the seed on each worker before running our code?

conroylau commented 4 years ago

Hm I am not sure about this. I will look into that to see if it is possible.

conroylau commented 4 years ago

So far, I cannot find a way of incorporating the seed into the future package so that the random numbers generated via future_lapply are the same as those obtained from a for-loop without parallelization. However, I have thought of a workaround to the problem.

My idea is as follows:

  1. Get the current RNG state through .Random.seed.
  2. Use the state obtained from step 1 to generate a list of indices that represent how the data is resampled.
  3. Pass this list to the future_lapply function to conduct the bootstrap replications (for each bootstrap replication, use the index to obtain the bootstrap beta.obs from the bootstrap data and then solve the linear/quadratic programs).

Below is a toy example for the above idea:

toy.example <- function(R, data) {
  # Step 1: Get the current state of the RNG
  seed <- .Random.seed

  # Some other computations or functions here, e.g. checking input
  # or calculating the parameters here

  # Step 2: Generate the list of indices and set the state to make sure it is the same
  # as the beginning state provided by the user
  id <- list()
  # restore the RNG state captured in step 1 (assign into the global environment;
  # a plain `.Random.seed <- seed` inside a function would only create a local copy)
  assign(".Random.seed", seed, envir = globalenv())
  for (i in 1:R) {
    id[[i]] <- sample(data, replace = TRUE)
  }

  # Step 3: Define the function for resampling and calculating the mean and pass to `future_lapply`
  resample_mean <- function(id, data) {
    data.bs <- data[id]
    return(mean(data.bs))
  }
  # Compute the mean by future_lapply
  means <- future.apply::future_lapply(id, FUN = resample_mean, data = data)

  return(unlist(means))
}

If the data is just a vector of integers from 1 to 1000:

data <- 1:1000

Then, the following code

set.seed(1)
mus1 <- NULL
for (i in 1:10) {
  data.bs <- sample(data, replace = TRUE)
  mus1 <- c(mus1, mean(data.bs))
}

and the following code

library(future)
plan(multisession, workers = 3)
set.seed(1)
mus2 <- toy.example(10, data)

yield the same result.

I am adding step 1 in the procedure just to make sure that the state of the RNG when I draw the indices is the same as the state after the user set the seed.

In addition, running step 2 without parallelization does not take very long either. For instance, drawing the resampled indices for a dataset with 10,000 observations 1,000 times takes less than 0.5 seconds on average without parallelization.

Therefore, do you think the above workaround makes sense?

On the other hand, there is a future.seed option in the future_lapply function, but setting future.seed = k does not give the same result as putting set.seed(k) before the code. In addition, using this method requires an additional input from the user, so I think the future.seed option may not be the best choice here.

Thank you!

a-torgovitsky commented 4 years ago

I think it's a good idea, but I don't understand why step 1 is necessary?

One easy workaround is to draw a matrix of indices first (number of observations x number of bootstraps) before going into the parallel routine. Then pass a column to each worker. This way there is nothing random going on inside the parallel part. The downside is it uses more memory, since we have to store the indices all at once. But if the matrix is declared to be an integer type (I think you use L in R to do this), then maybe it shouldn't be so bad?
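A minimal sketch of this idea (n and R are illustrative values):

n <- 1000   # number of observations
R <- 2000   # number of bootstrap replications
# draw all bootstrap indices up front; sample.int() already returns integers,
# so no explicit L suffix is needed
idx <- matrix(sample.int(n, n * R, replace = TRUE), nrow = n, ncol = R)
typeof(idx)  # "integer"
# replication r would then receive idx[, r] and resample the data as data[idx[, r], ],
# so nothing random happens inside the parallel part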


Before doing a workaround like this, I think it's important to explore the seed options that the future and future.apply packages have. Just looking at the manual for future.apply, they seem to have invested a lot of energy into thinking about this issue. For example see pg. 10--11 here. We should use their expertise instead of trying to do it ourselves.

Just from briefly reading the manual, it seems future_lapply allows one to include a seed for each bootstrap by passing a list of integers to future.seed. It is not important that these integers seed the RNG in the same way as one would using set.seed. The fact that they are different is probably just because they are using a different RNG algorithm than set.seed is. The important thing is that the user gets the same result every time if they use e.g. set.seed(5) before calling the function. It looks like future_lapply has this functionality built into its design.
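For instance, my reading of the future.apply documentation is that something like the following should be reproducible across runs and across worker counts (a sketch only, not the lpinfer code):

set.seed(5)
# future.seed = TRUE asks future_lapply to generate one parallel-safe
# L'Ecuyer-CMRG seed per element, derived from the current RNG state
draws <- future.apply::future_lapply(1:10, function(i) runif(1), future.seed = TRUE)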

conroylau commented 4 years ago

Yes, I think step 1 of my original approach is not really necessary, because the code should not affect the RNG state.

Regarding the seed from future_lapply, I thought it was important to make sure that the numbers we get from future_lapply are the same as those one would get after set.seed, which is why I suggested the other method.

To use their option, I think passing the state of the RNG to the future.seed option can achieve the same result every time for a given seed set by set.seed.

For instance, the following code will give the same list of uniformly distributed numbers every time.

set.seed(5)
seed <- .Random.seed
a <- future.apply::future_lapply(1:2, runif, future.seed = seed)

Hence, the revised workaround using the future.seed option, applied again to the toy example, is:

toy.example2 <- function(R, data) {
  # Step 1: Get the current state of the RNG
  seed <- .Random.seed

  # Step 2: Define the function for resampling and calculating the mean and pass to `future_lapply`
  resample_mean <- function(x, data) {
    data.bs <- data[sample(1:length(data), replace = TRUE)]
    return(mean(data.bs))
  }
  # Compute the mean by future_lapply
  means <- future.apply::future_lapply(1:R, FUN = resample_mean, data = data, future.seed = seed)

  return(unlist(means))
}

Then, doing the following

set.seed(1)
x <- toy.example2(10, data = 1:1000)

gives the same results every time.

Do you think this method would make more sense? Thanks!

a-torgovitsky commented 4 years ago

I think that's a great solution!

Just to be safe, I would also do a quick check that the randomness is being preserved with this method. An easy way to do so is to draw many random uniform [0,1] variables in parallel. Then compute the empirical quantile function and compare it to the quantile function of a uniform [0,1]. They should look very close if there are many draws and they are indeed randomly generated.

Then I would also check the serial correlation in the draws (i.e. corr of element i with i-1). That should also be close to 0 if everything is working.

conroylau commented 4 years ago

Sure, I have generated 1,000 sequences of random numbers using future_lapply where each sequence has 10,000 numbers generated from runif.

future::plan(multisession, workers = 7)
r <- list()
R <- 1000
for (i in 1:R) {
  set.seed(i)
  seed <- .Random.seed
  r[[i]] <- unlist(future.apply::future_lapply(rep(100,100), runif, future.seed = seed))
}

The serial correlations of these 1,000 sequences are obtained by:

sercor <- NULL
for (i in 1:R) {
  sercor <- c(sercor, arima(r[[i]], order = c(1,0,0))$coef[1])
}

The distribution plot is as follows. The estimated serial correlations are centered at 0 and are mostly close to 0.
(image: distribution of the estimated serial correlations)

On the other hand, I also looked at their empirical quantiles, and they are pretty close to the true quantiles. I am not sure what is the best way to visualize the results, but I have picked nine quantiles (0.1 to 0.9) and plotted the deviation of the empirical quantile from the true quantile.
(image: deviations of the empirical quantiles from the true quantiles)
I think the deviations are quite small as well.

a-torgovitsky commented 4 years ago

Ok great, this all looks good.

One way to visualize the quantiles is to construct a QQ-plot: plot the empirical quantile against the 45-degree line (restricted to [0,1]) as a function of the quantile. You could put a simulation standard error band around it as well.
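A minimal sketch of what I have in mind (assuming u collects the pooled uniform draws from your earlier check; for a Uniform(0,1) the pointwise standard error of the empirical p-quantile is sqrt(p(1 - p)/n)):

u <- unlist(r)                        # pooled draws from the parallel runs
p <- seq(0.01, 0.99, by = 0.01)
se <- sqrt(p * (1 - p) / length(u))   # simulation standard error for a uniform
plot(p, quantile(u, probs = p), type = "l",
     xlab = "quantile p", ylab = "empirical quantile")
abline(0, 1, col = "red", lty = 2)    # 45-degree line
lines(p, p + 1.96 * se, col = "grey")
lines(p, p - 1.96 * se, col = "grey")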

conroylau commented 4 years ago

I see, please find the plot below:
(image: QQ-plot with simulation standard error band)

I have added the standard error band around the QQ-plot. The red dotted line is the 45-degree line. I think this also verifies that the randomness is being preserved in the future_lapply method.

I am going to go ahead to modify the tests to incorporate the future_lapply method. Thanks!

a-torgovitsky commented 4 years ago

That certainly looks uniform! Great, I will mark this closed.

a-torgovitsky commented 4 years ago

Oops, I realize now that I closed this too soon since the actual implementation of future still needs to be done.

conroylau commented 4 years ago

I have just updated the code to implement parallelization with the future_lapply function. The README, functions, tests, and example code have also been updated.

I am still exploring the progressr package to see if it can be implemented when the test consists of several parts that use parallelization (e.g. fsst). I will add back the progress bar if it can be done.

Thanks!

conroylau commented 4 years ago

I noticed an issue when I am further testing the functions.

I think the issue is related to the future package and global variables. The code and data that replicates the issue can be found here. (This is actually the same as the data set in issue #44 except I changed the test from estbounds to dkqs).

When I run the code without declaring the number of workers and multisession via the plan function, there is no error. However, when I run the code with multisession or with more than one worker, all bootstrap replications fail. When I check the error messages via r$df.error, they are all could not find function "betaobs". Thus, I think it is because the workers cannot find the betaobs function that is used inside the b function.

If I change the code slightly by dropping the dgp argument from the beta.obs function and directly passing it to the beta.obs component in the lpmodel, it returns an error of object 'dgp' not found.

I am trying to see how to solve this issue. I saw that there is a common issues page for the future package that mentions some similar problems, but it does not directly address the future.apply package or this particular problem. I will try to find a workaround for this issue.

a-torgovitsky commented 4 years ago

How is this coming along?

conroylau commented 4 years ago

Sorry, I am still working on it. For the progress bar, I cannot find one that works with the future package and allows me to change the starting index of the progress bar. So, I am unable to integrate a progress bar into the FSST procedure, which consists of several future_lapply calls to compute the objects (e.g. beta.star, bootstrap cone test statistics).

On the other hand, I am still trying to find a proper workaround for the issue that is related to the global variables.

a-torgovitsky commented 4 years ago

It's ok not to have a progress bar.

The global problem you mention seems like it may have to do with whether the namespace is correctly defined on the workers? Have you tried a different example?

conroylau commented 4 years ago

One solution that I can think of is to use the future.globals option in the future_lapply function. If the user-defined functions and objects that are used inside the lpmodel object (in this case, betaobs and dgp) are passed as a list to future.globals, then the inference procedures run without error. So, I am wondering whether an appropriate solution is to ask the user to pass them to the inference procedure directly if they want to use more than one worker.

Taking the dkqs procedure as an example, I can add an ellipsis (...) to the function arguments so that the user-defined functions and objects can be passed to the function. The data from issue #44 and the revised dkqs code can be found here.

For instance, since betaobs and dgp are the user-defined function and object used inside the beta.obs component of lpmodel, users can pass them to the revised dkqs function as follows if they use more than one worker:

plan(multisession, workers = 2)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1, betaobs = betaobs, dgp = dgp)

If they use only one worker, then they do not need to specify them, and the following code runs properly:

plan(multisession, workers = 1)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1)

Do you think this solution makes sense? Thanks!

a-torgovitsky commented 4 years ago

Re: progress bar, I think I misread what you wrote before:

So, I am unable to integrate a progress bar into the FSST procedure, which consists of several future_lapply calls to compute the objects (e.g. beta.star, bootstrap cone test statistics).

Does this mean that you could have different progress bars for each section of the procedure? E.g. one for beta.star, one for the bootstrap cone test, etc.? That would be preferable to no progress bar.


Regarding the other problem, that doesn't seem like such a great solution to me. Have you tried another example besides the one from issue #44? Maybe this is something to do with how I defined the functions there.

conroylau commented 4 years ago

Regarding the progress bar, yes, that is what I meant. There can be one bar for each of the procedures that use the future_lapply function, and each of them has to start from 0%.

Regarding the other issue, I have tried other examples. In the following example, since the betaobs function does not use any other user-defined functions or objects defined outside betaobs, there is no error.

rm(list = ls())
library(future)
library(lpinfer)

N <- nrow(sampledata)
J1 <- length(unique(sampledata[,"Y"]))
yp <- seq(0, 1, .1)

# Construct objects in `lpmodel`
Aobs.full <- cbind(matrix(rep(0, J1 * J1), nrow = J1), diag(1, J1))
Atgt <- matrix(c(yp, yp), nrow = 1)
Ashp <- matrix(rep(1, ncol(Aobs.full)), nrow = 1)

# Construct beta.obs (a function)
betaobs <- function(data) {
  beta <- NULL
  yp <- seq(0, 1, .1)
  n <- nrow(data)
  for (i in seq_along(yp)) {
    beta.i <- sum((data[,"Y"] == yp[i]) * (data[,"D"] == 1))/n
    beta <- c(beta, c(beta.i))
  }
  beta <- as.matrix(beta)
  return(list(beta = beta,
              var = diag(length(yp))))
}

# Define the lpmodel object
lpm.full <- lpmodel(A.obs    = Aobs.full,
                    A.tgt    = Atgt,
                    A.shp    = Ashp,
                    beta.obs = betaobs,
                    beta.shp = c(1))

# No error if more than one worker is used
plan(multisession, workers = 2)
r <- dkqs(sampledata, lpmodel = lpm.full, beta.tgt = .36, tau = .1)

Thanks!

a-torgovitsky commented 4 years ago

I'm a bit confused. Your debug44-72.R script looks like this:

rm(list = ls())
library("tidyverse")
library("lpinfer")
library("future")

load("issue44.RData")

betaobs <- function(data, dgp) {
  w <- dplyr::as_tibble(dgp$wdist[,-1])
  cn <- colnames(w)
  w$bin <- seq_len(nrow(w))
  d <- dplyr::right_join(w, data, by = cn) # merge the bin indicator into data
  r <- lm(data = d, y ~ factor(bin) + 0)
  vc <- sandwich::vcovHC(r, type = "HC1")
  if (length(r$coefficients) != nrow(w)) {
    print(r$coefficients)
    stop("Sample probability is not defined, need to quit.")
  }
  return(list(betaobs = r$coefficients, betaobs_vc = vc))
}
b <- function(d) betaobs(d, dgp)

lpm <- lpinfer::lpmodel(A.obs = Aobs, beta.obs = b,
                        A.shp = Ashp, beta.shp = 1,
                        A.tgt = Atgt)

plan(multisession, workers = 1)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1) # ok

plan(multisession, workers = 2)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1) # not ok

plan(multisession, workers = 2)
set.seed(1)
r <- dkqs(data, lpm, solver = "gurobi", beta.tgt = .1, tau = .1, 
          betaobs = betaobs, dgp = dgp) # ok

How is that different from the one you just posted? It doesn't look like this betaobs function depends on any user-created objects, unless I am missing one.

conroylau commented 4 years ago

In the debug44-72.R script, the function that is passed to the beta.obs component of lpmodel is the function b. The function b depends on the user-defined function betaobs and the user-defined object dgp. Hence, if I simply pass the object lpm to future_lapply without specifying the argument future.globals = list(betaobs = betaobs, dgp = dgp), the function returns an error saying that it is unable to find those objects.

On the other hand, in my last comment, betaobs is the function that is passed to the beta.obs component of lpmodel. Everything inside that betaobs function is hard-coded except data, so there is no error.

a-torgovitsky commented 4 years ago

Got it, thanks for the clarification. What if you specify

future.globals = list(betaobs = betaobs, dgp = dgp)

before calling dkqs? i.e., we leave it up to the user to do this outside of our routines. Does that work?

Alternatively, have you seen this? https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html under "Missing globals" That looks like this problem with lpm playing the role of x, doesn't it? Or perhaps with lpm$beta.obs playing the role of x. Does the simple workaround provided there work here?


Also, regarding the progress bar, we should just have one for each sub-procedure then. That is better than not having anything.

conroylau commented 4 years ago

future.globals is an argument in the future_lapply function, so it cannot be specified before calling our procedures.

I think the workarounds there do not work because x cannot be inserted at the front of future_lapply as in the future command. On the other hand, it seems that the method of specifying the globals in the link is similar to specifying future.globals in future_lapply.

I have just thought of another workaround for the issue: pass the lpmodel object directly through the X argument of future_lapply, where X is the list that future_lapply iterates over. I have tried this in a toy example and it works. The toy example is as follows:

# Function and object defined outside the `c` function
a <- function(x) x^2
b <- 1

# The function below is like the `b` function in the debug44-72.R script
c <- function(x) {
  a(x) + b
}

# The following is like a function that represents a bootstrap replication in the procedure
bs <- function(x, f) f(x)^2

# Example 1: The following does not work
future.apply::future_lapply(1:2,
                            FUN = bs,
                            f = c)

# Example 2: The following works
future.apply::future_lapply(1:2,
                            FUN = bs,
                            f = c,
                            future.globals = list(a = a, b = b))

# Example 3: Alternatively, if I append the function c to each element of 1:2, I do not need to specify future.globals
x.list <- list(list(1, c), list(2, c))
bs.new <- function(x) (x[[2]](x[[1]]))^2
future.apply::future_lapply(x.list,
                            FUN = bs.new)

In example 1, the error message is Error in a(x) : could not find function "a". In examples 2 and 3, the output is:

[[1]]
[1] 4

[[2]]
[1] 25

Do you think the workaround of example 3 is fine?

Re the progress bar, sure I will add them back for each sub-procedure. Thanks!

conroylau commented 4 years ago

Re the progress bar, I am implementing a bar for each sub-procedure using the progressr package.

Since the bar will start from 0% again when it moves to another part of the procedure, or when it is evaluating extra bootstrap replications, I plan to display a short phrase on the RHS of the bar describing the step it is running, to avoid confusion. The phrase could be "Computing xxx bootstrap replications" or "Computing xxx cone test statistics", where xxx is the total number of replications.

If there are some errors in the first R bootstrap replications so that extra replications are required, it will print "Computing xxx extra bootstrap replications".
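As a rough sketch of how this could be wired up with progressr and future_lapply (illustrative only, not the exact lpinfer implementation; the replication body is a stand-in):

library(progressr)
library(future.apply)
run_part <- function(R, label) {
  p <- progressor(steps = R)
  future_lapply(seq_len(R), function(i) {
    p(sprintf("Computing %d %s", R, label))  # phrase shown alongside the bar
    mean(runif(100))                         # stand-in for one replication
  }, future.seed = TRUE)
}
with_progress({
  res1 <- run_part(500, "bootstrap replications")
  res2 <- run_part(500, "cone test statistics")
})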

A sample output is shown in the following gif when the dkqs procedure is used:
(gif: sample progress bar output for the dkqs procedure)

Do you think this would be okay? Thanks!

a-torgovitsky commented 4 years ago

Example 2 looks like a great solution: it is very clean and uses future_lapply in the way it was intended. Example 3 looks a little more like a hack. Is there a drawback to Example 2?


Yes I think your progress bar looks great!

conroylau commented 4 years ago

For example 2, I think the drawback is that future.globals has to be something specified by the user. This is because we will not know in advance the names of the functions or objects the user will use inside the lpmodel object (these correspond to the objects a and b in the example). Hence, I think we would need to add another optional argument to the testing procedures for the users to provide the names and pass them to future.globals.

a-torgovitsky commented 4 years ago

I see. That's not great. In Example 3 you do not need to do this?

conroylau commented 4 years ago

Yes, I do not need to specify future.globals in example 3. I have just tried to implement the method in example 3 in the dkqs code and it works without specifying future.globals too.

a-torgovitsky commented 4 years ago

But in that case is this code something the user will need to specify?

x.list <- list(list(1, c), list(2, c))
bs.new <- function(x) (x[[2]](x[[1]]))^2

Or is that something you can do programmatically within DKQS?

The overarching concern here is we don't want to make life difficult for the user.

conroylau commented 4 years ago

This is something that can be done within dkqs. Sorry for being unclear on which are the objects being created by the user in the example.

In the toy example:

In example 3, I rewrite bs as bs.new (equivalently, I rewrite a function inside the dkqs procedure). The object x.list appends the function c (equivalently, the lpmodel object in dkqs) to each element of the list of indices 1:2 in the example. This step is also done inside the dkqs code.

So I think this change can be done programmatically within dkqs.

a-torgovitsky commented 4 years ago

Ok, thanks -- I think I get it now. I guess one downside is that if the lpmodel object is large, then we are creating one copy of it per BS draw. Is that right?

conroylau commented 4 years ago

Yes, I think so. On the other hand, I am thinking that the memory issue with a large lpmodel object could be eliminated if I do not create x.list explicitly, i.e. in the example above, instead of creating the x.list object and then passing it to future_lapply, I create this object in the function argument as follows:

bs.new <- function(x) (x[[2]](x[[1]]))^2
future.apply::future_lapply(lapply(as.list(1:2), FUN = function(x) append(x, c)),
                            FUN = bs.new)

Do you think this would help?

a-torgovitsky commented 4 years ago

I'm not sure, I think you'd want to check the memory usage formally with some tool to see.

More generally, it seems like this is a common design issue that other people using the future packages must be running into. Maybe you should ask for help on StackExchange? I often get good advice there if I formulate a clear question and minimal working example.

conroylau commented 4 years ago

I tried profiling the memory allocated and deallocated under the two programming methods via the profvis package. Below, I refer to the method where I first create an additional object that combines the list of indices with lpmodel (i.e. the object x.list in the above example) and then pass it to future_lapply as method A. I refer to the method where I use a lapply call inside the future_lapply function to combine the indices with lpmodel without explicitly creating a new object (i.e. the method mentioned in my last comment) as method B.

According to the profiling output (please see here), the memory allocated and deallocated is slightly lower in method B when I compare the entire dkqs function. On the other hand, if I only focus on the part that uses the future_lapply function, the memory allocated in the two methods is similar, while the memory deallocated is smaller in method B.

I also ran a Monte Carlo simulation to check the time used by the two methods. I repeat the dkqs procedure 100 times, set R = 500, and use two workers in future. The plot is as follows.
(image: distribution of run times for methods A and B)
It looks like these two methods take roughly similar time.

a-torgovitsky commented 4 years ago

Ok, this is good, but I think we also want to check how many copies R is making in the background. For example, see http://adv-r.had.co.nz/memory.html. Skimming that, it looks like we might be in luck, in that R is smart enough not to create multiple copies of an lpmodel. But we want to be sure before implementing. These lpmodels could in principle be very, very large in some applications, so we don't want bsreps = 500 of them hanging around.
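One possible check (a sketch only; tracemem() is a base R tool that reports duplications when R is built with memory profiling, as the CRAN binaries are):

big <- matrix(rnorm(1e6), nrow = 1000)           # stand-in for a large lpmodel component
tracemem(big)
x.list <- lapply(1:5, function(i) list(i, big))  # any copies made here would be reported
untracemem(big)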


However, I really think you should check through StackExchange or even via a post on the future github repo to make sure this is the right way to go. We might be missing something important.

conroylau commented 4 years ago

I see. Sure, I am formulating a minimal working example to demonstrate this issue when the future_lapply function is used inside a package. I will post it on the future GitHub repo.

I also checked on StackExchange, and some suggested adding the names of the global variables to plan(...). However, it does not work in my examples, so I will include them in the post too.

a-torgovitsky commented 4 years ago

Ok, great. I expect you will get a very high-quality answer if you have a clear MWE.

The StackExchange solution is not great (even if it did work) since it requires substantial thinking from the user. We want to make it as easy for them as possible.

conroylau commented 4 years ago

I have just added back the progress bar to all the procedures in a format that is similar to the sample gif that I have posted earlier.

a-torgovitsky commented 4 years ago

What was the status of this? Did we arrive at a good solution and was it implemented?

conroylau commented 4 years ago

I have posted an issue on the future.apply GitHub repository, but the author has not replied yet.

a-torgovitsky commented 4 years ago

Ok. I would like to make the repository public in a few days. Can you implement the solution that you proposed above? Then we can change it later if we find there is a better way to use future.apply.

conroylau commented 4 years ago

Sure, I will implement the solution above in the testing procedures and check them against the examples that had problems before.

conroylau commented 4 years ago

Done! I have just implemented the temporary solution. I have tested that it works in the four testing procedures that use future_lapply. They are now able to return p-values in the above test case when the number of workers is more than 1. Thanks!

a-torgovitsky commented 4 years ago

Excellent, great work. Let's revisit this in case you hear back from the developer of future with a better suggestion.

conroylau commented 4 years ago

Sure, no problem!