HenrikBengtsson / future.callr

:rocket: R package future.callr: A Future API for Parallel Processing using 'callr'
https://future.callr.futureverse.org

future.callr or future.batchtools with RAppArmor #13

Closed: vnijs closed this issue 3 years ago

vnijs commented 3 years ago

I'm starting to review your amazing work in the "future" series. I have been working on an extension of the mini app linked below, where students can submit answers to multiple-choice, numeric, and open-ended questions, and can also run code in R, Python, and SQL through knitr. I know about learnr, but it doesn't (yet) fit my needs for testing and grading.

For the code questions in R, Python, or SQL, I'd like to run knitr in a separate process but also in a specific environment. I then need to get the (changed) environment back for testing purposes as well as the HTML returned by knitr. Based on my initial review it seems that this will all work nicely with future.callr. But then I noticed that you also have future.batchtools which has a batchtools_local option. Are there any important differences between future.callr and future.batchtools that might push me to choose one over the other?
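Roughly the pattern I have in mind is sketched below (a minimal sketch with illustrative names; the real app wraps this in a helper and converts the markdown to HTML):

    library(future)
    library(future.callr)
    plan(callr)

    student_code <- "x <- 1:10\nmean(x)"  # stands in for a submission

    f <- future({
      env <- new.env()
      # knit the submitted code in this separate process, capturing both the
      # rendered output and the environment the code modified
      md <- knitr::knit(text = sprintf("```{r}\n%s\n```", student_code),
                        envir = env, quiet = TRUE)
      list(output = md, objects = as.list(env))
    }, globals = list(student_code = student_code))

    res <- value(f)
    names(res$objects)  # e.g. "x", available for testing and grading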

The last thing I wanted to ask about was RAppArmor. I want to restrict access to certain files and directories when student code is run, so that students can't sneak a peek at the solutions or the code tests; we want to be able to use this for graded assignments. I don't want to restrict the shiny process the main app runs in, just the new processes started by either future.callr or future.batchtools. Would either of these future options work better (or worse) with RAppArmor on Linux?

Any advice you have would be very welcome. Unfortunately, I can't make the full shiny app public yet; the minimal example linked below is what I can share.

https://github.com/vnijs/quizr

HenrikBengtsson commented 3 years ago

> Based on my initial review it seems that this will all work nicely with future.callr. But then I noticed that you also have future.batchtools which has a batchtools_local option. Are there any important differences between future.callr and future.batchtools that might push me to choose one over the other?

No need to use future.batchtools. I'd say you could also skip future.callr and just use the built-in plan(multisession); that avoids any extra package dependencies. From the "outside" they all behave the same, because they all comply with the Future API.
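For example, the exact same future code runs under any of these backends; only the plan() line changes:

    library(future)

    plan(multisession)                          # built into 'future'
    # plan(future.callr::callr)                 # one fresh R process per future
    # plan(future.batchtools::batchtools_local) # via 'batchtools'

    f <- future(Sys.getpid())
    value(f)  # PID of a background R process, whichever backend is used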

> The last thing I wanted to ask about was RAppArmor. ...

I have no experience with RAppArmor, so I can't say. But all of the above future backends run in external R processes and have globals exported to those processes, so, roughly, they all share the same pros and cons.
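Conceptually, though, since the future expression is evaluated in the worker process, any confinement applied inside it should affect only that worker, not the main shiny process. A hypothetical, untested sketch (the profile name "r-student" and its setup are assumptions; see the RAppArmor documentation):

    library(future)
    plan(future.callr::callr)

    student_code <- "readLines('solutions/q1.R')"  # should be denied by the profile

    f <- future({
      # Switch this worker (not the main shiny process) into a pre-installed
      # AppArmor profile before evaluating student code; the change is
      # irreversible for the lifetime of this process.
      RAppArmor::aa_change_profile("r-student")
      eval(parse(text = student_code))
    }, globals = list(student_code = student_code))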

vnijs commented 3 years ago

Finally got around to using future in my project and I love it!

I do have a question about how future does (or does not) remember/cache results from a prior run when I use plan(multicore). The screenshot below is from a shiny + knitr app where students enter and run code to complete an assignment. Some of the questions are connected so that, for example, Question 0.02 can use variables created in Question 0.01. In this example, that connection happens by simply combining the code from the two questions and running it.

So here is my question: if I run Q1 with the x <- 3 line uncommented, the list of objects is c("pre_sql", "x"). If I then run Q2, go back to Q1, comment out the x <- 3 line, and re-run the code, I get c("pre_sql", "x", "y", "z") with plan(multicore) but only c("pre_sql") with plan(multisession). FYI, "pre_sql" is part of the environment used by design and is defined previously. I'd like to use plan(multicore) since you mention it can be more efficient on non-Windows systems, but I would prefer to have code re-run in a clean environment each time.
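A stripped-down sketch of the kind of difference I mean, separate from my app:

    library(future)

    x <- 3  # created in the main R session

    plan(multicore)     # forks the current process (not available on Windows)
    value(future(exists("x")))
    #> TRUE -- the forked child inherits everything in the parent session

    plan(multisession)  # fresh background sessions; only detected globals are exported
    value(future(exists("x")))
    #> FALSE -- 'x' is referenced via a string here, so it is not exported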

Is there a way to turn off this memory/caching in plan(multicore), or am I missing something? An example code chunk from my app is shown below in case that helps. FYI, the knit_it function mentioned there just does some editing/combining of the student-submitted code and then knits it into an HTML file.

Interestingly, I have the same issue with both plan(multicore) and plan(multisession) when the coding challenges are in Python, using reticulate.

[Screenshot: the shiny + knitr app showing the connected questions 0.01 and 0.02 and the objects each run returns]

    # Run knitr on the submitted code in a separate R process; everything the
    # future expression (and knit_it) needs is passed explicitly via 'globals'.
    future::future({
      if (type == "python") library(reticulate)
      html <- knit_it(code, allow = allow, type = type,
                      include_code = include_code, envir = envir, checks = checks)
      tagList(br(), html)
    }, globals = list(
      knit_it = knit_it,
      code = code,
      allow = r_ssuid %in% getOption("eval_code", default = "nobody"),
      include_code = include_code,
      type = type,
      envir = envir,
      checks = checks,
      # helpers used inside knit_it() and when building the UI
      is_empty = radiant.data::is_empty,
      tagList = shiny::tagList,
      br = shiny::br,
      HTML = shiny::HTML,
      `%>%` = dplyr::`%>%`
    ), seed = TRUE)
vnijs commented 3 years ago

Looks like the answer to my question is in the future.callr vignette (linked below). There is indeed a noticeable delay with future.callr compared to plan(multisession) from the future package. I tried future.callr in my application running in a Docker container from RStudio and it works fine. For some reason, however, the call to knit_it in the example I shared previously consistently fails with future.callr and plan(callr) on our Linux server, while the exact same code runs fine with plan(multisession). I added the error messages from the logs below, but they are uninformative, at least to me. If you have any suggestions on how I might debug this issue, please let me know.

https://cran.r-project.org/web/packages/future.callr/vignettes/future.callr.html

"When using callr futures, each future is resolved in a fresh background R session which ends as soon as the value of the future has been collected. In contrast, multisession futures are resolved in background R worker sessions that serve multiple futures over their life spans. The advantage with using a new R process for each future is that it is that the R environment is guaranteed not to be contaminated by previous futures, e.g. memory allocations, finalizers, modified options, and loaded and attached packages. The disadvantage, is an added overhead of launching a new R process. (At the moment, I am neither aware of formal benchmarking of this extra overhead nor of performance comparisons of callr to alternative future backends.)"

  82: stop
  81: <Anonymous>
  80: onFulfilled
  78: onFulfilled
  76: onFulfilled
  74: func
  69: contextFunc
  68: env$runWith
  61: ctx$run
  60: onFulfilled
  59: onFulfilled
  57: onFulfilled
  55: onFulfilled
  54: onFulfilled
  52: onFulfilled
  50: onFulfilled
  49: func
  44: contextFunc
  43: env$runWith
  36: ctx$run
  35: onFulfilled
  34: onFulfilled
  24: f
  23: FUN
  22: lapply
  21: <Anonymous>
From earlier call:
  127: domain$wrapOnFulfilled
  126: promiseDomain$onThen
  125: action
  118: promise
  117: promise$then
  116: then
  115: %...>%
   99: renderPrint
   98: func
   82: origRenderFunc
   81: output$quiz_submit0.01
    1: runApp
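One generic way to get more detail out of the future framework itself is its built-in debug mode, which logs each step of exporting globals, launching the worker, and collecting the value:

    library(future)
    options(future.debug = TRUE)  # verbose step-by-step logging

    plan(future.callr::callr)
    f <- future(1 + 1)
    value(f)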
vnijs commented 3 years ago

I just upgraded to version 0.6.0 of the future.callr package and (1) plan(callr) now works on our Ubuntu 20.04 server, and (2) evaluating code seems quite a bit faster than before. Closing this issue. Thanks again for the excellent future packages!

HenrikBengtsson commented 3 years ago

Thank you, and thanks for reporting back. Good to hear it works for you now. FYI, I don't see anything in future.callr 0.5.0 (2019-09-27) -> 0.6.0 (2021-01-02) that would make a difference, so I'm quite sure it was some other update or change in your environment that caused it to start working.