HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

WISH: Futurized *apply() functions #21

Closed HenrikBengtsson closed 8 years ago

HenrikBengtsson commented 9 years ago

Should we add something like

library("future")    ## provides the %<=% future-assignment operator (later renamed %<-%)
library("listenv")   ## provides listenv()

flapply <- function(x, FUN, ..., AS.LIST = FALSE) {
  res <- listenv()
  for (ii in seq_along(x)) {
    res[[ii]] %<=% FUN(x[[ii]], ...)
  }
  names(res) <- names(x)

  ## Assert that 'x', 'FUN' and 'ii' were exported to the future
  ## environments; removing them here must not affect the futures
  rm(list = c("x", "FUN", "ii"))

  ## Return a listenv of futures, or a list of values; the latter blocks
  if (AS.LIST) res <- as.list(res)

  res
}

to the future API?
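
For example, usage could look something like this (a quick sketch, assuming flapply() as defined above and a multiprocess-capable plan):

plan(multiprocess)

x <- list(a = 1:5, b = 6:10)

## Non-blocking: returns a listenv of futures
res <- flapply(x, FUN = sum)

## Blocking: as.list() resolves all futures and returns their values
vals <- flapply(x, FUN = sum, AS.LIST = TRUE)
## vals$a == 15 and vals$b == 40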

russellpierce commented 8 years ago

It might be prudent to add a warning if the plan is multicore, but otherwise, that would seem useful.

HenrikBengtsson commented 8 years ago

I'm not sure I understand: why would we need a warning if multicore futures are used?

russellpierce commented 8 years ago

Never mind; I see you are silently capping the number of jobs the user spins up so that they don't fork-bomb themselves.

HenrikBengtsson commented 8 years ago

Yes, I took a conservative approach by inserting a plan(eager) at the beginning of every multicore expression.
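
Conceptually, it is as if each multicore future is evaluated like below (a rough sketch of the idea, not the actual internals):

library("future")
plan(multicore)

f <- future({
  ## As if the package prepends this, so that nested futures resolve
  ## sequentially in the worker instead of forking recursively:
  plan(eager)
  g <- future(Sys.getpid())  ## evaluated eagerly in this worker
  value(g)
})
value(f)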

There's also parallel::mcaffinity(); I don't have much experience with it, but maybe support for it (and similar features) should be added at some point. See also https://stat.ethz.ch/pipermail/r-devel/2012-December/065313.html.

I also try to make future adaptive to the number of cores allocated to the R session, which is not always the same as the total number of cores on the machine, cf. availableCores(). See https://github.com/HenrikBengtsson/future/issues/22.
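
For instance (a small sketch; the 'workers' argument here is my assumption):

library("future")
## availableCores() respects settings such as options(mc.cores) and
## HPC scheduler allocations, rather than assuming all cores are ours
plan(multiprocess, workers = availableCores())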

HenrikBengtsson commented 8 years ago

Just writing down my thoughts: it could be that there are so many different f*apply() approaches and strategies (e.g. various types of splitting/chunking) that it would make sense to have a separate package on top of future, e.g. fplyr.

HenrikBengtsson commented 8 years ago

In addition to lapply(), another common need will be apply() with futures, cf. parallel::parApply() etc. Maybe the following is good enough for now?

fapply <- function(X, MARGIN, FUN, ...) {
  fFUN <- function(...) { future(FUN(...)) }
  res <- apply(X, MARGIN = MARGIN, FUN = fFUN, ...)  ## one future per slice
  res <- values(res)  ## collect values; blocks until all futures are resolved
  sapply(res, FUN = I, simplify = TRUE)
}
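
For the record, a quick sanity check (assuming the MARGIN pass-through above):

library("future")
plan(multiprocess)

X <- matrix(1:20, nrow = 4)

## One future per row; should agree with apply(X, MARGIN = 1, FUN = sum)
fapply(X, MARGIN = 1, FUN = sum)
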
HenrikBengtsson commented 8 years ago

Ah, we need to be careful about what we're exporting. In particular, using

res[[ii]] %<=% FUN(x[[ii]], ...)

or

res[[ii]] <- future(FUN(x[[ii]], ...))

will require that all of x is exported. It's more efficient to subset outside the future expression, i.e.

x_ii <- x[[ii]]
res[[ii]] %<=% FUN(x_ii, ...)

and

x_ii <- x[[ii]]
res[[ii]] <- future(FUN(x_ii, ...))

See https://github.com/ilarischeinin/QDNAseq/pull/1 for a real-world example.
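
Putting this together, a revised flapply() might look like the following (a sketch; flapply2 is a hypothetical name, and it assumes future and listenv are loaded as above):

flapply2 <- function(x, FUN, ..., AS.LIST = FALSE) {
  res <- listenv::listenv()
  for (ii in seq_along(x)) {
    x_ii <- x[[ii]]                ## subset outside the future expression ...
    res[[ii]] %<=% FUN(x_ii, ...)  ## ... so only 'x_ii' is exported, not all of 'x'
  }
  names(res) <- names(x)
  if (AS.LIST) res <- as.list(res)
  res
}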

HenrikBengtsson commented 8 years ago

Should we have a separate vignette on "best practices"?

HenrikBengtsson commented 8 years ago

UPDATE: I've created the doFuture package, which brings future support (pun!) to foreach, which in turn brings full future support to plyr. In other words, any type of future can be used for plyr:ing, e.g.

library("doFuture")
registerDoFuture()
plan(multiprocess)

library("plyr")
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
llply(x, quantile, probs = 1:3/4, .parallel=TRUE)

Obviously, the *ply() functions are unlikely to be as efficient (in memory and speed; mostly memory) as highly customized apply functions that are future-aware, but this is certainly a good start and it opens up a huge, well-established API.
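
And since this goes through foreach, the %dopar% operator itself also becomes future-powered, e.g.:

library("doFuture")
registerDoFuture()
plan(multiprocess)

library("foreach")
## Each iteration is evaluated via a future on the registered plan
y <- foreach(ii = 1:3) %dopar% sqrt(ii)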

HenrikBengtsson commented 8 years ago

Will refrain from creating a *ply API. For now, plyr can be used.