Open tdeenes opened 4 years ago
Thanks for this. I'm happy to see this - it looks like you have identified the essence of the Future API and how I see its role in the R ecosystem. Some comments:
When I started this project, I debated whether I should release it as two packages: 'future' and 'future.parallel', where the 'future' package would provide a lightweight, non-functional Future API and 'future.parallel' would implement the parallel backends that we can get from the 'parallel' package. I went back and forth and settled on one merged package 'future' because (a) 'parallel' comes with every R installation and (b) it would be less confusing for beginners, e.g. should they do library(future) or library(future.parallel), or both?
Later, I did indeed play with the idea of pulling a 'future.api' package out of the current 'future', which would pretty much serve the above idea of a 'future' and a 'future.parallel' package. I even spent a fair amount of time trying to implement this migration. At least at the time, I ended up with circular dependencies that would have made it really complicated to roll out this change on CRAN. I didn't shut down the idea, but I left it on my back burner.
Having a pure 'future.api' package that basically provides an abstract implementation of the Future API, mostly consisting of help pages, generic functions, and some helper functions, would help clarify what futures are all about. In my dreams, having such a thin Future API might increase the chances of it one day making it into base R. The way I think of it is that the 'parallel' package could build on top of that and not the other way around, e.g. parallel::mcparallel() is basically a future() call and parallel::mccollect() is basically a value() call.
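To make the analogy concrete, here is a minimal side-by-side sketch. It assumes a Unix-like system (forking is not available on Windows) and that the 'future' package is installed; the variable names are only for illustration.

```r
library(parallel)
library(future)

# Base R 'parallel': fork a child process, then collect its result.
job <- mcparallel(sum(1:10))
res_parallel <- mccollect(job)[[1]]   # mccollect() returns a list of results

# Future API: the same two steps; the backend is chosen separately via plan().
plan(multicore)
f <- future(sum(1:10))
res_future <- value(f)

plan(sequential)                      # reset the backend afterwards
```

Both calls evaluate the same expression in a forked process; the difference is that with futures the choice of backend lives in plan() rather than in the calling code.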
To your specifics about the existing dependencies:
'digest': The lowest-hanging fruit is definitely the dependency on the 'digest' package. I agree that, for what it is used for, it's "lots" to bring in. I decided on it until a better solution exists. On the upside, most users already have it installed for other reasons; it's one of the most downloaded packages out there. The best solution would be to have a simple checksum function in base R. Believe it or not, we have had tools::md5sum() for many, many years. Unfortunately, it only takes files as input. It does not take an in-memory object as input, and not even a connection. If we could get R Core to expose this algorithm for non-files, that would be great. This is why I have created https://github.com/HenrikBengtsson/Wishlist-for-R/issues/21. I just haven't had the energy to create a solid proposal to R Core/the R-devel list. If you have the extra time, please feel free to pick up that ball.
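For illustration, a base-R-only workaround is to serialize the object to a temporary file and checksum that file. This is a rough sketch, not what 'future' or 'digest' does; md5_object() is a hypothetical helper name, and the checksum covers the serialization header too, so it should only be compared within the same R version.

```r
# Hypothetical helper (not part of base R): MD5 checksum of an in-memory object.
md5_object <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))                # remove the temporary file on exit
  saveRDS(x, tmp, compress = FALSE)   # serialize the object to a file
  unname(tools::md5sum(tmp))          # file-based checksum from base R
}

h1 <- md5_object(mtcars)
h2 <- md5_object(mtcars)              # same object, same checksum
h3 <- md5_object(head(mtcars))        # different object, different checksum
```

The round trip through the file system is exactly the overhead that a checksum function accepting raw vectors or connections would avoid.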
'listenv': Yes, this is only needed to support "future assignments" via the y %<-% expr operator. This operator is one teeny step up from the basic building blocks future(), resolve() and value(). So, if there one day is a 'future.api' package, I'm not sure if %<-% will be part of that or not. A lot of these decisions are tradeoffs between purity/design philosophy and practicality.
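As a small sketch of how the operator relates to the building blocks (assuming the 'future' package is installed; the variable names are illustrative):

```r
library(future)
plan(sequential)

# Explicit building blocks: create the future, wait for it, read its value.
f <- future({ 1 + 2 })
resolve(f)
y1 <- value(f)

# Future assignment: the same in one step; y2 is resolved when first used.
y2 %<-% { 1 + 2 }
```

The braces around the expression matter because %<-% binds more tightly than arithmetic operators such as +.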
'globals': This one is harder. If this becomes a soft dependency, it opens the door to alternative solutions for identifying globals automagically. Maybe that is a good thing, but I don't think it is at this stage. If futures start behaving differently depending on what method is used for identifying globals, things are quickly going to go south. In contrast, the 'foreach' framework is slowly harmonizing toward 'future' on how globals are identified (https://github.com/RevolutionAnalytics/foreach/issues/2), which I think benefits the whole parallel community.
Now, could 'listenv' and 'globals' be listed under 'Suggests:'? Possibly. I need to think about this, but a major disadvantage would be that install.packages("future") would not install those by default and then things would not work for people out of the box, causing confusion and so on. The take on this might be very different if there's ever a standalone 'future.api' that 'future' lives on top of.
BTW, having a pure 'future.api' would drop the dependencies on the 'parallel' package too.
Having said all of the above, what's your need/background for this issue/feature request/discussion?
Thank you Henrik for the very detailed response. It seems we are on the same track :)
digest: Yes, I am aware of the existence of tools::md5sum, I even use it in some of my internal packages. I was also surprised that tools::md5 does not exist - your wishlist entry is a nice summary of the issue. Unfortunately I am not versed well enough in C to make such a contribution on my own; however, looking at the source code, it is really hard to see why only the file-based API is exposed, since all the building blocks seem to be implemented already. Nevertheless, I can try to send an e-mail to R-devel and ask what the reason might be.
listenv: I consider %<-% a utility feature - definitely not a must-have in a low-level API.
globals: I understand your concerns, but globals already has three strategies ("ordered", "conservative", "liberal") to identify global variables. Note also that, unfortunately, a lot of packages use options in a way that affects the return value of an expression. Since options and environment variables are not exported by default in any of the globals methods (btw, this might be worth an issue in the globals package), the result of a future might depend on which plan is used ("multicore" returns the same as "sequential", the other ones do not).
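A base-R analogy of the options problem, using a PSOCK cluster from 'parallel' as a stand-in for a non-forked backend. This is only a sketch: it assumes the worker starts with default options (i.e. no startup profile changes digits) and it does not involve 'future' itself.

```r
library(parallel)

old <- options(digits = 3)     # change an option in the main session only
local_value <- format(pi)      # formatted with digits = 3

cl <- makeCluster(1)           # fresh R worker process (PSOCK)
# The same expression evaluated on the worker uses the worker's own options.
worker_value <- clusterEvalQ(cl, format(pi))[[1]]
stopCluster(cl)

options(old)                   # restore the original setting
```

Here local_value and worker_value will generally differ because the option was never shipped to the worker, whereas a forked process would inherit it.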
> Now, could 'listenv' and 'globals' be listed under 'Suggests:'? Possibly. I need to think about this, but a major disadvantage would be that install.packages("future") would not install those by default and then things would not work for people out of the box, causing confusion and so on.
I think future is used by so many packages and users nowadays that such a change would cause havoc. This is why I suggested the introduction of future.api as a separate package that future would import.
1.1 The fewer dependencies the better: I do favor the tinyverse approach. In a production context, any extra dependency just causes additional pain - if I specify the global variables "by hand", why should I depend on the globals package? (IMPORTANT: It might seem that I do not like the globals package. On the contrary, I think it is a great package which works pretty well in the majority of my use cases. The point is that even if a package is well written and useful in general, it should not be added as a dependency of another package unless it is absolutely necessary. Just think of the tidyverse packages, which cross-reference each other for tiny functionalities.)
1.2 The more granular the better: Unless the underlying code base dictates otherwise, an R package should focus on one particular problem. So instead of having one package with 5 broad functionalities, it is better to have 5 separate packages, one for each functionality, plus maybe an extra "umbrella" package for the lazy users (cf. library(tidyverse)). This might contradict 1.1, but the point is that a package developer or a power user does not want to depend on a code base which she/he never uses. So the number of dependency packages might increase, but the total complexity of the dependencies will be much lower.
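For reference, specifying the globals "by hand" as mentioned in 1.1 could look like the sketch below. It assumes the 'future' package is installed and uses the globals argument of future() with a named list; the variable names are illustrative.

```r
library(future)
plan(sequential)

a <- 2

# Name the globals explicitly instead of relying on automatic identification.
f <- future(a * 10, globals = list(a = a))
res <- value(f)
```

In the current package layout 'globals' is still imported even when it is bypassed like this, which is the point of the argument above.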
As you just mentioned, I hope future could be part of base R. This would be awesome. The core API is great both in terms of usability and flexibility, the code base itself is very R-ish (no tricks, no C magic, so seemingly "easy" to maintain), and the functionality it provides is very important nowadays, so this hope might become a reality if all "extras" can be stripped off. I know that the R Core Team is very hesitant to include a user-contributed package in base R, but maybe this particular package could be an exception.
In addition - currently just for fun, but motivated by some internal packages that I had to develop in a production setup - I am playing with the idea of a package, or rather a set of related packages, which would take the "deferred execution" idea to the extreme. Futures, caching, reactive expressions etc. are just different sides of the same coin: I want to run this specific code (with all nested calls, R and system dependencies etc. included) on this particular data (which can of course be the result of a call on another set of data). I want to invalidate the result and re-run the code if either the code or the data changes.
If I want to be totally general, I even have to define how to recreate the computational context (e.g., spin up a docker container with all versioned system and package dependencies), serialize the whole data and recreate the whole call chain. But if the user knows the context, she can take shortcuts - the package shall not control that part, the package shall just provide the possibility to do it. Exactly as you do in future or in progressr: it is the user who defines the plan, the globals to be exported etc.
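A toy, base-R-only sketch of the "re-run only if the code or the data changes" idea, leaving out everything about recreating the computational context. All names here (cache, key_of(), run_cached()) are hypothetical and only meant to illustrate the principle.

```r
cache <- new.env()

# Hypothetical helper: MD5 of the serialized (code, data) pair, via a temp file.
key_of <- function(code, data) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  saveRDS(list(code = deparse(code), data = data), tmp, compress = FALSE)
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: evaluate 'expr' on 'data' (a named list), reusing the
# stored result when neither the code nor the data has changed.
run_cached <- function(expr, data) {
  caller <- parent.frame()
  code <- substitute(expr)
  key <- key_of(code, data)
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, eval(code, data, caller), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

r1 <- run_cached(sum(x), list(x = 1:3))   # computed
r2 <- run_cached(sum(x), list(x = 1:3))   # same code and data: cache hit
r3 <- run_cached(sum(x), list(x = 1:4))   # data changed: recomputed
n_entries <- length(ls(cache))            # one cache entry per distinct key
```

A real implementation would also have to track nested function definitions and package versions, which is where the "computational context" discussion above comes in.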
If future.api becomes a real package, I would suggest including a basic implementation as well. So basically
library(future.api)
plan("sequential")
x <- future(log(1)) # log is in 'base'
value(x)
would just work and return 0.
Having a basic implementation included in future.api would mean that someone who has to run code which includes futures, but does not want to install the whole package ecosystem required to fully exploit the functionality of future, only needs to load future.api and nothing more.
One of the many great ideas of the future package is that it tries to be as lightweight as possible, and provide only the minimal toolset which implements the Future API.
Currently, the package imports the following packages:
The 'globals' package is not absolutely necessary, because the user can define the global variables herself (which is even more robust than using the heuristics of 'globals'). The 'listenv' package does not seem to be absolutely necessary, either. Hashing the objects seems unavoidable, but 'digest' is probably a bit of overkill for that purpose. (For example, 'fastdigest' is faster and much thinner, but also less actively maintained and less widely tested.)
So taken together, it is probably too late to get rid of those dependencies, but a 'future.api' package could be created which has literally no dependencies outside of base R packages, and provides all crucial building blocks of the Future API.
@HenrikBengtsson, what do you think?