HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

Even less dependencies? #374

Open tdeenes opened 4 years ago

tdeenes commented 4 years ago

One of the many great ideas of the future package is that it tries to be as lightweight as possible, and provides only the minimal toolset which implements the Future API.

Currently, the package imports the following packages:

The 'globals' package is not absolutely necessary, because the user can define the global variables herself (which is even more robust than relying on the heuristics of 'globals'). The 'listenv' package does not seem to be absolutely necessary, either. Hashing the objects seems unavoidable, but 'digest' is probably a bit of an overkill for that purpose. (For example, 'fastdigest' is faster and much thinner, but it is also less actively maintained and has not been tested by thousands of users.)
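As a minimal sketch of what "defining the globals herself" can look like: the `globals` argument of future() accepts a character vector of variable names or a named list of values, which bypasses the automatic identification. The variables `a` and `b` are made up for illustration.

```r
library(future)
plan(sequential)

a <- 2
b <- 3

# Name the globals explicitly instead of relying on automatic detection
f1 <- future(a * b, globals = c("a", "b"))

# Or pass the values directly as a named list
f2 <- future(a * b, globals = list(a = 10, b = 4))

v1 <- value(f1)  # uses a = 2, b = 3 from the calling environment
v2 <- value(f2)  # uses the values supplied in the list
```

The named-list form is the more explicit of the two, since the future then does not depend on what `a` and `b` happen to be in the calling environment.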

So taken together, it is probably too late to get rid of those dependencies, but a 'future.api' package could be created which has literally no dependencies outside of base R packages, and provides all crucial building blocks of the Future API.

@HenrikBengtsson, what do you think?

HenrikBengtsson commented 4 years ago

Thanks for this. I'm happy to see this - it looks like you have identified the essence of the Future API and how I see its role in the R ecosystem. Some comments:

When I started this project, I debated whether I should release it as two packages: 'future' and 'future.parallel', where the 'future' package would provide a lightweight, non-functional Future API and 'future.parallel' would implement the parallel backends that we can get from the 'parallel' package. I went back and forth and settled on one merged package 'future' because (a) 'parallel' comes with every R installation and (b) it would be less confusing for beginners, e.g. should they do library(future) or library(future.parallel), or both?

Later, I have indeed played with the idea of pulling out a 'future.api' package from the current 'future', which then pretty much serves the above ideas of a 'future' and 'future.parallel' package. I even spent a fair amount of time to try to implement this migration. At least at the time, I ended up in circular dependencies that would make it really complicated to roll out this change on CRAN. I didn't shut down the idea but I left it on my backburner.

Having a pure 'future.api' package that provides basically an abstract implementation of the Future API, mostly consisting of help pages, generic functions, and some helper functions would help clarify what futures are all about. In my dreams, having such a thin Future API might increase the chances for it to one day make it into base R. The way I think of it, is that the 'parallel' package could build on top of that and not the other way around, e.g. parallel::mcparallel() is basically a future() call and parallel::mccollect() is basically a value() call.
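To make the mcparallel()/mccollect() analogy concrete, here is a small sketch using only the 'parallel' package that ships with R. It relies on forking, so it is expected to work on Unix-alikes but not on Windows.

```r
library(parallel)

# Start evaluating the expression in a forked child process;
# this plays the role of future()
job <- mcparallel(log(1))

# Wait for the child and collect its result; this plays the role of value().
# mccollect() returns a list with one element per collected job.
res <- mccollect(job)
result <- res[[1]]
```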

To your specifics about the existing dependencies:

Now, could 'listenv' and 'globals' be listed under 'Suggests:'? Possibly. I need to think about this but a major disadvantage would be that install.packages("future") would not install those by default and then things would not work for people out of the box, causing confusion and so on. The take on this might be very different if there's ever a standalone 'future.api' that 'future' lives on top of.
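For illustration, a package that lists 'globals' under 'Suggests:' would typically guard its use with requireNamespace(). The sketch below assumes a hypothetical helper `find_globals()`, which is not part of 'future'; it only shows the pattern and the kind of error a user without 'globals' installed would hit.

```r
# Hypothetical helper, not part of 'future': use explicit globals when given,
# otherwise fall back to the optional 'globals' package.
find_globals <- function(expr, globals = NULL) {
  if (!is.null(globals)) {
    return(globals)
  }
  if (!requireNamespace("globals", quietly = TRUE)) {
    stop("Automatic detection of globals requires the 'globals' package; ",
         "install it or pass 'globals' explicitly.", call. = FALSE)
  }
  globals::findGlobals(expr)
}

# Works without 'globals' installed because the names are given by hand
explicit <- find_globals(quote(a * b), globals = c("a", "b"))
```

The error branch is exactly the "does not work out of the box" experience described above, which is the cost of moving a dependency from 'Imports:' to 'Suggests:'.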

BTW, having a pure 'future.api' would drop the dependencies on the 'parallel' package too.

Having said all of the above, what's your need/background for this issue/feature request/discussion?

tdeenes commented 4 years ago

Thank you Henrik for the very detailed response. It seems we are on the same track :)

As for your notes:

Now, could 'listenv' and 'globals' be listed under 'Suggests:'? Possibly. I need to think about this but a major disadvantage would be that install.packages("future") would not install those by default and then things would not work for people out of the box, causing confusion and so on.

I think future is used by so many packages and users nowadays that such a change would cause havoc. This is why I suggested the introduction of future.api as a separate package that future would import.

What's my need/background for this issue/feature request/discussion?

  1. In general:

1.1 The fewer dependencies the better: I do favor the tinyverse approach. In a production context, any extra dependency just causes additional pain - if I do specify the global variables "by hand", why shall I depend on the globals package? (IMPORTANT: It might seem that I do not like the globals package. On the contrary, I think it is a great package which works pretty well in the majority of my use cases. The point is that even if a package is well written and useful in general, it should not be added as a dependency of another package unless it is absolutely necessary. Just think of the tidyverse packages which cross-reference each other for tiny functionalities.)

1.2 The more granular the better: Unless the underlying code base dictates otherwise, an R package should focus on one particular problem. So instead of having one package with 5 broad functionalities, it is better to have 5 separate packages, one for each functionality, plus maybe an extra "umbrella" package for the lazy users (cf. library(tidyverse)). This might contradict 1.1, but the point is that a package developer or a power user does not want to depend on a code base which she/he never uses. So the number of dependency packages might increase, but the total complexity of the dependencies will be much lower.

  2. In particular:

As you just mentioned, I hope future could be part of base R. This would be awesome, and since the core API is so great both in terms of usability and flexibility, and the code base itself is just so R-ish (no tricks, no C magic, so seemingly "easy" to maintain), and the functionality it provides is so important nowadays, this hope might become a reality if all "extras" can be stripped off. I know that the R Core Team is very hesitant to include a user-contributed package in base R, but maybe this particular package could be an exception.

In addition - currently just for fun, but motivated by some internal packages that I had to develop in a production setup - I am playing with the idea of a package or rather related packages which would take the "deferred execution" idea to the extreme. Futures, caching, reactive expressions etc. are just different sides of the same coin: I want to run this specific code (with all nested calls, R and system dependencies etc. included) on this particular data (which can be of course the result of a call on another set of data). I want to invalidate the result and re-run the code if either the code or the data changes.

If I want to be totally general, I have to even define how to recreate the computational context (e.g., spin up a docker container with all versioned system and package dependencies), and serialize the whole data and recreate the whole call chain. But if the user knows the context, she can make shortcuts - the package shall not control that part, the package shall just provide the possibility to do it. Exactly as you do in future or in progressr: it is the user who defines the plan, the globals to be exported etc.
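The "re-run only if the code or the data changes" idea can be sketched in base R by keying a cache on a hash of the expression plus its input data. Everything here (`hash_object()`, `run_cached()`, the `state` counter) is hypothetical and only illustrates the principle; it ignores nested calls, package versions and system dependencies.

```r
# Hypothetical helper: hash any R object using only 'base' and 'tools'
hash_object <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(serialize(x, connection = NULL), path)
  unname(tools::md5sum(path))
}

cache <- new.env()          # hash key -> cached result
state <- new.env()          # bookkeeping for the demonstration below
state$n_evals <- 0L

# Evaluate 'expr' with 'data' (a named list) unless an identical
# code + data combination has been evaluated before
run_cached <- function(expr, data) {
  key <- hash_object(list(code = expr, data = data))
  if (!exists(key, envir = cache, inherits = FALSE)) {
    state$n_evals <- state$n_evals + 1L
    assign(key, eval(expr, envir = data, enclos = baseenv()), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

code <- quote(sum(x) / length(x))
r1 <- run_cached(code, list(x = c(1, 2, 3)))  # evaluated
r2 <- run_cached(code, list(x = c(1, 2, 3)))  # same code and data: cached
r3 <- run_cached(code, list(x = c(1, 2, 6)))  # data changed: evaluated again
```

In this sketch the user decides what counts as "the data" by choosing what goes into the list, which mirrors the point that the package should make shortcuts possible rather than control them.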

Totally pure future.api

If future.api becomes a real package, I would suggest including the basic implementation as well. So basically

library(future.api)
plan("sequential")
x <- future(log(1)) # log is in 'base'
value(x)

would just work and return 0.

Including a basic implementation in future.api would mean that someone who has to run code containing futures, but does not want to install the whole package ecosystem needed to fully exploit future, would only need to load future.api and nothing more.
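To give a feel for how small such a basic implementation could be, here is a rough base-R sketch of a sequential-only future with S3 generics. It is not the actual 'future' code; the names are hypothetical and prefixed with `mini_` so that they do not mask the real functions.

```r
# Hypothetical sketch: evaluate eagerly in a local environment and
# keep either the value or the error for later
mini_future <- function(expr, envir = parent.frame()) {
  expr <- substitute(expr)
  result <- tryCatch(
    list(value = eval(expr, envir = new.env(parent = envir))),
    error = function(e) list(error = e)
  )
  structure(result, class = c("MiniSequentialFuture", "MiniFuture"))
}

# Generics that other backends could provide methods for
mini_value <- function(x, ...) UseMethod("mini_value")
mini_resolved <- function(x, ...) UseMethod("mini_resolved")

mini_value.MiniSequentialFuture <- function(x, ...) {
  # Re-signal an error captured at evaluation time
  if (!is.null(x$error)) stop(x$error)
  x$value
}

# A sequential future is always resolved once it has been created
mini_resolved.MiniSequentialFuture <- function(x, ...) TRUE

x <- mini_future(log(1))
out <- mini_value(x)
```

A parallel backend would then only need to supply its own constructor and its own methods for the two generics, which is roughly the split between an API package and backend packages discussed above.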