Closed HenrikBengtsson closed 1 year ago
We won't introduce a dependency on parallelly, even though it provides useful functionality.
You're welcome to copy the source https://github.com/HenrikBengtsson/parallelly/blob/develop/R/availableConnections.R
While 'NCONNECTIONS' might not change often, I think it is worth adding a unit test for it. In the test, we can try to create .snowCoresMax()+1 connections and see whether it fails. If it does not fail, the hardcoded number is no longer accurate. That way we would not need to track changes on the R side ourselves.
@mtmorgan If you agree, I can create this unit test.
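A minimal sketch of what that exhaustion-style test could look like (untested; .snowCoresMax() is BiocParallel's hardcoded limit, and the `- 3L` offset is an assumption: three of R's connection slots are reserved for stdin/stdout/stderr, so stock R with NCONNECTIONS = 128 allows 125 user connections):

```r
## Sketch only: open file connections until R refuses, count how many
## succeeded, and always release them again on exit.
count_open_connections <- function() {
  cons <- list()
  on.exit(lapply(cons, close), add = TRUE)  # release everything we opened
  repeat {
    con <- tryCatch(file(tempfile(), open = "w"), error = function(e) NULL)
    if (is.null(con)) break                 # "all connections are in use"
    cons[[length(cons) + 1L]] <- con
  }
  length(cons)
}

## If this assertion fails, the hardcoded value is out of date:
## stopifnot(count_open_connections() == .snowCoresMax() - 3L)
```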
I believe the last time this was discussed on R-core, the alternative under consideration was something dynamic -- I think (this is just a recollection) that the current implementation allocates a fixed-size vector of connection structs at the C level, the alternative being to maintain a hash table or similar. Some kind of fancier data structure would be necessary because, currently, I believe R processes connections by walking that vector, so performance scales linearly with the size of the connection pool. So likely, at the next change in connection limits, the pool will be 'infinite', subject only to system constraints.
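To make the recollection above concrete, here is an illustrative sketch -- not R's actual source, just the pattern described: a fixed-size array of connection structs, searched by a linear walk, so allocation and lookup cost grow with the pool size and the pool eventually reports exhaustion.

```c
/* Illustrative sketch only, NOT R's implementation: a fixed-size pool
 * of connection structs, allocated by linear scan. */
#include <string.h>

#define NCONNECTIONS 128  /* fixed pool size, as in current stock R */

typedef struct {
    int  in_use;
    char description[64];
} Rconn;

static Rconn pool[NCONNECTIONS];

/* Find a free slot by walking the array; returns its index,
 * or -1 when the pool is exhausted. */
static int alloc_connection(const char *desc) {
    for (int i = 0; i < NCONNECTIONS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            strncpy(pool[i].description, desc,
                    sizeof pool[i].description - 1);
            pool[i].description[sizeof pool[i].description - 1] = '\0';
            return i;
        }
    }
    return -1;  /* "too many open connections" */
}
```

Replacing the array with a hash table would remove both the linear scan and the hard cap, which is presumably what "something dynamic" would mean.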
I'm also nervous about the side effects of using all the connections -- will R clean up appropriately, or will our test create a race condition or some other state where R thinks the connection pool is exhausted?
Is NCONNECTIONS available in C? I mean in a way consistent with the API, nothing devious. The unit test could just check that it is defined (if not, then we need to see what changes have been made in R!), and the code could be modified to use the detected value rather than hard-coding 128.
NCONNECTIONS is a macro defined in the R source code; it is not in the header files, so it is not visible to us.
@HenrikBengtsson's method also exhausts all connections to find the maximum limit, so I guess using all connections is not as bad as we thought. Another "safer" but less robust method is to check the help page in the unit test:
> db <- tools::Rd_db("base")
> helptext <- db["connections.Rd"]
> grepl("A maximum of 128 connections", helptext)
[1] TRUE
If R Core wants to make any fancy change to connections, they ought to update the documentation as well...
I guess I was thinking that one can find the number of elements in a C array (https://stackoverflow.com/a/10283391/547331), and we know the address of the first element via getConnection(0) and the size of a connection pointer, so maybe the pieces can be put together... Also, I see at the top of R_ext/Connections.h

#define R_CONNECTIONS_VERSION 1

which would seem to be a good thing to check...
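One caveat with the Stack Overflow trick: `sizeof(a)/sizeof(a[0])` only works in a scope where the array's declaration is visible. Through a pointer -- which is all code outside R's own connections.c could ever obtain -- the element count is lost. A minimal illustration (the `Rconn` and `pool` names here are stand-ins, not R's real symbols):

```c
/* Why the sizeof trick cannot recover the pool size from outside. */
#include <stddef.h>

typedef struct { int in_use; } Rconn;  /* stand-in for R's struct */

static Rconn pool[128];                /* stand-in for the real pool */

#define NELEMS(a) (sizeof(a) / sizeof((a)[0]))

/* Works: the array declaration is in scope here. */
size_t pool_size_from_array(void) { return NELEMS(pool); }

/* Does not work: sizeof a pointer yields the pointer's size,
 * not the length of the array it points into. */
size_t pool_size_from_pointer(const Rconn *p) {
    return sizeof p / sizeof *p;
}
```

So checking R_CONNECTIONS_VERSION is feasible, but counting the pool through a pointer is not.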
I'm also thinking of the use case where someone increases the C macro NCONNECTIONS when building R. We're thinking of doing that in our UCSF HPC environments, because we now have several machines with 128 and also 192 cores. Bumping up NCONNECTIONS allows users to run larger localhost PSOCK clusters. It will also allow users to run larger multi-host PSOCK clusters, where HPC users are already hitting the limit today.
About stability: freeConnections() has been in use by makeClusterPSOCK() since v1.23.0 (2021-01-04). If there are too few free connections, makeClusterPSOCK() detects this immediately, before even attempting to create workers. The very common future plan(multisession) and plan(cluster, ...) backends use parallelly::makeClusterPSOCK(). Thus far, I haven't heard of any issues related to the availableConnections()/freeConnections() strategy for inferring NCONNECTIONS. FWIW, parallelly is among the top 1-1.5% most downloaded CRAN packages.
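For reference, a usage sketch of that strategy (assuming parallelly is installed; the capacity numbers depend on how R was built):

```r
## Infer connection capacity at run time instead of hardcoding 128.
library(parallelly)

availableConnections()  # total connections this build of R supports
freeConnections()       # how many are still free right now

## e.g. cap a localhost PSOCK cluster at what is actually available:
workers <- min(availableCores(), freeConnections())
cl <- makeClusterPSOCK(workers)
parallel::stopCluster(cl)
```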
I tried to track all R-devel discussions about NCONNECTIONS in https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28.
FYI, R-devel (>= 4.4.0) gained the command-line argument --max-connections=N, so soon the limit can be increased to whatever the user prefers; cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28#issuecomment-1568786005
Thanks @HenrikBengtsson. I know that availableConnections() discovers this, but scanning the R-core commit, it does not look like the maximum number of connections is exposed to the user, directly or indirectly?
... scanning the R-core commit it does not look like the maximum number of connections is exposed to the user directly or indirectly?
Correct. It would be great to have, and I'd expect this to become a common feature request for base R at some point (I just don't have time to push for it myself right now).
I also think it would be great to be able to control the default --max-connections value via an environment variable, e.g. in ~/.Renviron, or globally by a sysadmin in /etc/profile.d/. That would spare everyone lots of FAQs and simplify the documentation needed.
FYI, instead of:

https://github.com/Bioconductor/BiocParallel/blob/c1b26bc64e114d3453c9ebb5ec4ea95e80cf2f4b/R/SnowParam-class.R#L65

you can use parallelly::freeConnections(). It does not make assumptions about the NCONNECTIONS=128 limit that R currently has, i.e. it will also work if someone builds R with, say, NCONNECTIONS=1024 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28). I'd expect R to increase this hardcoded limit one of these days, given that machines with 192 and 256 cores are getting more common.