Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

parallelly::freeConnections() infers number of non-used connections (not hardcoded based on 128) #230

Closed HenrikBengtsson closed 1 year ago

HenrikBengtsson commented 1 year ago

FYI, instead of:

https://github.com/Bioconductor/BiocParallel/blob/c1b26bc64e114d3453c9ebb5ec4ea95e80cf2f4b/R/SnowParam-class.R#L65

you can use parallelly::freeConnections(). It does not make assumptions about the NCONNECTIONS=128 limit that R currently has, i.e. it will also work if someone builds R with, say, NCONNECTIONS=1024 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28). I'd expect R to increase this hardcoded limit one of these days, given that machines with 192 and 256 cores are becoming more common.
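For illustration, here is a minimal base-R sketch of the empirical idea (not parallelly's actual implementation): open throwaway connections until R refuses, count them, then close them all. This makes no assumption about the value of NCONNECTIONS.

```r
# Count how many more connections can be opened in this session, without
# assuming NCONNECTIONS = 128. A sketch of the idea only; parallelly's
# real implementation is more careful.
countFreeConnections <- function() {
  cons <- list()
  repeat {
    # textConnection() errors with "all connections are in use" at the limit
    con <- tryCatch(textConnection("dummy"), error = function(e) NULL)
    if (is.null(con)) break
    cons[[length(cons) + 1L]] <- con
  }
  for (con in cons) close(con)  # release everything we opened
  length(cons)
}
n <- countFreeConnections()
n  # typically 125 in a fresh session of a default build (128 minus stdin/stdout/stderr)
```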

mtmorgan commented 1 year ago

We won't introduce a dependency on parallelly, even though it provides useful functionality.

HenrikBengtsson commented 1 year ago

You're welcome to copy the source https://github.com/HenrikBengtsson/parallelly/blob/develop/R/availableConnections.R

Jiefei-Wang commented 1 year ago

While 'NCONNECTIONS' might not change a lot over time, I think it might be worth adding a unit test for it.

In the test, we can try to create .snowCoresMax()+1 connections and see if it fails. If it does not, the hardcoded number is no longer accurate. That way we would not need to track changes on the R side.

@mtmorgan If you agree, I can create this unit test
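A hedged sketch of what such a unit test could look like; .snowCoresMax is stubbed below with the currently hardcoded value, whereas the real test would call BiocParallel's internal helper:

```r
# Stub for BiocParallel's internal helper; the real test would use
# BiocParallel:::.snowCoresMax() instead.
.snowCoresMax <- function() 128L

cons <- list()
overLimit <- tryCatch({
  # stdin/stdout/stderr always occupy connection slots, so opening
  # .snowCoresMax() + 1 further connections must fail as long as the
  # hardcoded value still matches R's NCONNECTIONS.
  for (i in seq_len(.snowCoresMax() + 1L)) cons[[i]] <- textConnection("dummy")
  FALSE  # reached only if R's limit has been raised
}, error = function(e) TRUE)
for (con in cons) close(con)  # clean up so later tests are not starved
stopifnot(overLimit)          # fails if 128 is no longer accurate
```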

mtmorgan commented 1 year ago

I believe the last time this was discussed in R-core the alternative consideration was something dynamic -- I think (this is just a recollection) that the current implementation allocates a fixed-size vector of connection structs at the C level, an alternative being to maintain a hash or similar. Some kind of fancier data structure would be necessary, because currently I believe code in R processes connections by walking the vector, so performance scales linearly in the size of the connection pool. So likely at the next change in connection limits the pool will be 'infinite' subject to system constraints.

I'm also nervous about the side effects of using all the connections -- will R clean up appropriately, or will our test create some kind of race condition, or otherwise leave R thinking the connection pool is exhausted?

Is NCONNECTIONS available in C? I mean in a way consistent with the API, nothing devious. The unit test could just check that it is defined (if not, then we need to see what changes have been made in R!), and the code could be modified to use the detected value rather than hard-coding 128.

Jiefei-Wang commented 1 year ago

NCONNECTIONS is a macro defined in the R source code; it is not in the header files, so it is not visible to us.

@HenrikBengtsson's method also exhausts all connections to find the maximum limit, so I guess using all connections is not as bad as we thought. Another "safer" but less robust method is to check the help page in the unit test:

> db <- tools::Rd_db("base")
> helptext <- db["connections.Rd"]
> grepl("A maximum of 128 connections", helptext)
[1] TRUE

If R Core wants to make any fancy change to connections, they ought to update the documentation as well...

mtmorgan commented 1 year ago

I guess I was thinking that one can find the number of elements in a C array (https://stackoverflow.com/a/10283391/547331), and we know the address of the first element via getConnection(0) and the size of a connection pointer, so maybe the pieces can be put together... Also I see at the top of R_ext/Connections.h

#define R_CONNECTIONS_VERSION 1

which would seem to be a good thing to check...

HenrikBengtsson commented 1 year ago

I'm also thinking of the use case where someone increases the C macro NCONNECTIONS when building R. We're thinking of doing that in our UCSF HPC environments, because we now have several machines with 128 and also 192 cores. Bumping up NCONNECTIONS lets users run larger localhost PSOCK clusters. It will also let users run larger multi-host PSOCK clusters, where HPC users are already hitting the limit today.

About stability: freeConnections() has been used by makeClusterPSOCK() since v1.23.0 (2021-01-04). If there are too few free connections, makeClusterPSOCK() detects this immediately, before even attempting to create workers. The very common future backends plan(multisession) and plan(cluster, ...) use parallelly::makeClusterPSOCK(). Thus far, I haven't heard of any issues related to the availableConnections()/freeConnections() strategy for inferring NCONNECTIONS. FWIW, parallelly is among the top 1-1.5% most downloaded CRAN packages.

I tried to track all R-devel discussions about NCONNECTIONS in https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28.

HenrikBengtsson commented 1 year ago

FYI, R-devel (>= 4.4.0) gained the command-line argument --max-connections=N, so soon the limit can be increased to whatever the user prefers, cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28#issuecomment-1568786005
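For example (512 is just an illustrative value):

```shell
# R >= 4.4.0: raise the connection limit at startup, without rebuilding R
# with a larger NCONNECTIONS
R --max-connections=512
```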

mtmorgan commented 1 year ago

Thanks @HenrikBengtsson. I know that availableConnections() discovers this, but scanning the R-core commit it does not look like the maximum number of connections is exposed to the user, directly or indirectly?

HenrikBengtsson commented 1 year ago

... scanning the R-core commit it does not look like the maximum number of connections is exposed to the user directly or indirectly?

Correct. It would be great to have, and I'd expect this to become a common feature request for base R at some point (I just don't have time to push for it myself right now).

I also think it would be great to be able to control the default --max-connections value via an environment variable, e.g. in ~/.Renviron or globally by a sysadmin in /etc/profile.d/. That would spare everyone lots of FAQ traffic and simplify the documentation needed.