facebookincubator / gloo

Collective communications library with various primitives for multi-machine training.

Support building with OpenSSL 3.x #358

Open geofft opened 1 year ago

geofft commented 1 year ago

OpenSSL 1.x reaches end-of-life in September, and recent distros like Ubuntu 22.04+ (last year) and Debian 12+ (next month) ship only OpenSSL 3.

I have gloo (inside PyTorch) working with OpenSSL 3.x, and as far as I can tell everything works fine. The APIs it uses are both API- and ABI-compatible between 1.1 and 3.x. (This is important because PyTorch configures gloo with USE_TCP_OPENSSL_LOAD, i.e., it dlopens the library instead of compiling against it.) But there are a few things to adjust:

  1. In #306 cmake does find_package(OpenSSL 1.1 REQUIRED EXACT), which fails on 3.0. Something like find_package(OpenSSL 1.1...<4.0 REQUIRED) would be better. Alternatively, perhaps this shouldn't be invoked at all in the USE_TCP_OPENSSL_LOAD case, since OpenSSL isn't needed at build time then?
  2. gloo/transport/tcp/tls/openssl.cc attempts to dlopen libssl.so if present, else libssl.so.1.1. The first library is only available if the development package for OpenSSL is installed, and that package can be any version (3.x, 4.x, etc.). It's probably safer to try only the versioned names, libssl.so.1.1 and libssl.so.3 (all 3.x releases use the same soname).
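
To make the second point concrete, here is a rough sketch of what a versioned-soname fallback could look like. This is not gloo's actual code: `LoadedLibrary`, `libssl_candidates`, and `open_first_available` are hypothetical names, and trying libssl.so.3 before libssl.so.1.1 is an assumption about the preferred order. It only uses `dlopen` from `<dlfcn.h>`.

```cpp
#include <dlfcn.h>

#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper type: the dlopen handle (nullptr if nothing loaded)
// and the soname that was actually opened.
struct LoadedLibrary {
  void* handle;
  std::string soname;
};

// Versioned sonames to probe. The order (3.x first) is an assumption.
// The unversioned "libssl.so" is deliberately left out: it only exists when
// a development package is installed and may point at any major version.
inline std::vector<std::string> libssl_candidates() {
  return {"libssl.so.3", "libssl.so.1.1"};
}

// Try each candidate in order and return the first one dlopen accepts.
// Returns {nullptr, ""} if none of them can be loaded.
inline LoadedLibrary open_first_available(
    const std::vector<std::string>& candidates) {
  assert(!candidates.empty());
  for (const auto& name : candidates) {
    // RTLD_LAZY defers symbol resolution until first use; RTLD_LOCAL keeps
    // the library's symbols out of the global namespace.
    if (void* handle = dlopen(name.c_str(), RTLD_LAZY | RTLD_LOCAL)) {
      return {handle, name};
    }
  }
  return {nullptr, ""};
}
```

A caller would do something like `auto lib = open_first_available(libssl_candidates());`, check `lib.handle` for nullptr, and then look up the needed functions with `dlsym`. On older glibc versions this has to be linked with `-ldl`.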

If a PR is helpful I can do the CLA dance, but hopefully this is simple enough that the more interesting thing is agreeing on what the change is.

thesamesam commented 1 year ago

OpenSSL 1.1.x is EOL on 2023-09-11.

@xunnanxu could you take a look please?

xunnanxu commented 1 year ago

Seems pretty reasonable to me. That said, I'm probably not the right person to review this. Maybe consider opening a linked issue in the PyTorch code base and tagging it with oncall: distributed to make sure this gets properly reviewed?