ComputeCanada / software-stack-config

Disable the new ofi (libfabric) btl on omnipath/ethernet in Open MPI #34

Closed bartoldeman closed 2 years ago

bartoldeman commented 2 years ago

This btl speeds up one-sided communication (MPI_Put, MPI_Get, and co) but allocates a second separate hardware context per process on omnipath.

So for instance, on 48-core Cedar nodes, once you use more than 24 MPI processes you run out of contexts (2 x 25 = 50 > 48 available contexts).

See https://github.com/open-mpi/ompi/issues/9575

An alternative for Cedar would be to set num_user_contexts=96 in /etc/modprobe.d/hfi1.conf (in general, 2x the number of cores per node). If that happens, we may be able to revert this change.
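
For reference, such an hfi1.conf would look roughly like this (a sketch only; the value should track the actual core count of each node type):

```
# /etc/modprobe.d/hfi1.conf (sketch)
# Allow two user contexts per core on a 48-core node, so every rank can
# open both its PSM2 context and the extra context the ofi BTL wants.
options hfi1 num_user_contexts=96
```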

mboisson commented 2 years ago

I would like to get a person from Cedar to comment on this.

bartoldeman commented 2 years ago

A little bit more background:

Open MPI, at the highest level, selects a PML (point-to-point message layer) for point-to-point communication.

On CC clusters we generally use one of three PMLs:

  1. ucx (UCX library https://openucx.org/) on IB clusters
  2. cm (a Highlander movie reference to Connor MacLeod: “there can be only one”), which in turn selects an MTL (matching transport layer), and there can be only one MTL. Cedar uses cm and the psm2 MTL by default, which talks directly to the lower-level opa-psm2 library, but you can instead use the ofi MTL (libfabric as an intermediary between Open MPI and opa-psm2) if you like (see the sketch after this list).
  3. ob1 (a Star Wars movie reference to Obi-Wan Kenobi), which in turn selects more than one (Obi-Wan is not MacLeod) BTL (byte transfer layer). This is what is used with ethernet (the self and tcp BTLs), on a single node without a high-speed interconnect (the self and vader BTLs, yes, another Star Wars reference), or, with some extra configuration, with the now-deprecated openib BTL, which is sometimes useful with DDT for visualizing message queues on Graham.
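
For concreteness, this is roughly how each selection can be forced by hand with MCA parameters (a sketch only: my_app is a placeholder, and the per-cluster defaults are of course set centrally rather than on the command line):

```
# cm PML with the psm2 MTL (the Cedar default):
mpirun --mca pml cm --mca mtl psm2 ./my_app

# cm PML going through libfabric instead:
mpirun --mca pml cm --mca mtl ofi ./my_app

# ob1 PML with the tcp and self BTLs (ethernet):
mpirun --mca pml ob1 --mca btl self,tcp ./my_app

# ucx PML on the InfiniBand clusters:
mpirun --mca pml ucx ./my_app
```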

Now, to add a further complication, Open MPI 4.1 introduced a new ofi (libfabric) BTL, which is not used by any PML; instead it is used by the osc framework, where osc stands for one-sided communication (MPI_Get, MPI_Put, and co, not used a lot in the wild as far as I know, and we never teach those in introductory MPI courses).
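
For anyone who hasn't run into one-sided MPI before, here is a minimal sketch of what it looks like at the application level (not tied to this PR; needs at least two ranks to run):

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal one-sided example: each rank exposes one int in a window,
 * and rank 0 writes into rank 1's window with MPI_Put. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = -1;                      /* memory exposed to remote ranks */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* open an access epoch */
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */,
                0 /* target displacement */, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);             /* close the epoch; the Put is now visible */

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```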

Even if those MPI functions are never used, the ofi BTL still initializes and reserves an additional hardware context per rank (the first one comes from opa-psm2, either via the psm2 or the ofi MTL). As there is one context per core available on Cedar, you run out of them as soon as you have more than n/2 ranks per node.

This can be fixed either by disabling the ofi BTL (as proposed in this PR, which basically brings things back in line with Open MPI 4.0, where this BTL isn't available; I believe one-sided comms may then go over the ethernet, but don't quote me on that), or by increasing the number of hardware contexts on Cedar nodes (e.g. num_user_contexts=96 in /etc/modprobe.d/hfi1.conf for nodes with 48 cores).
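
I don't want to presume the exact mechanism the PR uses, but the generic way to exclude the ofi component from the BTL framework is the caret syntax for MCA component lists, either in an openmpi-mca-params.conf or on the command line (sketch; my_app is a placeholder):

```
# In the site-wide or per-user openmpi-mca-params.conf:
btl = ^ofi

# Or as a one-off on the command line:
mpirun --mca btl '^ofi' ./my_app
```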

Now why does this only turn up now, in my testing of Open MPI 4.1.4 (not yet available as a module in the prod repo), and not with the existing 4.1.1 module you get with module load gcc/10 openmpi/4.1.1? libfabric 1.12.1 inadvertently uses psm3 instead of psm2 because of symbol collisions: https://github.com/ofiwg/libfabric/issues/7757. psm3 provides a psm2-style API but uses RoCE (ethernet) underneath, so the ofi BTL in Open MPI 4.1.1, via libfabric 1.12.1, actually goes over the ethernet on Cedar and no extra hardware context is reserved.
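
A quick way to check which provider actually gets picked up (a diagnostic sketch; fi_info ships with libfabric, and my_app is a placeholder):

```
# List the providers and endpoints libfabric discovers (psm2, psm3, tcp, ...):
fi_info

# Restrict libfabric to a single provider to compare behaviour:
FI_PROVIDER=psm2 mpirun -n 2 ./my_app

# Check whether this Open MPI build ships the ofi BTL at all:
ompi_info | grep 'btl: ofi'
```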

mboisson commented 2 years ago

Agreed with the Cedar team.