flux-framework / flux-test-collective

Holistic system testing and CI for multiple flux-framework projects
GNU Lesser General Public License v3.0

mpi: add test that physical cores are not oversubscribed #17

Closed · wihobbs closed this 7 months ago

wihobbs commented 8 months ago

Problem: A common failure in some MPI implementations is that all the MPI ranks end up packed on the same core. This usually shows up if pmix is configured improperly. It causes lock contention, achieves no parallelization, and is bad, so we want to catch if that's happening.

Add a test to our CI that collects the physical cpu number for each MPI rank, then checks that no two are the same. Note that this will fail if cores are deliberately oversubscribed (but they shouldn't be in CI).

This is a work in progress. A few things left to do:

garlick commented 8 months ago

I'm fuzzy on what the actual failure mode is here and what it might have to do with pmix? Can you elaborate or is there an issue to point to?

wihobbs commented 8 months ago

I'm fuzzy on what the actual failure mode is here and what it might have to do with pmix? Can you elaborate or is there an issue to point to?

A failure mode might have output that looks like this:

Hello from local rank 0 (global rank 0) on corona244 vcpu 47
Hello from local rank 1 (global rank 1) on corona244 vcpu 47
Hello from local rank 2 (global rank 2) on corona244 vcpu 47
Hello from local rank 3 (global rank 3) on corona244 vcpu 47

On corona, where there are 48 cores per node, it shouldn't be packing 4 MPI ranks on one core by default. (Note that I haven't actually ever seen this happen; I made that output up.)

If I try and force this behavior, it ends up looking like this:

 flux run --cores=1 --tasks-per-core=4 -o verbose=2 ./vcpu
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 SPLIT_TYPE FROM 0
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
PMI_Abort: (0) N/A
Hello from local rank 0 (global rank 0) on corona244 vcpu 47
33.249s: flux-shell[0]: FATAL: doom: rank 0 exited and exit-timeout=30s has expired

Re: what it might have to do with pmix...apparently, if pmix is configured improperly, it can have behavior similar to what I showed above -- packing a ton of MPI ranks on one core instead of spreading them out like it should. Again, I haven't ever seen it do this, but this was something @nhanford suggested we test for in case that behavior ever pops up.

One thing worth mentioning is that we considered using this code instead of the mpi hello-world test from flux-core (originally, that's how it was written) but @grondo suggested keeping it as a separate test to give us a broader testsuite.

Does that help? Sorry if it doesn't, I'm not an expert on this myself...

nhanford commented 8 months ago

Hello @garlick, this is a quick way of categorically ruling out Open MPI failing over to singleton mode when PMIx is misconfigured. Also, TCE-32 on the internal LC Jira shows an example along similar lines where affinity was being set incorrectly. I don't know if we ever ruled out PMIx on that one, but it is still worthwhile to detect discrepancies between what Flux set for affinity and where processes actually ended up, IMHO.

wihobbs commented 8 months ago

This is a quick way of categorically ruling out Open MPI failing over to singleton mode when PMIx is misconfigured.

Mmmm, I think because of the localRank communicator split this might not pick up if they all show up as a bunch of singletons... this is back to WIP while I investigate that.

garlick commented 8 months ago

Maybe a simpler test to detect multiple singletons would be to just have rank 1 print its rank: pass if it's 1, and fail if it's 0?

Edit: Duh, that wouldn't work -- but maybe just have it fail if the size is 1 and always run with more than 1 task?

Why the communicator split?

garlick commented 8 months ago

p.s. if @nhanford is on board here, this can be merged. I was just trying to understand.

wihobbs commented 8 months ago

I think we need to keep this test, since it's still looking at a behavior that could be problematic (core oversubscription when that shouldn't happen). However, I think we need an additional test to detect if multiple singletons are being created, because this test won't pick that up as it's currently designed.

One smoke test would be to have the rank 0 MPI process create a new file in /tmp/fluxci and call MPI_Abort if that file has already been created. Then, if there were multiple rank 0 instances, the test would fail.
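A minimal sketch of that file-based idea (illustrative only; the marker filename and error handling are hypothetical, not code from this PR):

```c
#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* O_EXCL makes the open fail if another "rank 0" already created
         * the marker, i.e. several singletons are running side by side.
         * Assumes /tmp/fluxci exists; the filename is made up for this
         * sketch. */
        int fd = open("/tmp/fluxci/rank0.marker",
                      O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0) {
            perror("marker file");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        close(fd);
    }

    MPI_Finalize();
    return 0;
}
```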

Thanks @garlick and @nhanford for the comments, you made me realize I had misunderstood what behavior we were looking for!

garlick commented 8 months ago

A file check would work but it would be easier and equivalent to fail if the mpi size is 1. (Each singleton will think it is rank 0 of a parallel program of size 1)
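A sketch of that size check (illustrative only, assuming the CI job always launches more than one task):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The test is always launched with more than one task, so a world
     * size of 1 means this process came up as a singleton. */
    if (size == 1) {
        fprintf(stderr, "FAIL: MPI world size is 1 (singleton fallback?)\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}
```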


wihobbs commented 8 months ago

Why the communicator split?

I'm going to explain what I thought we were trying to test for in vcpu.c and maybe that will clear up why this was designed in this way.

I thought we were looking for output that looked like this:

Hello from local rank 0 (global rank 0) on corona244 vcpu 47
Hello from local rank 1 (global rank 1) on corona244 vcpu 47
Hello from local rank 0 (global rank 2) on corona243 vcpu 46
Hello from local rank 1 (global rank 3) on corona243 vcpu 46

where it had unnecessarily packed multiple MPI ranks onto one core instead of spreading them out. To detect this, I had each rank fetch its rank number and its sched_getcpu number and write them to stdout, then hit a barrier, then report all the sched_getcpu numbers to an array via an MPI_Allgather call (this has the added benefit of testing point-to-point and all-to-all MPI communication, which we're already testing anyway, but I digress).

The rank 0 MPI task then iterated over that array to check that each entry was unique. This worked well for detecting errors on one node, but failed when I tried to run it across two, because sched_getcpu numbers can be duplicated across nodes -- each node has its own core #1, #2, #3, and so on.

In order to solve that, I split the communicators by node, and had each node iterate over its own array to check that no two sched_getcpu numbers were the same. If an overlap was detected, it would call MPI_Abort.
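For reference, a rough sketch of that design (not the exact vcpu.c in this repository; names and output format are approximate):

```c
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int grank;
    MPI_Comm_rank(MPI_COMM_WORLD, &grank);

    /* Split a node-local communicator so cpu numbers are only compared
     * among ranks on the same node. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);

    int lrank, lsize;
    MPI_Comm_rank(node, &lrank);
    MPI_Comm_size(node, &lsize);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    int cpu = sched_getcpu();
    printf("Hello from local rank %d (global rank %d) on %s vcpu %d\n",
           lrank, grank, host, cpu);

    /* Gather every node-local rank's cpu number. */
    int *cpus = malloc(lsize * sizeof(*cpus));
    MPI_Allgather(&cpu, 1, MPI_INT, cpus, 1, MPI_INT, node);

    /* Local rank 0 on each node checks that no two ranks landed on the
     * same physical cpu and aborts if any did. */
    if (lrank == 0) {
        for (int i = 0; i < lsize; i++)
            for (int j = i + 1; j < lsize; j++)
                if (cpus[i] == cpus[j])
                    MPI_Abort(MPI_COMM_WORLD, -1);
    }

    free(cpus);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```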

When I said:

Mmmm I think because of the localRank communicator split this might not pick up if they all show up as a bunch of singletons

I was actually wrong; this test won't pick up singletons, but that has nothing to do with the communicator split. This test won't pick up singletons because it relies on the assumption that all the ranks initialized when the test starts will report their core numbers to each other, which isn't the case if they're singletons (silly Hobbs). (And even if they could all report their core numbers, the numbers might still be unique.) So, in the case of n singletons being created, you'll have n arrays of size 1, each containing one sched_getcpu number, all of which will pass the uniqueness test and exit with a return code of 0.

wihobbs commented 8 months ago

A file check would work but it would be easier and equivalent to fail if the mpi size is 1.

Nice idea. I'll make this in a new PR. If we're good to keep this test, I'm removing WIP.

wihobbs commented 7 months ago

A lot of learning happened in this one. Thanks @grondo @garlick and @nhanford for all the help!

wihobbs commented 7 months ago

(and fun too 🎉 )