Configuring AMPI hardware resources and worker thread allocation

panops99 commented 1 year ago

I'm running AMPI on Charm 7.0.0 on a Linux cluster, and I'm trying to configure the resource utilization parameters so that AMPI performs as well as possible while using the same hardware resources as my baseline MPI program. In other words, I'm trying to create an apples-to-apples comparison between AMPI and MPI, with hardware resources held consistent.

In my baseline MPI program, I'm running an MPI rank on each of the hardware threads. For example, my machine has 32 hardware threads per node (with hyperthreads) so I'm able to run 32 MPI ranks per node. I see all of the options in the charm docs, including how many worker threads to use, etc. But, I'm a bit unsure how to configure the job to 1) use the same amount of hardware as my MPI baseline, and 2) make AMPI as fast as it can be.

It's clear from the Charm docs that SMP mode requires at least one worker thread per process for communication. Also, when I try to run in non-SMP mode allocating all cores to physical ranks, Charm throws an error (Reason: GNI_RC_CHECK). When I dial back the number of physical ranks to leave at least one core free, it doesn't throw the error. So I presume that non-SMP mode needs at least one worker thread as well, although I am not sure about this.

Specifically, in non-SMP mode, this command fails: srun --ntasks 32 --nodes 1 app +vp32

In non-SMP mode, this command succeeds: srun --ntasks 31 --nodes 1 app +vp32

In the latter one, I can see in htop that all hardware cores are running something -- presumably Charm is spinning up a worker thread but I'm not sure.

So, my main questions are:

1) Is there a generally recommended approach here to get the best Charm / AMPI performance for both non-SMP mode and SMP mode?
2) With non-SMP mode, is there a way to run a physical process per hardware core (i.e., match physical ranks to hardware threads)?

For both questions, it would be helpful to see recommended charmrun parameters to try (or a range of parameters to try that would be allowable by the charm runtime).

stwhite91 commented 1 year ago

SMP mode should provide the best performance, and in SMP mode you need to dedicate 1 thread per process to handle communication. In non-SMP mode, there are no "wasted" extra cores that you must dedicate to communication; however, intra-node communication will perform worse, and communication will not be overlapped with computation as effectively as in SMP mode.

Assuming you are trying to run with 32 VPs on 1 node with 32 cores, here's how to run:

SMP: srun --nodes 1 --ntasks 1 --cpus-per-task 32 ./app +vp32 +ppn 31 +pemap 0-30 +commap 31

Non-SMP: srun --nodes 1 --ntasks 32 --cpus-per-task 1 ./app +vp32

Note that in SMP mode, often it is beneficial to launch one process per socket due to NUMA effects. So if you have dual socket nodes with 16 cores per socket, the best performance might be had with:

SMP (dual socket): srun --nodes 1 --ntasks 2 --cpus-per-task 16 ./app +vp32 +ppn 15 +pemap 0-14,16-30 +commap 15,31
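The two SMP commands above follow one layout rule: split the node's cores into one contiguous block per process, use all but the last core of each block for worker threads, and give that last core to the communication thread. Here is a minimal Python sketch of that rule. The helper names `smp_layout` and `smp_command` are hypothetical and not part of Charm++. The sketch assumes core IDs are numbered contiguously within each socket, which may not hold on your machine (especially with hyperthreads), so check the numbering with your system's topology tools first.

```python
def smp_layout(cores_per_node, procs_per_node):
    """Compute +ppn / +pemap / +commap values for one node.

    Assumption (not from Charm++): each process owns a contiguous block
    of core IDs, and the last core in each block is reserved for that
    process's communication thread.
    """
    if procs_per_node < 1 or cores_per_node % procs_per_node != 0:
        raise ValueError("cores_per_node must be a positive multiple of procs_per_node")
    block = cores_per_node // procs_per_node
    if block < 2:
        raise ValueError("each SMP process needs a worker core plus a comm core")

    worker_ranges = []
    comm_cores = []
    for p in range(procs_per_node):
        first = p * block
        last_worker = first + block - 2  # last core of the block goes to comm
        if first == last_worker:
            worker_ranges.append(str(first))
        else:
            worker_ranges.append(f"{first}-{last_worker}")
        comm_cores.append(str(first + block - 1))

    return {
        "cpus_per_task": block,
        "ppn": block - 1,
        "pemap": ",".join(worker_ranges),
        "commap": ",".join(comm_cores),
    }


def smp_command(app, vps, cores_per_node, procs_per_node):
    """Format a single-node srun line in the same shape as the commands above."""
    lay = smp_layout(cores_per_node, procs_per_node)
    return (
        f"srun --nodes 1 --ntasks {procs_per_node} "
        f"--cpus-per-task {lay['cpus_per_task']} {app} +vp{vps} "
        f"+ppn {lay['ppn']} +pemap {lay['pemap']} +commap {lay['commap']}"
    )


if __name__ == "__main__":
    print(smp_command("./app", 32, 32, 1))  # one process per node
    print(smp_command("./app", 32, 32, 2))  # one process per socket
```

For 32 cores, calling it with 1 and 2 processes per node reproduces the single-process and dual-socket command lines given above.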

Note that for the first SMP run command above, AMPI will be running 32 VPs on 31 worker threads with 1 communication thread (the dual-socket run command will run 32 VPs on 30 worker threads with 2 communication threads). It will work because AMPI supports rank virtualization, but performance will suffer due to load imbalance from there being 2 ranks on 1 thread and 1 rank on each of the other threads. It is inherently difficult to compare MPI to AMPI in SMP mode because the dedicated communication thread consumes a core per process. You can also compare, say, AMPI in SMP mode with +ppn 15 +vp 30 versus MPI with 30 ranks, but it's still not quite apples-to-apples since AMPI is then using extra cores for communication.
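To make the load-imbalance arithmetic concrete, here is a small Python sketch that counts how many ranks land on each worker thread. It assumes ranks are spread as evenly as possible across workers. That is a simplification for illustration, not a statement about AMPI's actual default mapping, and the function names are hypothetical.

```python
def ranks_per_worker(num_vps, num_workers):
    """Return the rank count per worker thread under an even spread.

    Assumption: the runtime balances rank counts as evenly as it can,
    so `num_vps % num_workers` threads carry one extra rank.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker thread")
    base, extra = divmod(num_vps, num_workers)
    return [base + 1] * extra + [base] * (num_workers - extra)


def imbalance(num_vps, num_workers):
    """Ratio of the busiest thread's rank count to the mean (1.0 is balanced)."""
    counts = ranks_per_worker(num_vps, num_workers)
    return max(counts) / (num_vps / num_workers)


if __name__ == "__main__":
    cases = [
        ("SMP, 1 process", 32, 31),
        ("SMP, 2 processes", 32, 30),
        ("SMP, +ppn 15 +vp 30", 30, 30),
    ]
    for label, vps, workers in cases:
        counts = ranks_per_worker(vps, workers)
        print(f"{label}: {vps} VPs on {workers} workers, "
              f"max {max(counts)} ranks per thread, "
              f"imbalance {imbalance(vps, workers):.2f}")
```

If each rank does roughly equal work, the thread holding 2 ranks sets the pace for every step. That is why the 32-VP SMP runs could take close to twice as long per step as a balanced run, even though only one or two threads are overloaded.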

Edit: my original response had SMP and non-SMP mixed up! I've updated it with corrections.

panops99 commented 1 year ago

Thanks -- this clarifies a lot. I ran through the various examples you gave. The SMP ones work, but the non-SMP version fails for me (in the same way as I mentioned in the original post):

Non-SMP: srun --nodes 1 --ntasks 32 --cpus-per-task 1 ./app +vp32

Here's the complete error I get:

Charm++> Running on Gemini (GNI) with 32 processes
Charm++> static SMSG
Charm++> SMSG memory: 316.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 2048K
Charm++> Running in non-SMP mode: 32 processes (PEs)
  [31:2] test_app 0x204eb5f2 ConverseInit
  [31:3] test_app 0x2045f592 charm_main
  [31:4] libc.so.6 0x15555299d3ea __libc_start_main
Converse/Charm++ Commit ID: v70000
  [31:5] test_app 0x2011f4ea _start
aborting job:
GNI_RC_CHECK
srun: error: drac11: task 31: Exited with exit code 255

Of course, changing --ntasks to 31 works. Was that a typo in your response or is something wrong on my end?

stwhite91 commented 1 year ago

Not a typo, it should be --ntasks 32 for non-SMP mode so I'm not sure why it is failing. GNI_RC_CHECK is a pretty generic/vague error message. If you rebuild Charm++/AMPI with its --enable-error-checking option (as in ./build AMPI gni-crayxc -j16 --with-production --enable-error-checking -g) you may get a better error message / stack trace.

panops99 commented 1 year ago

As you suggested, I built a debug version of non-SMP AMPI and ran the same program. Here's the error trace:

[7, 0] memory reigstration 0.000301 G (1) ask for 8388608
[n 7 0] sweep_mempool slot START.
[n 7] sweep_mempool slot END.
[7] MEMORY_REGISTER; err=GNI_RC_ERROR_RESOURCE
------------- Processor 7 Exiting: Called CmiAbort ------------
Reason: GNI_RC_CHECK
[7] Stack Traceback:
  [7:0] test_app 0x204fbc10
  [7:1] test_app 0x205455d9 mempool_init
  [7:2] test_app 0x20505747
  [7:3] test_app 0x2050b0e7 ConverseInit
Charm++> Running on Gemini (GNI) with 32 processes
Charm++> static SMSG
Charm++> SMSG memory: 316.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 2048K
Charm++> Running in non-SMP mode: 32 processes (PEs)
Converse/Charm++ Commit ID: v70000
Charm++ built with internal error checking enabled.

This seems to be coming from src/arch/gni/Machine.C, around line 3858. There are a number of environment variables listed at the top of that file. I tried changing some of them and could get the error message to change a bit (that is, the number after "ask for" changes), but the error still occurred. While I'm tempted to think that changing one of these values will fix the problem, it's still odd to me that when I run with 31 tasks instead of 32, it works fine regardless of these settings.

stwhite91 commented 1 year ago

That is odd. I don't think I've ever seen this problem. What kind of Cray interconnect is this? Shasta, Aries, etc.?

panops99 commented 1 year ago

It does seem odd. It's Aries.

stwhite91 commented 1 year ago

This is generally how I built on a Cray Aries machine. I load the hugepages-8M module and either the PrgEnv-gnu or PrgEnv-intel compiler module.

./build AMPI-only gni-crayxc --with-production

You could also use the mpi-crayxc build if we can't figure this issue out. I don't think I have access to a Cray XC system anymore to try this myself. Perhaps someone in PPL does?

panops99 commented 1 year ago

The mpi-crayxc build works well, FYI. That is, I'm able to run with all 32 cores populated with ranks.

When I built the gni-crayxc build I did it with hugepages-2M and AMPI mode (not AMPI-only) using the PrgEnv-intel compiler module.

Anything else I should consider trying?

panops99 commented 1 year ago

I spoke too soon about mpi-crayxc. On non-SMP AMPI I'm getting this error on multinode runs (but not single node runs). Is this familiar?

Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 128 processes (PEs)
------------- Processor 121 Exiting: Called CmiAbort ------------
Reason: [121] Assertion "selfComm == (static_cast<MPI_Comm>(1000001))" failed in file /local/src/charm-v7.0.0_debug_no_smp/src/libs/ck-libs/ampi/ampi.C line 1115.

stwhite91 commented 1 year ago

> Anything else I should consider trying?

You could try with 8M hugepages and PrgEnv-gnu, but I don't expect that either will make a difference. I would recommend trying the current main branch of Charm, rather than v7.0.0.

> Reason: [121] Assertion "selfComm == (static_cast<MPI_Comm>(1000001))" failed

This failed assertion is happening during initialization where AMPI creates MPI_COMM_SELF. I can't say I've seen that particular failure either.

evan-charmworks commented 1 year ago

Yes, try the latest in Git. I believe we fixed this issue in #3271 and #3308.

Regarding the GNI_RC_CHECK failure, we removed the GNI mempool from Isomalloc because it had correctness issues, so it may be showing up here as well.

panops99 commented 1 year ago

Thanks. The latest in git does fix the MPI_COMM_SELF bug mentioned above. This allows me to run multinode non-SMP mpi-crayxc applications.

I also tried the non-SMP gni-crayxc build with the latest but that problem persists for me.

charmplusplus / charm

Configuring AMPI hardware resources and worker thread allocation #3728