flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Excessive launch time on MVAPICH on Sierra #1606

Closed dongahn closed 6 years ago

dongahn commented 6 years ago

In support of @koning, Adam Moody and I are looking at ways to support CUDA-aware MPI with MVAPICH on Sierra, and we are seeing excessive launch times at 160 MPI processes.

sierra4360{dahn}51: ml cuda/9.1.85
sierra4360{dahn}52: mvapich2-2.3/install-gnu-cuda-opt-sierra-flux/bin/mpicc -cc=xlc -g -O0 virtual_ring_mpi.c -o virtual_ring_mpi
sierra4360{dahn}58: bsub -nnodes 4 -Is -XF -G guests /usr/bin/tcsh
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on sierra4371>>
sierra4371{dahn}23: ml hwloc/1.11.10-cuda
sierra4371{dahn}24: ml cuda/9.1.85

sierra4371{dahn}31: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 16 -N 4 -c 10 -g 1 virtual_ring_mpi
[sierra1279:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
rcvbuf: 15
0.07user 0.01system 0:05.61elapsed 1%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1803minor)pagefaults 0swaps

sierra4371{dahn}32: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 16 -N 4 -c 10 virtual_ring_mpi
[sierra1279:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
rcvbuf: 15
0.08user 0.01system 0:05.64elapsed 1%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1803minor)pagefaults 0swaps

sierra4371{dahn}33: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 160 -N 4 virtual_ring_mpi
[sierra1279:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
rcvbuf: 159
0.08user 0.01system 0:23.30elapsed 0%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1803minor)pagefaults 0swaps

Test code:

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <signal.h>
#include <sys/time.h>
#define COMM_TAG 1000

static double get_secs(struct timeval* tv2, struct timeval* tv1)
{
  struct timeval result;
  timersub(tv2, tv1, &result);
  return (double) result.tv_sec + (double) result.tv_usec / 1000000.0;
}

/* Pass each rank's value around a ring: receive from the previous rank,
 * send to the next, then Allreduce(MAX) so rank 0 can report size-1. */
void pass_its_neighbor(const int rank, const int size, int* buf)
{
  MPI_Request request[2];
  MPI_Status status[2];
  int rcvbuf;

  MPI_Irecv((void*)buf, 1, MPI_INT, ((rank+size-1)%size), COMM_TAG, MPI_COMM_WORLD, &request[0]);
  MPI_Isend((void*)&rank, 1, MPI_INT, ((rank+1)%size), COMM_TAG, MPI_COMM_WORLD, &request[1]);
  MPI_Waitall(2, request, status);
  MPI_Allreduce((void *) &rank, (void *) &rcvbuf, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

  if (rank==0) {
    fprintf(stdout, "size: %d \n", size);
    fprintf(stdout, "rcvbuf: %d \n", rcvbuf);
  }

  MPI_Barrier(MPI_COMM_WORLD);
}

int main(int argc, char* argv[])
{

  int size, rank;
  int *buf = (int*) malloc(sizeof(int));
  struct timeval start, end;

  gettimeofday (&start, NULL);
  MPI_Init(&argc, &argv);
  gettimeofday (&end, NULL);
  double elapse = get_secs (&end, &start);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double elapse_max;
  MPI_Allreduce (&elapse, &elapse_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

  if (rank == 0)
  printf ("MPI_Init time is %f\n", elapse_max);

  *buf=rank; /* we only pass around rank*/
  pass_its_neighbor(rank, size, buf);
  free (buf);

  MPI_Finalize();

  return 0;
}
grondo commented 6 years ago

Can you print the time spent in MPI_Init? E.g. use t/mpi/hello.c from flux-core?

dongahn commented 6 years ago

OK. I changed the test code (the posting above has been updated) and measured MPI_Init().

sierra4371{dahn}52: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 160 -N 4 virtual_ring_mpi
[sierra1214:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 16.738075
size: 160
rcvbuf: 159
0.08user 0.00system 0:24.59elapsed 0%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1803minor)pagefaults 0swaps

It seems there is really something wrong with MVAPICH initialization. Running true is 4x faster.

sierra4371{dahn}54: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 160 -N 4 true
0.08user 0.01system 0:06.49elapsed 1%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1803minor)pagefaults 0swaps
dongahn commented 6 years ago

Moving email discussions to here:

From Moody:

Not sure what’s going on with the hwloc message there. Just guessing, but I think my_local_id should give the local rank of each process on the node. This is something that MPI computes internally. For this to be -1 suggests that it’s not initialized correctly. If that’s really not working, it may have other implications for things like collective performance. I’ll try some runs without flux to see if it also shows up there.

As for CUDA version, it should build ok against the latest recommended CUDA. I just grabbed an older script and forgot to update that part. You should be able to update the CUDA module and path and rebuild.

As for CUDA tests, mvapich also installs the osu_benchmarks. You can run a test like:

from gpu buffer to gpu buffer: MV2_USE_CUDA=1 jsrun … /path/to/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency D D

from host buffer to host buffer: MV2_USE_CUDA=1 jsrun … /path/to/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency H H

from host buffer to gpu buffer: MV2_USE_CUDA=1 jsrun … /path/to/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency H D

There are also osu_bw and osu_bibw for unidirectional and bidirectional bandwidth tests.

dongahn commented 6 years ago

In a build where I’m using mvapich’s mpirun_rsh to launch, I’m not seeing that hwloc error. I even loaded the same hwloc module.

Also, the startup is coming in around 5 seconds for 160 procs on 4 nodes for me:

sierra4367{moody20}40: time mvapich2-2.3/install-gnu-cuda-opt-sierra/bin/mpirun_rsh -rsh -np 160 -hostfile $LSB_DJOB_HOSTFILE ./mvapich2-2.3/install-gnu-cuda-opt-sierra/libexec/osu-micro-benchmarks/mpi/startup/osu_init
# OSU MPI Init Test v5.4.3
nprocs: 160, min: 5063 ms, max: 5207 ms, avg: 5111 ms
0.385u 0.142s 0:07.49 6.9%      0+0k 19912+256io 20pf+0w

sierra4367{moody20}41: time mvapich2-2.3/install-gnu-cuda-opt-sierra/bin/mpirun_rsh -rsh -np 160 -hostfile $LSB_DJOB_HOSTFILE ./mvapich2-2.3/install-gnu-cuda-opt-sierra/libexec/osu-micro-benchmarks/mpi/startup/osu_init
# OSU MPI Init Test v5.4.3
nprocs: 160, min: 4966 ms, max: 5072 ms, avg: 5038 ms
1.684u 0.152s 0:08.54 21.4%     0+0k 0+256io 0pf+0w

This is running through the mvapich mpirun PMI-1 interface.

dongahn commented 6 years ago

@garlick or @grondo: maybe it's time to trace PMI. How do you turn on PMI tracing within Flux?

grondo commented 6 years ago

Try setting FLUX_PMI_DEBUG=1 before launching Flux. (edit: oh, that might not be the one you want, should have waited for @garlick)

Also try -o trace-pmi-server option to wreckrun.

dongahn commented 6 years ago

(@adammoody -- so that Adam will get notified)

Using FLUX_PMI_DEBUG=1, it looks like MVAPICH has a PMI pattern in which every task performs two PMI Gets for every other task.

At 8, 16, and 32 processes:

sierra4358{dahn}29: cat pmi.n8.N4.out | grep ^0: | grep KVS_Get | wc
     19     103    1159
sierra4358{dahn}30: 
sierra4358{dahn}30: cat pmi.n16.N4.out | grep ^0: | grep KVS_Get | wc
     35     199    2221
sierra4358{dahn}31: 
sierra4358{dahn}31: cat pmi.n32.N4.out | grep ^0: | grep KVS_Get | wc
     67     391    4367

This means that at 160 processes there will be a total of 160*159 = 25,440 PMI_KVS_Gets... I have a feeling this will be pretty painful at larger scales...
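
For reference, the classic PMI-1 wire-up exchange behind this pattern looks roughly like the sketch below. This is illustrative only: the key names and buffer sizes are invented and it is not MVAPICH's actual code (MVAPICH appears to fetch roughly two keys per remote rank, which would be consistent with the counts in the traces above):

/* Illustrative PMI-1 wire-up sketch, not MVAPICH source.  Each rank
 * publishes one key, then fetches every other rank's key, so each task
 * issues O(nprocs) PMI_KVS_Get calls.  Real code sizes its buffers with
 * PMI_KVS_Get_name_length_max() and friends; fixed sizes keep the sketch
 * short. */
#include <stdio.h>
#include <pmi.h>

int main(void)
{
  int spawned, rank, size;
  char kvsname[256], key[64], val[256];

  PMI_Init(&spawned);
  PMI_Get_rank(&rank);
  PMI_Get_size(&size);
  PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

  /* 1. publish this rank's "business card" (endpoint address) */
  snprintf(key, sizeof(key), "addr-%d", rank);
  snprintf(val, sizeof(val), "endpoint-of-rank-%d", rank);
  PMI_KVS_Put(kvsname, key, val);
  PMI_KVS_Commit(kvsname);

  /* 2. wait until all ranks have published */
  PMI_Barrier();

  /* 3. fetch everyone else's card: size-1 Gets per task */
  for (int i = 0; i < size; i++) {
    if (i == rank)
      continue;
    snprintf(key, sizeof(key), "addr-%d", i);
    PMI_KVS_Get(kvsname, key, val, sizeof(val));
  }

  PMI_Finalize();
  return 0;
}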

@adammoody, does MVAPICH with rsh launching use a different kind of PMI pattern?

dongahn commented 6 years ago

Maybe I should also look at MPI_Init of Spectrum MPI (with yalla, of course). Stay tuned.

adammoody commented 6 years ago

That seems about right.

Below a certain scale, each process does one Get to lookup the address of each other process during init. Above that scale, the process only calls Get on the first Send.

I think there is another Get for each process to identify whether all procs have the same CPU and network card. You can likely avoid those by setting the following:

MV2_ON_DEMAND_THRESHOLD=0 (forces on demand at all process counts)
MV2_HOMOGENEOUS_CLUSTER=1 (eliminates need to check CPU and network card)

I still don't understand why my_local_id=-1. Disabling the hwloc broadcast could introduce overhead, especially when running lots of procs on each node.

adammoody commented 6 years ago

Most PMI implementations execute an allgather of the entire set of key/values to each PMI server, and the Get calls are spread over multiple servers. I'm guessing flux does the same.

grondo commented 6 years ago

Hm, the pmi simple server is running in wrexecd, which does its kvs_put and kvs_get synchronously. I wonder if that might explain a bit of poor scaling as the number of local processes increases. Maybe some low hanging fruit there...

dongahn commented 6 years ago

Hm, the pmi simple server is running in wrexecd, which does its kvs_put and kvs_get synchronously. I wonder if that might explain a bit of poor scaling as the number of local processes increases. Maybe some low hanging fruit there...

Yeah. I think that sounds right. Let me do a few more tests and maybe we can develop a good understanding of this problem.

dongahn commented 6 years ago

MV2_ON_DEMAND_THRESHOLD=0 (forces on demand at all process counts)
MV2_HOMOGENEOUS_CLUSTER=1 (eliminates need to check CPU and network card)

They don't seem to help. Still the same PMI_KVS_Get count.

sierra4371{dahn}44: env FLUX_PMI_DEBUG=1 PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 8 -N 4 bash -c 'unset PMI_LIBRARY; export MV2_ON_DEMAND_THRESHOLD=0; export MV2_HOMOGENEOUS_CLUSTER=1; virtual_ring_mpi'

sierra4371{dahn}43: cat out | grep ^0: | grep KVS_Get | wc
     19     103    1166
grondo commented 6 years ago

The PMI data is stored in the kvs under lwj.x.y.z.pmi if that helps.

I was playing with mvapich2 on ipa and got a strange result (not sure if it is even relevant): I don't know why the amount of PMI data in the kvs shrinks at 96 tasks (possibly user error).

Another thing to note here is that I'm fairly confident there is no scaling problem with fork. Note the time to the "running" state (which is only reached after all tasks have been forked and a global barrier completes).

$ srun --mpibind=off -p p100 -N2 --pty /g/g0/grondo/flux/bin/flux start
 for i in 16 32 64 96 112 128 150; do flux wreckrun -I -n $i ./hello; done
0: completed MPI_Init in 0.557s.  There are 16 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.034s
0: completed MPI_Init in 1.247s.  There are 32 tasks
0: completed first barrier in 0.007s
0: completed MPI_Finalize in 0.106s
0: completed MPI_Init in 4.264s.  There are 64 tasks
0: completed first barrier in 0.005s
0: completed MPI_Finalize in 0.307s
0: completed MPI_Init in 5.509s.  There are 96 tasks
0: completed first barrier in 0.029s
0: completed MPI_Finalize in 0.100s
0: completed MPI_Init in 6.699s.  There are 112 tasks
0: completed first barrier in 0.020s
0: completed MPI_Finalize in 0.097s
0: completed MPI_Init in 7.902s.  There are 128 tasks
0: completed first barrier in 0.187s
0: completed MPI_Finalize in 0.117s
0: completed MPI_Init in 11.275s.  There are 150 tasks
0: completed first barrier in 0.018s
0: completed MPI_Finalize in 0.219s
(flux-EMYtCD) grondo@ipa4:~/git/flux-core.git/t/mpi$ flux wreck timing
    ID       NTASKS     STARTING      RUNNING     COMPLETE        TOTAL
     1           16       0.033s       0.044s       0.871s       0.915s
     2           32       0.024s       0.036s       1.412s       1.447s
     3           64       0.035s       0.044s       4.722s       4.765s
     4           96       0.049s       0.056s       5.766s       5.822s
     5          112       0.049s       0.061s       7.027s       7.089s
     6          128       0.052s       0.069s       8.422s       8.491s
     7          150       0.058s       0.079s      11.754s      11.833s
$ for i in 1 2 3 4 5 6 7; do echo $(flux kvs get lwj.0.0.${i}.ntasks) tasks: $(flux kvs dir -R lwj.0.0.${i}.pmi | wc -l); done
16 tasks: 513
32 tasks: 2049
64 tasks: 8193
96 tasks: 193
112 tasks: 225
128 tasks: 257
150 tasks: 301
dongahn commented 6 years ago

The PMI data is stored in the kvs under lwj.x.y.z.pmi if that helps.

Question: if the number of PMI keys grows too large, will we suffer from large-directory performance degradation?

grondo commented 6 years ago

Question: if the number of PMI keys grows too large, will we suffer from large-directory performance degradation?

Yes I think so.

grondo commented 6 years ago

Yes I think so.

I take that back. I can't remember the details very well at the moment, but the impact may be reduced for the PMI case because I assume the KVS puts are collected (mostly) under one commit.

dongahn commented 6 years ago

It does look like Spectrum MPI has a similar pattern, although the counts are slightly lower:

sierra4371{dahn}47: cat 8tasks.out | grep ^0: | grep KVS_Get | wc
     22     113    2880

sierra4370{dahn}41: cat 64.tasks.out | grep ^0: | grep KVS_Get | wc
     92     589   19224

And the MPI_Init time for 160 tasks across 4 nodes is 6.571470 seconds, which is ~2.5x faster than that of MVAPICH.

From @adammoody's comment:

Below a certain scale, each process does one Get to lookup the address of each other process during init. Above that scale, the process only calls Get on the first Send.

Maybe this overhead isn't a big deal at larger scale. I should run this at larger scale and see if the overhead is tolerable for @koning's current use case.

dongahn commented 6 years ago

OK. Maybe this problem isn't so bad. MPI_Init at 1024 tasks over 256 nodes is only 6.3 seconds.

sierra4368{dahn}24: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 256 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 1024 -N 256 bash -c 'unset PMI_LIBRARY; virtual_ring_mpi'
[sierra1210:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 6.288508
size: 1024
rcvbuf: 1023
0.07user 0.02system 1:19.59elapsed 0%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1802minor)pagefaults 0swaps
dongahn commented 6 years ago

MPI_Init at 1440 tasks across 360 nodes is 8.7 secs.

sierra4369{dahn}23: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 360 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 1440 -N 360 bash -c 'unset PMI_LIBRARY; virtual_ring_mpi'
MPI_Init time is 8.669813
size: 1440
rcvbuf: 1439
0.08user 0.01system 2:15.39elapsed 0%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1805minor)pagefaults 0swaps

@adammoody: Please let us know whether you recommend that @koning and the team test LBANN with this version of MVAPICH, or whether you feel we need to investigate the "my_local_id: -1" problem.

dongahn commented 6 years ago

Final data point for tonight:

sierra4369{dahn}26: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 360 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 1280 -N 40 bash -c 'unset PMI_LIBRARY; virtual_ring_mpi'
[sierra1210:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast

I waited a while for this case, and it still has not finished.

This seems to indicate that there is still a legitimate performance problem, either in Flux or in MVAPICH, for the fully populated launch. The 'my_local_id: -1' warning could be part of this, but given our observations with other MPIs, a big part of it could be Flux.

Maybe an investigation is needed for our next execution system.

grondo commented 6 years ago

Unfortunately there is no quick fix for the PMI implementation in wreck. The PMI simple_server.c code will need to be reworked to allow an implementor to use asynchronous calls.

@dongahn, based on my testing, @trws' previous observations, and your experiments, I feel confident that the task-per-node scaling problem is due to Flux serializing the simple PMI KVS operations for all tasks on a node.
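
To sketch what that rework might look like, here is a rough illustration of a blocking KVS lookup versus a continuation-based one. It uses current libflux API names purely for illustration; the wreck-era kvs_get()/kvs_put() calls are older, and error handling is abbreviated:

/* Sketch only: blocking vs. asynchronous KVS lookup from a reactor-based
 * PMI server. */
#include <flux/core.h>
#include <stdlib.h>
#include <string.h>

/* Blocking style: the reactor (and every other local PMI client) stalls
 * until this lookup's reply arrives. */
static char *get_blocking(flux_t *h, const char *key)
{
    flux_future_t *f = flux_kvs_lookup(h, NULL, 0, key);
    const char *val;
    char *copy = NULL;

    if (f && flux_kvs_lookup_get(f, &val) == 0)  /* waits for the reply */
        copy = strdup(val);
    flux_future_destroy(f);
    return copy;                                 /* caller frees */
}

/* Asynchronous style: register a continuation and return to the reactor,
 * so other clients' PMI requests can be serviced while this lookup is
 * still in flight. */
static void lookup_continuation(flux_future_t *f, void *arg)
{
    const char *val;

    if (flux_kvs_lookup_get(f, &val) == 0) {
        /* ... format and send the PMI response for this client ... */
    }
    flux_future_destroy(f);
}

static int get_async(flux_t *h, const char *key, void *client)
{
    flux_future_t *f = flux_kvs_lookup(h, NULL, 0, key);

    if (!f || flux_future_then(f, -1., lookup_continuation, client) < 0) {
        flux_future_destroy(f);
        return -1;
    }
    return 0;
}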

dongahn commented 6 years ago

Thanks. So that I understand: when you have the same number of tasks spread across a larger number of nodes, why is this faster? Because the kvs get is satisfied by the slave cache? And/or because it gives more opportunities for the kvs master to optimize?

trws commented 6 years ago

If I'm understanding correctly, it's that there is less contention on serialized synchronous KVS operations on each of the brokers running jobs. I hadn't dug in far enough to see that, but it would make a lot of sense for that to be the problem; it's the same issue we were having with synchronous RPCs in other parts of the system when getting the pilot2 stuff to scale.

grondo commented 6 years ago

Yes, @trws, that is my current theory. All tasks on a rank go through one wrexecd reactor for all PMI operations, and this implementation uses synchronous kvs_put and kvs_get. Spreading tasks across more wrexecds reduces the number of tasks a single PMI server has to handle. However it also spreads the kvs load across more caches, so there could be two factors here.

dongahn commented 6 years ago

Yes, @trws, that is my current theory. All tasks on a rank go through one wrexecd reactor for all PMI operations, and this implementation uses synchronous kvs_put and kvs_get. Spreading tasks across more wrexecds reduces the number of tasks a single PMI server has to handle. However it also spreads the kvs load across more caches, so there could be two factors here.

One thing that is still bugging me: if this is the case, the degradation should roughly be single-broker serialization plus some scaling cost. While per-wrexecd serialization is bad, the wrexecds run in parallel. And the scaling cost doesn't seem that bad when each node is lightly populated.

But last night, flux wreckrun -n 1280 -N 40 didn't even finish in ~20 minutes.

I might spend a little more time to collect more data.

@grondo: do you know how to redirect output from FLUX_PMI_DEBUG into a file?

grondo commented 6 years ago

Hmm, let me think about that one..

grondo commented 6 years ago

@grondo: do you know how to redirect output from FLUX_PMI_DEBUG into a file?

This works for me (you have to redirect to a file per task unfortunately)

$ flux wreckrun -n2 sh -c 'FLUX_PMI_DEBUG=1 ./hello 2>pmi_debug.$FLUX_TASK_RANK'
0: completed MPI_Init in 0.212s.  There are 2 tasks
0: completed first barrier in 0.002s
0: completed MPI_Finalize in 0.007s
$ wc -l pmi_debug.0
37 pmi_debug.0
$ wc -l pmi_debug.1
37 pmi_debug.1
dongahn commented 6 years ago

One thing that is still bugging me: if this is the case, the degradation should roughly be single-broker serialization plus some scaling cost. While per-wrexecd serialization is bad, the wrexecds run in parallel. And the scaling cost doesn't seem that bad when each node is lightly populated.

Now that I think about it, there is also a scale factor in per-wrexecd PMI, since the pattern is all-to-all!

trws commented 6 years ago

Right, also we have some kind of performance cliff when you cross ~1500 tasks however they're distributed. A more thorough scaling study might help nail some of this down, but we may be seeing two or more different limits being hit.

dongahn commented 6 years ago

Right, also we have some kind of performance cliff when you cross ~1500 tasks however they're distributed. A more thorough scaling study might help nail some of this down, but we may be seeing two or more different limits being hit.

I am also curious whether we could get some speedup if we redirect PMI to use IBM's PMIx server.

trws commented 6 years ago

How do you mean? This is something we could possibly do, but their orted processes would have to wire up to provide it, so unless we worked out a way to launch them in a persistent mode along with the Flux instance and then pass them the per-job data, it would likely be at least a bit of a slowdown.

adammoody commented 6 years ago

On a related note regarding using IBM's PMIx server, I emailed Dave Solt last night inquiring about support for PMI and PMI2. It sounds like that should work, however, I can't find either of those on our systems right now. I think the headers/libs aren't installed for some reason. It'd be nice to have that as we have some software that uses PMI, but not PMIx, and it'd be nice to launch such software via jsrun.

I don't think MVAPICH supports PMIx just yet, although Open MPI does.

dongahn commented 6 years ago

This path may not work. But I am launching my flux as

PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 128 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start

And /usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so is a PMI that is built against PMIX. I am wondering if I can redirect the target jobs' PMI to this pmi instead of flux's pmi. May not work; but if this works, we will have some data points to compare against.

dongahn commented 6 years ago

On a related note regarding using IBM's PMIx server, I emailed Dave Solt last night inquiring about support for PMI and PMI2. It sounds like that should work, however, I can't find either of those on our systems right now. I think the headers/libs aren't installed for some reason. It'd be nice to have that as we have some software that uses PMI, but not PMIx, and it'd be nice to launch such software via jsrun.

/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so is the one that I built against PMIx, so you should be able to launch MVAPICH this way with jsrun too, the same way we launch Flux.

grondo commented 6 years ago

As an experiment I split the pmi "simple server" kvs_get calls to allow the implementor to use an async version. I'm seeing a modest improvement (maybe 20%) compared to the initial test above, but it doesn't seem to be the whole story (I could try the same thing with the puts)

$ for i in 16 32 64 96 112 128 150; do flux wreckrun -I -n $i ./hello; done
0: completed MPI_Init in 0.545s.  There are 16 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.033s
0: completed MPI_Init in 1.008s.  There are 32 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.098s
0: completed MPI_Init in 2.939s.  There are 64 tasks
0: completed first barrier in 0.032s
0: completed MPI_Finalize in 0.275s
0: completed MPI_Init in 4.287s.  There are 96 tasks
0: completed first barrier in 0.317s
0: completed MPI_Finalize in 0.074s
0: completed MPI_Init in 5.147s.  There are 112 tasks
0: completed first barrier in 0.044s
0: completed MPI_Finalize in 0.083s
0: completed MPI_Init in 6.300s.  There are 128 tasks
0: completed first barrier in 0.149s
0: completed MPI_Finalize in 0.110s
0: completed MPI_Init in 8.508s.  There are 150 tasks
0: completed first barrier in 0.032s
0: completed MPI_Finalize in 0.156s
dongahn commented 6 years ago

Is "hello" a MVAPICH program? How many nodes are you using?

grondo commented 6 years ago

Sorry, this is the same test as above, 2 nodes on IPA with 36 cores each:

$ flux hwloc info
2 Machines, 72 Cores, 144 PUs

hello is the MPI program from t/mpi/hello.c, compiled with mpicc from mvapich2:

$ ldd ./hello| grep mvapic
    libmpi.so.12 => /usr/tce/packages/mvapich2/mvapich2-2.2-intel-18.0.1/lib/libmpi.so.12 (0x00002aaaaaccf000)

Oh, I guess it is using the intel compiler version, but that should be irrelevant.

Here's the results from above without simple PMI mods:

0: completed MPI_Init in 0.557s.  There are 16 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.034s
0: completed MPI_Init in 1.247s.  There are 32 tasks
0: completed first barrier in 0.007s
0: completed MPI_Finalize in 0.106s
0: completed MPI_Init in 4.264s.  There are 64 tasks
0: completed first barrier in 0.005s
0: completed MPI_Finalize in 0.307s
0: completed MPI_Init in 5.509s.  There are 96 tasks
0: completed first barrier in 0.029s
0: completed MPI_Finalize in 0.100s
0: completed MPI_Init in 6.699s.  There are 112 tasks
0: completed first barrier in 0.020s
0: completed MPI_Finalize in 0.097s
0: completed MPI_Init in 7.902s.  There are 128 tasks
0: completed first barrier in 0.187s
0: completed MPI_Finalize in 0.117s
0: completed MPI_Init in 11.275s.  There are 150 tasks
0: completed first barrier in 0.018s
0: completed MPI_Finalize in 0.219s
grondo commented 6 years ago

It's in a pretty rough state right now, but @dongahn, if you want to try out these changes, they are parked in my pmi-async-kvs branch.

dongahn commented 6 years ago

OK. It looks to me like the per-wrexecd serialization of PMI is indeed part of the bottleneck:

#!/usr/bin/bash

ml hwloc/1.11.10-cuda
ml cuda/9.1.85

for n in 256 128 64 32; do
echo "1280 tasks over $n nodes: Flux PMI"
PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 256 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start bash -c "unset PMI_LIBRARY; flux wreckrun -n 1280 -N $n virtual_ring_mpi"
done
1280 tasks over 256 nodes: Flux PMI
MPI_Init time is 8.658265

1280 tasks over 128 nodes: Flux PMI
MPI_Init time is 14.018319

1280 tasks over 64 nodes: Flux PMI
MPI_Init time is 27.623413

1280 tasks over 32 nodes: Flux PMI
MPI_Init time is 54.769029

Keeping the total number of tasks constant, I decrease the node count until each node is fully populated (40 tasks x 32 nodes = 1280 tasks).

Each time, the MPI_Init() time roughly doubles. I think that makes sense because at each decrement the number of PMI_KVS_Get() calls handled on each node doubles. For example, at 1280 tasks on 32 nodes, each node will handle O(40 x 2 x 1280) Gets.

If this trend holds, at 10240 tasks (fully populating 256 nodes at 40 tasks each), MPI_Init will take over 7 minutes, unless 10240 is a scale at which MVAPICH's initialization protocol switches to lazy PMI calls (on the first Send).
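
A back-of-the-envelope model of the per-node Get load (assuming roughly 2 Gets per job task for each locally hosted task, consistent with the numbers above) makes the doubling and the extrapolation explicit:

/* Rough model only: each of the T/N local tasks does ~2 Gets per task in
 * the job, so a single node's PMI server handles ~ (T/N) * 2 * T Gets. */
#include <stdio.h>

int main(void)
{
    const long total = 1280;
    const long nodes[] = { 256, 128, 64, 32 };

    for (int i = 0; i < 4; i++) {
        long per_node = total / nodes[i];
        printf("%ld tasks over %3ld nodes: ~%ld Gets per node\n",
               total, nodes[i], per_node * 2 * total);
    }
    /* Same 40-tasks-per-node density at larger scale */
    printf("10240 tasks over 256 nodes: ~%ld Gets per node\n",
           (10240L / 256) * 2 * 10240);
    return 0;
}

That puts the 10240-task case at about 8x the per-node Get count of the 1280-task/32-node run, so if MPI_Init time scales with it, 8 x ~55 s comes out a bit over 7 minutes.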

@grondo: So definitely worth trying your change here.

grondo commented 6 years ago

BTW, looking at the simple PMI implementation, kvs_puts are queued into a kvs transaction and committed globally to the KVS under a single fence (well, when the PMI client calls commit). The put case seems to be pretty well optimized, and as long as there are few commits, the directory-size issue should be minimal, I would think (IIUC).

dongahn commented 6 years ago

Sounds good. And I don't believe put operations are on the critical path anyway. Each process will do O(1) puts, but Gets are 2 x N.

dongahn commented 6 years ago

It's in a pretty rough state right now, but @dongahn if you want to try out these changes they are parked in my pmi-async-kvs branch.

Hmmm. I tried to run this with the same test as above. It seems this version just hangs...

sierra4368{dahn}27: ./test.sh

The following have been reloaded with a version change:
  1) cuda/9.2.88 => cuda/9.1.85

1280 tasks over 256 nodes: Flux PMI
[sierra1190:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
#!/usr/bin/bash

ml hwloc/1.11.10-cuda
ml cuda/9.1.85

#FLUX_PATH=/usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux
FLUX_PATH=/nfs/tmp2/dahn/PMI_IMPROVE/bin/flux

for n in 256 128 64 32; do
echo "1280 tasks over $n nodes: Flux PMI"
PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 256 $FLUX_PATH start bash -c "unset PMI_LIBRARY; flux wreckrun -n 1280 -N $n virtual_ring_mpi"
done
grondo commented 6 years ago

Ok, sorry. Can you launch any size MVAPICH jobs using that branch? Maybe I should try a fresh checkout.

grondo commented 6 years ago

I can't reproduce any issue @dongahn, after cloning a fresh checkout of this branch. Any debug output would be helpful (also a sanity check that even a 2 task launch works).

Thanks!

dongahn commented 6 years ago

Small scales seem to work.

sierra4370{dahn}27: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 /nfs/tmp2/dahn/PMI_IMPROVE/bin/flux start bash -c "unset PMI_LIBRARY; flux wreckrun -n 2 -N 2 virtual_ring_mpi"
[sierra1894:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 0.332744
size: 2
rcvbuf: 1

sierra4370{dahn}28: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 /nfs/tmp2/dahn/PMI_IMPROVE/bin/flux start bash -c "unset PMI_LIBRARY; flux wreckrun -n 80 -N 2 virtual_ring_mpi"
[sierra1894:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 10.470234
size: 80
rcvbuf: 79
0.09user 0.00system 0:15.57elapsed 0%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1804minor)pagefaults 0swaps
grondo commented 6 years ago

bummer. I guess I should have expected it wouldn't be that easy. I'll keep trying to see if I can reproduce at some scale.

dongahn commented 6 years ago
sierra4367{dahn}24: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 128 /nfs/tmp2/dahn/PMI_IMPROVE/bin/flux start bash -c "unset PMI_LIBRARY; flux wreckrun -n 128 -N 128 virtual_ring_mpi"
[sierra1390:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast

I wanted to see if the previous hang was transient. But this one seems to take a long time too.

grondo commented 6 years ago

1 task per node fails? That is strange. Are there any other differences from the previous version of flux-core you were using? I wonder if I introduced a race condition. Could you run one of the smaller problematic jobs with -o trace-pmi-server? (It might be too much output; never mind if so, and I'll just continue trying to reproduce the problem.)

dongahn commented 6 years ago

Seems like the hangs start at 52 nodes! (flux wreckrun -n 52 -N 52 virtual_ring_mpi). -n 51 -N 51 is okay. Not sure if this is helpful for you.