flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Excessive launch time on MVAPICH on Sierra #1606

Closed dongahn closed 6 years ago

dongahn commented 6 years ago

In support of @koning, Adam Moody and I are looking at ways to support CUDA-aware MPI with MVAPICH on Sierra, and we are seeing excessive launch times at 160 MPI processes.

sierra4360{dahn}51: ml cuda/9.1.85
sierra4360{dahn}52: mvapich2-2.3/install-gnu-cuda-opt-sierra-flux/bin/mpicc -cc=xlc -g -O0 virtual_ring_mpi.c -o virtual_ring_mpi
sierra4360{dahn}58: bsub -nnodes 4 -Is -XF -G guests /usr/bin/tcsh
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on sierra4371>>
sierra4371{dahn}23: ml hwloc/1.11.10-cuda
sierra4371{dahn}24: ml cuda/9.1.85

sierra4371{dahn}31: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 16 -N 4 -c 10 -g 1 virtual_ring_mpi
[sierra1279:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
rcvbuf: 15
0.07user 0.01system 0:05.61elapsed 1%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1803minor)pagefaults 0swaps

sierra4371{dahn}32: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 16 -N 4 -c 10 virtual_ring_mpi
[sierra1279:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
rcvbuf: 15
0.08user 0.01system 0:05.64elapsed 1%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1803minor)pagefaults 0swaps

sierra4371{dahn}33: env PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 4 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start flux wreckrun -n 160 -N 4 virtual_ring_mpi
[sierra1279:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
rcvbuf: 159
0.08user 0.01system 0:23.30elapsed 0%CPU (0avgtext+0avgdata 10560maxresident)k
0inputs+128outputs (0major+1803minor)pagefaults 0swaps

Test code:

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <signal.h>
#include <sys/time.h>
#define COMM_TAG 1000

/* Return the elapsed time in seconds between tv1 and tv2. */
static double get_secs(struct timeval* tv2, struct timeval* tv1)
{
  struct timeval result;
  timersub(tv2, tv1, &result);
  return (double) result.tv_sec + (double) result.tv_usec / 1000000.0;
}

/* Each rank sends its own rank to the next rank in a ring and receives from
 * the previous one.  The MPI_MAX allreduce of the ranks (printed as "rcvbuf")
 * should equal size - 1. */
void pass_its_neighbor(const int rank, const int size, int* buf)
{
  MPI_Request request[2];
  MPI_Status status[2];
  int rcvbuf;

  MPI_Irecv((void*)buf, 1, MPI_INT, ((rank+size-1)%size), COMM_TAG, MPI_COMM_WORLD, &request[0]);
  MPI_Isend((void*)&rank, 1, MPI_INT, ((rank+1)%size), COMM_TAG, MPI_COMM_WORLD, &request[1]);
  MPI_Waitall(2, request, status);
  MPI_Allreduce((void *) &rank, (void *) &rcvbuf, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

  if (rank==0) {
    fprintf(stdout, "size: %d \n", size);
    fprintf(stdout, "rcvbuf: %d \n", rcvbuf);
  }

  MPI_Barrier(MPI_COMM_WORLD);
}

int main(int argc, char* argv[])
{

  int size, rank;
  int *buf = (int*) malloc(sizeof(int));
  struct timeval start, end;

  gettimeofday (&start, NULL);
  MPI_Init(&argc, &argv);
  gettimeofday (&end, NULL);
  double elapse = get_secs (&end, &start);   /* per-rank MPI_Init time */
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double elapse_max;
  /* Report the slowest rank's MPI_Init time. */
  MPI_Allreduce (&elapse, &elapse_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

  if (rank == 0)
    printf ("MPI_Init time is %f\n", elapse_max);

  *buf=rank; /* we only pass around rank*/
  pass_its_neighbor(rank, size, buf);
  free (buf);

  MPI_Finalize();

  return 0;
}
dongahn commented 6 years ago

with -o trace-pmi-server?

Lots of output, but here are the last few lines:

2018-07-28T00:07:46.196247Z job.err[1]: job1: wrexecd says: 1: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196439Z job.err[4]: job1: wrexecd says: 4: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196479Z job.err[3]: job1: wrexecd says: 3: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196618Z job.err[9]: job1: wrexecd says: 9: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196615Z job.err[8]: job1: wrexecd says: 8: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196639Z job.err[10]: job1: wrexecd says: 10: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196689Z job.err[7]: job1: wrexecd says: 7: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196778Z job.err[15]: job1: wrexecd says: 15: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196844Z job.err[16]: job1: wrexecd says: 16: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196886Z job.err[17]: job1: wrexecd says: 17: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196897Z job.err[22]: job1: wrexecd says: 22: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196886Z job.err[18]: job1: wrexecd says: 18: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196911Z job.err[20]: job1: wrexecd says: 20: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197004Z job.err[37]: job1: wrexecd says: 37: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196895Z job.err[21]: job1: wrexecd says: 21: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197018Z job.err[31]: job1: wrexecd says: 31: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196877Z job.err[19]: job1: wrexecd says: 19: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197053Z job.err[36]: job1: wrexecd says: 36: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197019Z job.err[45]: job1: wrexecd says: 45: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196996Z job.err[32]: job1: wrexecd says: 32: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197045Z job.err[40]: job1: wrexecd says: 40: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197078Z job.err[38]: job1: wrexecd says: 38: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197022Z job.err[44]: job1: wrexecd says: 44: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197077Z job.err[33]: job1: wrexecd says: 33: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197040Z job.err[39]: job1: wrexecd says: 39: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197068Z job.err[35]: job1: wrexecd says: 35: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197038Z job.err[46]: job1: wrexecd says: 46: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197051Z job.err[34]: job1: wrexecd says: 34: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197035Z job.err[42]: job1: wrexecd says: 42: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197118Z job.err[43]: job1: wrexecd says: 43: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197049Z job.err[41]: job1: wrexecd says: 41: S: cmd=barrier_out rc=0
dongahn commented 6 years ago

1 task per node fails? That is strange. Are there no other differences from the previous version of flux-core you were using?

There are of course lots of changes in flux-core master, so it is not clear whether this problem is due to your change or to other changes so far.

grondo commented 6 years ago

@dongahn, I don't want to waste your time on this, but I can't reproduce it even with -N128 -n128!

Maybe I will need to try the exact MVAPICH version you are using. Just to verify there are no bad nodes: if you switch back to flux-core master, does everything work fine?

I think the "barrier_out" command is indicating the PMI barrier has been reached and all tasks have joined. I didn't touch the barrier code, so I'm not sure what is going on here.
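
For context, the barrier_in/barrier_out lines in the trace are the PMI-1 wire protocol barrier: each task writes cmd=barrier_in to its local PMI server and blocks until the server answers cmd=barrier_out rc=0 once every task has checked in. Below is a minimal client-side sketch of where that barrier sits (illustrative only, assuming the standard PMI-1 API from pmi.h; the real calls happen inside MVAPICH during MPI_Init):

/* Illustrative PMI-1 client sketch: each PMI_Barrier() call corresponds to
 * one cmd=barrier_in request and one cmd=barrier_out rc=0 reply in the
 * trace-pmi-server output above. */
#include <stdio.h>
#include <pmi.h>   /* PMI-1 API provided by libpmi */

int main(void)
{
  int spawned, size, rank;

  PMI_Init(&spawned);       /* connect to the local PMI server         */
  PMI_Get_size(&size);      /* job size as published by the launcher   */
  PMI_Get_rank(&rank);
  /* ... PMI_KVS_Put / PMI_KVS_Commit of endpoint info goes here ...   */
  PMI_Barrier();            /* server logs barrier_in / barrier_out    */
  /* ... PMI_KVS_Get of other ranks' endpoint info goes here ...       */
  PMI_Finalize();

  printf("rank %d of %d passed the PMI barrier\n", rank, size);
  return 0;
}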

dongahn commented 6 years ago

I think the "barrier_out" command is indicating the PMI barrier has been reached and all tasks have joined. I didn't touch the barrier code, so I'm not sure what is going on here.

Maybe there is a regression in our 0.10 target... Uh oh.

grondo commented 6 years ago

There are of course lots of changes in flux-core master, so it is not clear whether this problem is due to your change or to other changes so far.

Hm, we should definitely sanity check v0.10.0 with this MVAPICH.

The PMI code hasn't changed in a while, so if you want you could also try applying the one commit on my pmi-async-kvs branch directly to whatever working copy you are using. But let's make sure master works too.

grondo commented 6 years ago

On the pmi_client side there were some recent changes, though... what revision have you been working with?

grondo commented 6 years ago

I have to drop offline for the evening, but let me know how else I can help.

dongahn commented 6 years ago

No confirmed hang at 1a5fa1cb684a9c1c2bfbd579f802eab60a0174a3

grondo commented 6 years ago

One last thing: I was able to sanity check -N128 up to (but not including) -n512 on IPA with whatever mvapich2 is in /usr/tce. At 512 tasks the MPI program core-dumped in MPIDI_CH3I_SMP_init (possibly from oversubscribing the nodes too much).

I think this issue is probably tickled by whatever PMI difference is in the mvapich version you are using on Sierra.

grondo commented 6 years ago

pmi client changes landed in 03a1ac97fdf12fc7c0763a1481b30a3316c9b308 (however, I would be surprised if they caused the hang. I apologize for wasting a bunch of your time if the hang is caused by my test patch!!)

dongahn commented 6 years ago

Current master hangs too, starting from -n51 -N51.

adammoody commented 6 years ago

Can you tell whether they are doing lots of barrier/fence calls? At one time, they were doing this in some cases. I thought they removed it, but I can see some cases where it might still occur. I think I patched our 2.2 install to avoid that. I may need to apply that patch again if it's biting once more.
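
One way to check (a sketch under assumptions, not something that was run here): an LD_PRELOAD interposer that counts PMI_Barrier calls during startup. This assumes MVAPICH resolves PMI_Barrier dynamically from a shared libpmi; it will not see anything if the PMI client is compiled statically into libmpi.

/* Hypothetical diagnostic: count how many times the MPI library calls
 * PMI_Barrier.  Only works when PMI_Barrier is resolved from a shared
 * library at run time (LD_PRELOAD interposition).
 *
 * Build:  gcc -shared -fPIC -o libpmicount.so pmicount.c -ldl
 * Use:    LD_PRELOAD=./libpmicount.so virtual_ring_mpi
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static int barrier_calls = 0;

static void report(void)
{
  fprintf(stderr, "PMI_Barrier called %d times\n", barrier_calls);
}

int PMI_Barrier(void)
{
  static int (*real_barrier)(void) = NULL;

  if (!real_barrier) {
    real_barrier = (int (*)(void)) dlsym(RTLD_NEXT, "PMI_Barrier");
    if (!real_barrier) {
      fprintf(stderr, "no underlying PMI_Barrier found\n");
      return -1;   /* PMI_FAIL */
    }
    atexit(report);
  }
  barrier_calls++;
  return real_barrier();
}

A per-task count that grows with job size, rather than staying at a small constant, would point at the extra barrier/fence behavior described above.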

dongahn commented 6 years ago

I ran some tests with my installed version and a few commits between that version and the current master.

My observation is that 1) there may be a problem even with the installed version 1a5fa1c at some configurations:

PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 128 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start bash -c "unset PMI_LIBRARY; flux wreckrun -n 128 -N 128 virtual_ring_mpi"

currently hangs!

2) The hang may have nothing to do with flux but rather with problems in the system itself (other users are reporting such hangs).

If we determine 1) is our problem, we need to file a separate issue.

For this particular issue, though, there are some scales at which 1a5fa1c can launch reliably. I think it makes sense to run @grondo's async work at that scale to see if we can get some numbers.

In the meantime, @adammoody and I mapped out a plan to debug the "Invalid my_local_id" issue. He built a version of MVAPICH against my pmi4pmix library and we confirmed it can be launched with jsrun. Since we have good TotalView support with jsrun, he will debug this problem under the jsrun environment early next week.

dongahn commented 6 years ago

OK. "-n 1280 -N 128" seem to be a reliable scale and @grondo's branch also works here.

PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 128 /nfs/tmp2/dahn/PMI_IMPROVE/master/bin/flux start bash -c "unset PMI_LIBRARY; flux wreckrun -n 1280 -N 128 virtual_ring_mpi"
[sierra1190:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 11.451751
size: 1280
rcvbuf: 1279

Looks like the performance improvement is ~20%, similar to what @grondo saw. So I recommend this direction.

I am closing this issue in favor of the other individual tickets that have been created.