Closed by dongahn 6 years ago
with -o trace-pmi-server?
Lots of output. But the last few lines:
2018-07-28T00:07:46.196247Z job.err[1]: job1: wrexecd says: 1: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196439Z job.err[4]: job1: wrexecd says: 4: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196479Z job.err[3]: job1: wrexecd says: 3: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196618Z job.err[9]: job1: wrexecd says: 9: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196615Z job.err[8]: job1: wrexecd says: 8: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196639Z job.err[10]: job1: wrexecd says: 10: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196689Z job.err[7]: job1: wrexecd says: 7: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196778Z job.err[15]: job1: wrexecd says: 15: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196844Z job.err[16]: job1: wrexecd says: 16: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196886Z job.err[17]: job1: wrexecd says: 17: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196897Z job.err[22]: job1: wrexecd says: 22: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196886Z job.err[18]: job1: wrexecd says: 18: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196911Z job.err[20]: job1: wrexecd says: 20: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197004Z job.err[37]: job1: wrexecd says: 37: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196895Z job.err[21]: job1: wrexecd says: 21: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197018Z job.err[31]: job1: wrexecd says: 31: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196877Z job.err[19]: job1: wrexecd says: 19: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197053Z job.err[36]: job1: wrexecd says: 36: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197019Z job.err[45]: job1: wrexecd says: 45: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.196996Z job.err[32]: job1: wrexecd says: 32: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197045Z job.err[40]: job1: wrexecd says: 40: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197078Z job.err[38]: job1: wrexecd says: 38: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197022Z job.err[44]: job1: wrexecd says: 44: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197077Z job.err[33]: job1: wrexecd says: 33: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197040Z job.err[39]: job1: wrexecd says: 39: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197068Z job.err[35]: job1: wrexecd says: 35: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197038Z job.err[46]: job1: wrexecd says: 46: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197051Z job.err[34]: job1: wrexecd says: 34: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197035Z job.err[42]: job1: wrexecd says: 42: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197118Z job.err[43]: job1: wrexecd says: 43: S: cmd=barrier_out rc=0
2018-07-28T00:07:46.197049Z job.err[41]: job1: wrexecd says: 41: S: cmd=barrier_out rc=0
1 task per node fails? That is strange. No other differences from the previous version of flux-core you were using?
There are, of course, lots of changes in flux-core master, so it is not clear whether this problem is due to your change or to other changes so far.
@dongahn, don't want to waste your time on this, but I can't reproduce with even -N128 -n128!
Maybe I will need to try the exact MVAPICH version you are using. Just to verify no bad nodes, if you switch back to flux-core master, everything works fine?
I think the "barrier_out" command is indicating the PMI barrier has been reached and all tasks have joined. I didn't touch the barrier code, so I'm not sure what is going on here.
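For context, in the PMI-1 wire protocol each task sends a `cmd=barrier_in` line to the server, and the server replies to every task with `cmd=barrier_out rc=0` once all tasks have checked in, which is what the trace above shows. Here is a toy in-memory sketch of that exchange (not the actual wrexecd implementation, just an illustration of the protocol logic):

```python
# Toy model of a PMI-1 style barrier server (NOT the real wrexecd code).
# Each task sends "cmd=barrier_in"; once all expected tasks have checked in,
# the server answers every one of them with "cmd=barrier_out rc=0".

class PmiBarrierServer:
    def __init__(self, ntasks):
        self.ntasks = ntasks
        self.waiting = []          # ranks currently blocked in the barrier

    def handle(self, rank, line):
        """Process one protocol line from a task; return {rank: reply} once
        the barrier completes, else an empty dict (tasks stay blocked)."""
        assert line.strip() == "cmd=barrier_in"
        self.waiting.append(rank)
        if len(self.waiting) < self.ntasks:
            return {}              # still waiting for stragglers
        replies = {r: "cmd=barrier_out rc=0" for r in self.waiting}
        self.waiting = []          # barrier complete; reset for reuse
        return replies

server = PmiBarrierServer(ntasks=4)
for rank in range(3):
    assert server.handle(rank, "cmd=barrier_in") == {}  # barrier not yet full
out = server.handle(3, "cmd=barrier_in")                # last rank releases it
print(sorted(out))  # every rank now receives barrier_out
```

The point of the model is that `barrier_out rc=0` for all ranks means the barrier itself completed cleanly, so a hang after this point is likely elsewhere.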
> I think the "barrier_out" command is indicating the PMI barrier has been reached and all tasks have joined. I didn't touch the barrier code, so I'm not sure what is going on here.
Maybe there is a regression in our 0.10 target... Uh oh.
> There are of course lots of changes in the flux core master. So it is not clear if this problem is due to your change or other changes so far.
Hm, we should definitely sanity check v0.10.0 with this MVAPICH.
The PMI code hasn't changed in a while, so if you want you could also try applying the one commit on my pmi-async-kvs branch directly to whatever working copy you are using. But let's make sure master works too.
On the pmi_client side there were some recent changes, though. What revision have you been working with?
I have to drop offline for the evening, but let me know how else I can help.
No, confirmed hang at 1a5fa1cb684a9c1c2bfbd579f802eab60a0174a3.
One last thing: I was able to sanity check -N128 up to (but not including) -n512 on IPA with whatever mvapich2 is in /usr/tce. At 512 tasks the MPI program core dumped in MPIDI_CH3I_SMP_init
(possibly oversubscribing nodes too much)
I think this issue is probably tickled by whatever PMI difference is in the mvapich version you are using on Sierra.
pmi client changes landed in 03a1ac97fdf12fc7c0763a1481b30a3316c9b308 (however, I would be surprised if they caused the hang. I apologize for wasting a bunch of your time if the hang is caused by my test patch!!)
Current master hangs too starting from -n51 -N51.
Can you tell whether they are doing lots of barrier/fence calls? At one time, they were doing this in some cases. I thought they removed it, but I can see some cases where it might still occur. I think I patched our 2.2 install to avoid that. I may need to apply that patch again if it's biting once more.
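The pattern Adam describes is MPI ranks doing put/fence/get exchanges against the PMI KVS during MPI_Init; each extra global fence is a collective round trip, which multiplies startup cost at scale. A rough model of a "chatty" versus a batched exchange, with a counter standing in for the server round trips (the class and function names here are hypothetical, for illustration only):

```python
# Hypothetical model of PMI KVS traffic during MPI_Init: ranks publish
# their endpoint data, then read everyone else's. The expensive part at
# scale is each global barrier (fence), so that is what we count.

class KvsModel:
    def __init__(self):
        self.store = {}
        self.barriers = 0          # number of global fence rounds

    def put(self, key, val):
        self.store[key] = val

    def barrier(self):
        self.barriers += 1         # one collective round trip over all ranks

    def get(self, key):
        return self.store[key]

def chatty_init(kvs, nranks):
    # One fence per key exchanged -- the pattern worth patching out.
    for rank in range(nranks):
        kvs.put(f"addr.{rank}", rank)
        kvs.barrier()

def batched_init(kvs, nranks):
    # Publish everything first, then a single fence before any reads.
    for rank in range(nranks):
        kvs.put(f"addr.{rank}", rank)
    kvs.barrier()

a, b = KvsModel(), KvsModel()
chatty_init(a, 64)
batched_init(b, 64)
print(a.barriers, b.barriers)  # 64 fence rounds vs 1
```

If the MVAPICH build on Sierra has reverted to the chatty pattern, that alone could explain a large difference in launch time between PMI implementations.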
I ran some tests with my installed version and a few commits between that version and the current master.
My observation is that 1) there may be a problem even within the installed version 1a5fa1c at some configuration:
PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 128 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start bash -c 'unset PMI_LIBRARY; flux wreckrun -n 128 -N 128 virtual_ring_mpi'
currently hangs!
2) The hang may have nothing to do with Flux but rather with problems in the system itself. (Other users are reporting such hangs.)
If we determine that 1) is our problem, we need to file a separate issue.
For this particular issue, though, there are some scales 1a5fa1c can reliably launch. I think it makes sense to run @grondo's async work at that scale to see if we can get some numbers.
In the meantime, @adammoody and I mapped out a plan to debug the "Invalid my_local_id" issue. He built a version of MVAPICH against my pmi4pmix library, and we confirmed it can be launched with jsrun. Since we have good TotalView support with jsrun, he will debug this problem under the jsrun environment early next week.
OK. "-n 1280 -N 128" seems to be a reliable scale, and @grondo's branch also works here.
PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 128 /nfs/tmp2/dahn/PMI_IMPROVE/master/bin/flux start bash -c 'unset PMI_LIBRARY; flux wreckrun -n 1280 -N 128 virtual_ring_mpi'
[sierra1190:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 11.451751
size: 1280
rcvbuf: 1279
Looks like the performance improvement is ~20%, similar to what @grondo saw, so I recommend this direction.
I am closing this issue in favor of the other individual tickets created.
In support of @koning, Adam Moody and I are looking at ways to support CUDA-aware MPI with MVAPICH on Sierra, and we are seeing excessive launch times at 160 MPI processes.
Test code: