mathomp4 opened this issue 4 years ago

When trying to run C48 GEOSgcm with Intel 19.1.1 and MVAPICH2 2.3.4, if you enable history at all, the model will crash with a SIGSEGV whose traceback points at MPI_Put called from ServerThread.F90 (see the traces below).

I also tried running the MAPL unit tests with the develop branch and found that MAPL.pfio.tests prints "WARNING: no serverthread" and then it just locks up. But the "no serverthread" here and the ServerThread.F90 above... maybe they're related?

To use MVAPICH2, I have a g5_modules for it.
"WARNING: no serverthread" is fine. We do have case without serverthread
Oh. Huh. Well, it still seems to hang, so it's possible the test after that is the important one?
cmake failed for the g5_modules:

-- Could NOT find MPI_C (missing: MPI_C_WORKS)
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
-- Could NOT find MPI_Fortran (missing: MPI_Fortran_WORKS)
CMake Error at /gpfsm/dulocal/sles12/other/cmake/3.17.0/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
  Could NOT find MPI (missing: MPI_C_FOUND MPI_CXX_FOUND MPI_Fortran_FOUND)
Call Stack (most recent call first):
module list
Currently Loaded Modules:
  1) git/2.24.0                5) ImageMagick/7.0.9-16    9) mpi/mvapich2/2.3.4/intel-19.1.1.217-omnipath
  2) cmake/3.17.0              6) GEOSenv                10) python/GEOSpyD/Ana2019.10_py2.7
  3) other/manage_externals    7) comp/gcc/8.3.0
  4) other/mepo                8) comp/intel/19.1.1.217
@weiyuan-jiang I think your modules might be in the wrong order. At the least, you'll need the compiler and MPI bits in this order:
comp/gcc/8.3.0
comp/intel/19.1.1.217
mpi/mvapich2/2.3.4/intel-19.1.1.217-omnipath
python/GEOSpyD/Ana2019.10_py2.7
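As explicit commands, that ordering would look roughly like the sketch below. This is a minimal sketch: the module purge (starting from a clean shell) is my assumption, and the module names are just the ones from your listing above.

module purge
module load comp/gcc/8.3.0
module load comp/intel/19.1.1.217
module load mpi/mvapich2/2.3.4/intel-19.1.1.217-omnipath
module load python/GEOSpyD/Ana2019.10_py2.7
# plus the rest of the usual loads (git, cmake, other/mepo, GEOSenv, ...) in whatever order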
Welp, I just built MVAPICH2 2.3.6 for Intel 2021.2 (figured maybe I could try it out on SCU16 when it's around) and I get the same crash:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
GEOSgcm.x 00000000031A756A Unknown Unknown Unknown
libpthread-2.22.s 00002AAAC8D8EC10 Unknown Unknown Unknown
libmpi.so.12.1.1 00002AAAC825F14B MPIDI_CH3I_Put Unknown Unknown
libmpi.so.12.1.1 00002AAAC825E7D4 MPID_Put Unknown Unknown
libmpi.so.12.1.1 00002AAAC81CC3DA MPI_Put Unknown Unknown
libmpifort.so.12. 00002AAAC7CCFEC7 mpi_put_ Unknown Unknown
libMAPL.pfio.so 00002AAAC094A9B5 pfio_serverthread 906 ServerThread.F90
libMAPL.pfio.so 00002AAAC08E86B7 pfio_baseservermo 69 BaseServer.F90
libMAPL.pfio.so 00002AAAC094D38A pfio_serverthread 980 ServerThread.F90
libMAPL.pfio.so 00002AAAC08AD0D6 pfio_messagevisit 93 MessageVisitor.F90
The line number changed, but it's probably the same MPI_Put call.
I also tried the MAPL tests and, again, MAPL.pfio.tests "locks up". Running with -d:
Start: <Test_DirectoryService_suite.test_put_directory[npes=1][npes=1]>
. end: <Test_DirectoryService_suite.test_put_directory[npes=1][npes=1]>
Start: <Test_DirectoryService_suite.test_publish[npes=1][npes=1]>
. end: <Test_DirectoryService_suite.test_publish[npes=1][npes=1]>
Start: <Test_DirectoryService_suite.test_connect[npes=2][npes=2]>
. end: <Test_DirectoryService_suite.test_connect[npes=2][npes=2]>
Start: <Test_DirectoryService_suite.test_connect_swap_role[npes=2][npes=2]>
So it seems like test_connect_swap_role might be the "offending" test?
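For reference, this is roughly how the tests are being driven; the exact launch line isn't captured above, so the rank count here and the ctest alternative are assumptions:

# run the pfio test binary under MPI with the -d (debug) flag used above
mpirun -np 8 ./MAPL.pfio.tests -d

# or, if MAPL.pfio.tests is the registered ctest name:
ctest -R MAPL.pfio.tests --verbose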
If @weiyuan-jiang can propose any model-type tests to run, I can try that out. But at least c24 on one node has the MPI_Put crash.
I also have a g5_modules if @weiyuan-jiang or anyone else wants to try to do some testing:
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.2.4/g5_modules.intel2021_2_0.mv2_236_omnipath
Note that this ONLY works on Skylakes (built for Omni-Path).
Note 2. It looks like the best way to run with MVAPICH2 2.3.6 might be:
export MV2_ENABLE_AFFINITY=0
mpiexec.mpirun_rsh -export -np N ...
I think an equivalent one-liner is:
mpiexec.mpirun_rsh -np N MV2_ENABLE_AFFINITY=0 ...
but I haven't tested that...
Bluh, and of course it runs the Intel MPI Benchmarks for Put just fine. (At least RMA and EXT.)
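For the curious, the sanity check I mean is along these lines; the benchmark selections and rank count are just examples from the IMB-RMA and IMB-EXT suites, not necessarily the exact set I ran:

export MV2_ENABLE_AFFINITY=0
mpiexec.mpirun_rsh -export -np 2 ./IMB-RMA Unidir_put
mpiexec.mpirun_rsh -export -np 2 ./IMB-EXT Unidir_Put Accumulate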
I believe the problem is MPI_Put. There are two MPI_Put calls: one is in MpiLock (directory service), the other is in MpiServer. There is no MPI_Put call in the multigroup server. You can try setting up a multigroup server to see if it passes. There is conflicting information here: the program passes the DirectoryService (one MPI_Put call) but fails in the test.
Okay. I just tried it in my tester and it looks like you can't run multigroup with no IOnodes. Let me try getting an extra node here...
Ah. It works! Woo!
Things I just learned:

- --oserver_type multigroup without any ionodes (crash)
- --oserver_type multigroup without using --npes_backend_pernode N (crash)

Well, this at least gives @aoloso and myself a way to test things out. You have to do a few things to get it happy, but it doesn't seem to crash with multigroup.
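To be concrete, the kind of invocation that works looks roughly like the sketch below. Only --oserver_type and --npes_backend_pernode are the flags discussed above; the rank counts and the other flag spellings are assumptions for illustration, not a recipe:

export MV2_ENABLE_AFFINITY=0
# 96 model ranks plus one extra node of IO-server ranks (counts are made up)
mpiexec.mpirun_rsh -export -np 120 ./GEOSgcm.x \
    --npes_model 96 \
    --nodes_output_server 1 \
    --oserver_type multigroup \
    --npes_backend_pernode 5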
@weiyuan-jiang Do you have a way to maybe get it to work without IOnodes (for low-res runs)?
In any event it should at worst fail with a useful error message and clean termination. Is it giving a seg fault or something?
@mathomp4 I have been thinking about retiring MpiServer (the default when there are no IOnodes). It is possible, but may need more changes. I will give this issue a more serious look.
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.
I'm going to label this long-term as it's essentially unsolvable until we can get an MVAPICH2 person on Discover. And @tclune knows the pain of that. (Maybe we should add a "long term due to bureaucracy" tag... :( )
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.
Closing due to inactivity
I'm reopening and marking longterm. I will get MVAPICH2 to work again!