GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

MVAPICH2 crashes in ServerThread.F90 #502

Open mathomp4 opened 4 years ago

mathomp4 commented 4 years ago

When trying to run C48 GEOSgcm with Intel 19.1.1 and MVAPICH2 2.3.4, if you enable history at all, the model will crash with:

 forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
GEOSgcm.x          000000000596D54A  Unknown               Unknown  Unknown
libpthread-2.22.s  00002AAAB691BC10  Unknown               Unknown  Unknown
libmpi.so.12.1.1   00002AAAB5A57F3F  MPIDI_CH3I_Put        Unknown  Unknown
libmpi.so.12.1.1   00002AAAB5A575C4  MPID_Put              Unknown  Unknown
libmpi.so.12.1.1   00002AAAB59C51DA  MPI_Put               Unknown  Unknown
libmpifort.so.12.  00002AAAB52B1EB7  mpi_put_              Unknown  Unknown
GEOSgcm.x          00000000057A4C36  pfio_serverthread        1061  ServerThread.F90
GEOSgcm.x          000000000577A18A  pfio_baseservermo          69  BaseServer.F90
GEOSgcm.x          00000000057A7511  pfio_serverthread        1135  ServerThread.F90
GEOSgcm.x          0000000005765566  pfio_messagevisit          93  MessageVisitor.F90
GEOSgcm.x          000000000582C54E  pfio_abstractmess         110  AbstractMessage.F90
GEOSgcm.x          00000000057752D1  pfio_simplesocket         105  SimpleSocket.F90
GEOSgcm.x          00000000057B56AD  pfio_clientthread         428  ClientThread.F90
GEOSgcm.x          00000000057BDA01  pfio_clientmanage         340  ClientManager.F90
GEOSgcm.x          0000000005310638  mapl_historygridc        3083  MAPL_HistoryGridComp.F90

I also tried running the MAPL unit tests with the develop branch and found:

1: Test command: /discover/swdev/gmao_SIteam/MPI/mvapich2/2.3.4/intel-19.1.1.217-omnipath/bin/mpirun "-np" "8" "/discover/swdev/mathomp4/Models/MAPL-Develop-MV2/MAPL/build-Release/pfio/tests/MAPL.pfio.tests"
1: Test timeout computed to be: 1500
1: srun: cluster configuration lacks support for cpu binding
1: ................................................................... WARNING: no serverthread

and then it just locks up. But the "no serverthread" warning here and the crash in ServerThread.F90 above... maybe they're related?

To use MVAPICH2, I have a g5_modules for it:

/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.13/g5_modules.intel1911.mv2234

weiyuan-jiang commented 4 years ago

"WARNING: no serverthread" is fine. We do have cases without a serverthread.

mathomp4 commented 4 years ago

Oh. Huh. Well, it still seems to hang, so maybe it's the test after that one that matters?

weiyuan-jiang commented 4 years ago

cmake failed for the g5_modules:

-- Could NOT find MPI_C (missing: MPI_C_WORKS)
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
-- Could NOT find MPI_Fortran (missing: MPI_Fortran_WORKS)
CMake Error at /gpfsm/dulocal/sles12/other/cmake/3.17.0/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
  Could NOT find MPI (missing: MPI_C_FOUND MPI_CXX_FOUND MPI_Fortran_FOUND)
Call Stack (most recent call first):

weiyuan-jiang commented 4 years ago

module list

Currently Loaded Modules:
  1) git/2.24.0               5) ImageMagick/7.0.9-16    9) mpi/mvapich2/2.3.4/intel-19.1.1.217-omnipath
  2) cmake/3.17.0             6) GEOSenv                10) python/GEOSpyD/Ana2019.10_py2.7
  3) other/manage_externals   7) comp/gcc/8.3.0
  4) other/mepo               8) comp/intel/19.1.1.217

mathomp4 commented 4 years ago

@weiyuan-jiang I think they might be in the wrong order. You'll need to at least have the compiler and MPI bits in this order:

comp/gcc/8.3.0
comp/intel/19.1.1.217
mpi/mvapich2/2.3.4/intel-19.1.1.217-omnipath
python/GEOSpyD/Ana2019.10_py2.7
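On discover that would be roughly the following (a sketch of the load order only, not the full g5_modules environment):

# Sketch: load order only; the remaining modules from g5_modules are omitted
module purge
module load comp/gcc/8.3.0
module load comp/intel/19.1.1.217
module load mpi/mvapich2/2.3.4/intel-19.1.1.217-omnipath
module load python/GEOSpyD/Ana2019.10_py2.7
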
mathomp4 commented 3 years ago

Welp, I just built MVAPICH2 2.3.6 for Intel 2021.2 (figured maybe I could try it out on SCU16 when it's around) and I get the same crash:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
GEOSgcm.x          00000000031A756A  Unknown               Unknown  Unknown
libpthread-2.22.s  00002AAAC8D8EC10  Unknown               Unknown  Unknown
libmpi.so.12.1.1   00002AAAC825F14B  MPIDI_CH3I_Put        Unknown  Unknown
libmpi.so.12.1.1   00002AAAC825E7D4  MPID_Put              Unknown  Unknown
libmpi.so.12.1.1   00002AAAC81CC3DA  MPI_Put               Unknown  Unknown
libmpifort.so.12.  00002AAAC7CCFEC7  mpi_put_              Unknown  Unknown
libMAPL.pfio.so    00002AAAC094A9B5  pfio_serverthread         906  ServerThread.F90
libMAPL.pfio.so    00002AAAC08E86B7  pfio_baseservermo          69  BaseServer.F90
libMAPL.pfio.so    00002AAAC094D38A  pfio_serverthread         980  ServerThread.F90
libMAPL.pfio.so    00002AAAC08AD0D6  pfio_messagevisit          93  MessageVisitor.F90

The line number changed, but it's probably the same MPI_Put call:

https://github.com/GEOS-ESM/MAPL/blob/f654c7065013d320079b6e333d95b7f929832ccb/pfio/ServerThread.F90#L906

I also tried the MAPL tests and, again, MAPL.pfio.tests "locks up". Running with -d:

 Start: <Test_DirectoryService_suite.test_put_directory[npes=1][npes=1]>
.   end: <Test_DirectoryService_suite.test_put_directory[npes=1][npes=1]>

 Start: <Test_DirectoryService_suite.test_publish[npes=1][npes=1]>
.   end: <Test_DirectoryService_suite.test_publish[npes=1][npes=1]>

 Start: <Test_DirectoryService_suite.test_connect[npes=2][npes=2]>
.   end: <Test_DirectoryService_suite.test_connect[npes=2][npes=2]>

 Start: <Test_DirectoryService_suite.test_connect_swap_role[npes=2][npes=2]>

So it seems like test_connect_swap_role might be the "offending" test?

mathomp4 commented 3 years ago

If @weiyuan-jiang can propose any model-type tests to run, I can try that out. But at least c24 on one node has the MPI_Put crash.

mathomp4 commented 3 years ago

I also have a g5_modules if @weiyuan-jiang or anyone else wants to try to do some testing:

/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.2.4/g5_modules.intel2021_2_0.mv2_236_omnipath

Note that this ONLY works on Skylakes (built for Omni-Path).

Note 2. It looks like the best way to run with MVAPICH2 2.3.6 might be:

export MV2_ENABLE_AFFINITY=0
mpiexec.mpirun_rsh -export -np N ...

I think an equivalent one-liner is:

mpiexec.mpirun_rsh -np N MV2_ENABLE_AFFINITY=0 ...

but I haven't tested that...

mathomp4 commented 3 years ago

Bluh, and of course it runs the Intel MPI Benchmarks for Put just fine. (At least RMA and EXT.)

weiyuan-jiang commented 3 years ago

I believe the problem is MPI_Put itself. There are two MPI_Put calls: one is in MpiLock (used by the directory service), the other is in MpiServer. There is no MPI_Put call in the multigroup server. You can try setting up a multigroup server to see if it passes. There is some conflicting information here, though: the program passes the DirectoryService tests (one MPI_Put call) but fails in the later test.
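For reference, a minimal standalone sketch of a passive-target lock/put/unlock sequence (an illustration only; I'm assuming that call pattern, and this is not the actual MpiLock or MpiServer code) can be used to check whether MVAPICH2's MPI_Put misbehaves outside of MAPL:

program put_check
   ! Illustrative standalone check only; not the actual MpiLock/MpiServer code.
   use mpi
   implicit none

   integer :: ierr, rank, win
   integer :: buf(1), val(1)
   integer(kind=MPI_ADDRESS_KIND) :: winsize, disp

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   buf = 0
   winsize = 4            ! one default integer, in bytes
   call MPI_Win_create(buf, winsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)

   if (rank /= 0) then
      val  = rank
      disp = 0
      ! Passive-target one-sided put into rank 0's window: MPI_Put is the call
      ! that shows up in the tracebacks above
      call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win, ierr)
      call MPI_Put(val, 1, MPI_INTEGER, 0, disp, 1, MPI_INTEGER, win, ierr)
      call MPI_Win_unlock(0, win, ierr)
   end if

   call MPI_Barrier(MPI_COMM_WORLD, ierr)
   if (rank == 0) print *, 'window value on rank 0:', buf(1)

   call MPI_Win_free(win, ierr)
   call MPI_Finalize(ierr)
end program put_check

If a toy program like this also dies inside MPIDI_CH3I_Put under MVAPICH2, that would point at the MPI library rather than at pfio.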

mathomp4 commented 3 years ago

> I believe the problem is MPI_Put itself. There are two MPI_Put calls: one is in MpiLock (used by the directory service), the other is in MpiServer. There is no MPI_Put call in the multigroup server. You can try setting up a multigroup server to see if it passes. There is some conflicting information here, though: the program passes the DirectoryService tests (one MPI_Put call) but fails in the later test.

Okay. I just tried it in my tester and it looks like you can't run multigroup with no IOnodes. Let me try getting an extra node here...

mathomp4 commented 3 years ago

Ah. It works! Woo!

Things I just learned:

  1. Don't run --oserver_type multigroup without any ionodes (crash)
  2. Don't run --oserver_type multigroup without using --npes_backend_pernode N (crash)

Well, this at least gives @aoloso and me a way to test things out. You have to do a few things to make it happy, but it doesn't seem to crash with multigroup.
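Roughly, the invocation my tester ends up using looks like the sketch below. The process counts and the --nodes_output_server flag are placeholders/assumptions on my part; --oserver_type multigroup and --npes_backend_pernode are the options mentioned above, and the affinity/launcher bits are from my earlier note:

# Sketch: counts and --nodes_output_server are assumptions
export MV2_ENABLE_AFFINITY=0
mpiexec.mpirun_rsh -export -np 102 ./GEOSgcm.x \
    --oserver_type multigroup \
    --nodes_output_server 1 \
    --npes_backend_pernode 5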

@weiyuan-jiang Do you have a way to maybe get it to work without IOnodes (for low-res runs)?

tclune commented 3 years ago

In any event it should at worst fail with a useful error message and clean termination. Is it giving a seg fault or something?

weiyuan-jiang commented 3 years ago

@mathomp4 I have been thinking about retiring MpiServer (the default when there are no IOnodes). It is possible, but it may need more changes. I will give this issue a more serious look.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

mathomp4 commented 3 years ago

I'm going to label this long-term as it's essentially unsolvable until we can get an MVAPICH2 person on Discover. And @tclune knows the pain of that. (Maybe we should add a "long term due to bureaucracy" tag... :( )

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

stale[bot] commented 3 years ago

Closing due to inactivity

mathomp4 commented 3 years ago

I'm reopening and marking longterm. I will get MVAPICH2 to work again!