E4S-Project / testsuite

E4S test suite with validation tests

Precice test fails on perlmutter #50

Open wspear opened 1 year ago

wspear commented 1 year ago

@MakisH @fsimonis

The preCICE test defined here: https://github.com/E4S-Project/testsuite/tree/master/validation_tests/precice

fails on Perlmutter for this variant installed with E4S 22.11:

-- linux-sles15-zen3 / gcc@11.2.0 -------------------------------
precice@2.5.0~ipo+mpi+petsc~python+shared build_system=cmake build_type=RelWithDebInfo

With the following console output:

DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverOne", and mesh name "MeshOne".
preCICE: This is preCICE version 2.5.0
preCICE: Revision info: no-info [git failed to run]
preCICE: Build type: Release (without debug log)
preCICE: Configuring preCICE with configuration "precice-config.xml"
preCICE: I am participant "SolverOne"
preCICE: Setting up primary communication to coupling partner/s
MPICH ERROR [Rank 0] [job id ] [Mon Nov 21 12:25:26 2022] [nid001032] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)

DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverTwo", and mesh name "MeshTwo".
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)
fsimonis commented 1 year ago

Are there any specifics of the spec used to build preCICE?

Also, as a note, the tilde in the variant doesn't render well in Markdown. I suggest wrapping it in a code block.

wspear commented 1 year ago

@fsimonis Fixed that variant. Here is the full dependency tree. Is there anything else that would help pin this down?

-- linux-sles15-zen3 / gcc@11.2.0 -------------------------------
egt4cn6 precice@2.5.0~ipo+mpi+petsc~python+shared build_system=cmake build_type=RelWithDebInfo
5sebukm     boost@1.80.0~atomic~chrono~clanglibcpp~container~context~contract~coroutine~date_time~debug~exception~fiber+filesystem~graph~graph_parallel~icu~iostreams~json~locale+log~math+mpi+multithreaded~nowide~numpy~pic+program_options~python~random~regex~serialization+shared~signals~singlethreaded~stacktrace+system~taggedlayout+test+thread~timer~type_erasure~versionedlayout~wave build_system=generic cxxstd=98 patches=a440f96 visibility=hidden
bnyqmik         cray-mpich@8.1.17+wrappers build_system=generic
4allaay     cmake@3.24.2~doc+ncurses~ownlibs~qt build_system=generic build_type=Release
s5aelxi         curl@7.85.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2~nghttp2 build_system=autotools libs=shared,static tls=gnutls
56brvrh             gnutls@3.7.8~guile+zlib build_system=autotools
s3iopwe                 gettext@0.21.1+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools
g2bpsoz                     bzip2@1.0.8~debug~pic+shared build_system=generic
rnafwos                         diffutils@3.8 build_system=autotools
xfogkcu                             libiconv@1.16 build_system=autotools libs=shared,static
jbbwlo5                     libxml2@2.10.1~python build_system=autotools
savxweu                         pkgconf@1.8.0 build_system=autotools
yucs7bj                         xz@5.2.7+pic build_system=autotools libs=shared,static
76b2zrq                         zlib@1.2.13+optimize+pic+shared build_system=makefile
igbrz2c                     ncurses@6.3~symlinks+termlib abi=none build_system=autotools
a35zenx                     tar@1.34 build_system=autotools zip=pigz
dmtmfzy                         pigz@2.7 build_system=makefile
crilnoq                         zstd@1.5.2+programs build_system=makefile compression=none libs=shared,static
7sx44ru                 libidn2@2.3.0 build_system=autotools
omjzrqu                     libunistring@0.9.10 build_system=autotools
rv7bhhx                 nettle@3.8.1 build_system=autotools
5n3nphp                     gmp@6.2.1 build_system=autotools libs=shared,static
4my7pdm                         autoconf@2.69 build_system=autotools patches=35c4492,7793209,a49dd5b
yasn2hy                             m4@1.4.19+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7
ni76haj                                 libsigsegv@2.13 build_system=autotools
ucjrwtm                             perl@5.36.0+cpanm+shared+threads build_system=generic
gqdvawb                                 berkeley-db@18.1.40+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc
otqsxvg                                 gdbm@1.23 build_system=autotools
6mvf2em                                     readline@8.1.2 build_system=autotools
t3onfyz                         automake@1.16.5 build_system=autotools
xyihrmc                         libtool@2.4.7 build_system=autotools
jfoyxbd         expat@2.4.8+libbsd build_system=autotools
uo7vnpu             libbsd@0.11.5 build_system=autotools
bcya2vp                 libmd@1.0.4 build_system=autotools
di26ddu         libarchive@3.5.2+iconv build_system=autotools compression=bz2lib,lz4,lzma,lzo2,zlib,zstd crypto=mbedtls libs=shared,static programs=none xar=expat
z67fidq             lz4@1.9.4 build_system=makefile libs=shared,static
7a4tsiy             lzo@2.10 build_system=autotools libs=shared,static
mskuajx             mbedtls@2.28.0+pic build_system=makefile build_type=Release libs=static
k5mmyyz         libuv@1.44.1 build_system=autotools
qrkehbg         rhash@1.4.2 build_system=makefile patches=093518c,3fbfe46
wzlxfkh     eigen@3.4.0~ipo build_system=cmake build_type=RelWithDebInfo
bpqapvu     petsc@3.18.1~X~batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre~int64~jpeg~knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi~mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws~scalapack+shared~strumpack~suite-sparse+superlu-dist~tetgen~trilinos~valgrind build_system=generic clanguage=C
qztwosa         cray-libsci@21.08.1.2+mpi~openmp+shared build_system=generic
vw5amky         hdf5@1.12.2~cxx+fortran+hl~ipo~java+mpi+shared~szip~threadsafe+tools api=default build_system=cmake build_type=RelWithDebInfo
dbfenpi         hypre@2.26.0~complex~cuda~debug+fortran~gptune~int64~internal-superlu~mixedint+mpi~openmp~rocm+shared~superlu-dist~umpire~unified-memory build_system=autotools
f2t5phj         metis@5.1.0~gdb~int64~ipo~real64+shared build_system=cmake build_type=RelWithDebInfo patches=4991da9,93a7903,b1225da
4wdj56e         parmetis@4.0.3~gdb~int64~ipo+shared build_system=cmake build_type=RelWithDebInfo patches=4f89253,50ed208,704b84f
iespikt         python@3.7.15+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3~ssl~tix~tkinter~ucs4+uuid+zlib build_system=generic patches=0d98e93,f2fd060
kd4a5vc             libffi@3.4.2 build_system=autotools
czakzn2             sqlite@3.39.4+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools
4qyr4mp             util-linux-uuid@2.38.1 build_system=autotools
auhajyt         superlu-dist@8.1.1~cuda~int64~ipo~openmp~rocm+shared build_system=cmake build_type=RelWithDebInfo
wspear commented 1 year ago

It looks like this was caused by a poisoned runtime environment. This error doesn't appear on a fresh run node.

wspear commented 1 year ago

@fsimonis I resolved this too quickly. I get a hang/timeout with the following run output in a clean environment:

DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverOne", and mesh name "MeshOne".
DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverTwo", and mesh name "MeshTwo".
MPICH ERROR [Rank 0] [job id ] [Tue Dec  6 15:18:11 2022] [nid001901] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......: 
MPID_Init(495)..............: 
MPIDI_OFI_mpi_init_hook(816): 
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)

aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......: 
MPID_Init(495)..............: 
MPIDI_OFI_mpi_init_hook(816): 
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)
wspear@nid001901:~/SPACK-SPACE/wspear/perlmutter/22.11/gnu/testsuite/validation_tests/precice> preCICE: This is preCICE version 2.5.0
preCICE: Revision info: no-info [git failed to run]
preCICE: Build type: Release (without debug log)
preCICE: Configuring preCICE with configuration "precice-config.xml"
preCICE: I am participant "SolverOne"
preCICE: Setting up primary communication to coupling partner/s
fsimonis commented 1 year ago

Both solvers fail in MPI_Init with the same error: create_endpoint:Address already in use.

Given that both of them fail with the same error, I expect that this is some kind of problem in the environment.

We don't do any fancy things in preCICE, so this should be reproducible with any dummy MPI code.
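To make the "same error in both solvers" observation checkable, here is a minimal Python sketch that pulls the non-empty frames out of an MPICH error stack like the ones posted above and compares them across logs. The function names `failure_reasons` and `shared_reason` are hypothetical helpers, and the regular expression only assumes the `function(line)....: message` layout visible in this thread.

```python
import re

# Matches one frame of an MPICH error stack that carries a message, e.g.
# "create_endpoint(1353).......: OFI EP enable failed (...)".
# Frames with nothing after the colon are skipped on purpose.
FRAME_RE = re.compile(r"^(?P<func>\w+)\((?P<line>\d+)\)\.*:\s*(?P<msg>\S.*)$")


def failure_reasons(log_text):
    """Return (function, message) pairs for every stack frame with a message."""
    reasons = []
    for raw in log_text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match:
            reasons.append((match.group("func"), match.group("msg").strip()))
    return reasons


def shared_reason(logs):
    """Return the single message common to all logs, or None if they differ."""
    per_log = [{msg for _, msg in failure_reasons(text)} for text in logs]
    if not per_log or any(not messages for messages in per_log):
        return None
    common = set.intersection(*per_log)
    return next(iter(common)) if len(common) == 1 else None
```

Passing the SolverOne and SolverTwo outputs as two strings to `shared_reason` returns the common message when both stacks end in the same frame, and `None` otherwise.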

wspear commented 1 year ago

I'm seeing the same issue on Crusher. The error and the variants/dependencies for the Crusher install are below. This is in a clean environment (basically all I've done is spack load precice), with other MPI-based products generally testing successfully.

Skipping load: Environment already setup
MPICH ERROR [Rank 0] [job id ] [Fri Nov 18 11:40:41 2022] [crusher131] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)

DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverOne", and mesh name "MeshOne".
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)
DUMMY: Running solver dummy with preCICE config file "precice-config.xml", participant name "SolverTwo", and mesh name "MeshTwo".
preCICE: This is preCICE version 2.5.0
preCICE: Revision info: no-info [git failed to run]
preCICE: Build type: Release (without debug log)
preCICE: Configuring preCICE with configuration "precice-config.xml"
preCICE: I am participant "SolverTwo"
preCICE: Setting up primary communication to coupling partner/s
-- linux-sles15-zen3 / gcc@11.2.0 -------------------------------
2weu3di precice@2.5.0~ipo+mpi+petsc~python+shared build_system=cmake build_type=RelWithDebInfo
trtrf3b     boost@1.80.0~atomic~chrono~clanglibcpp~container~context~contract~coroutine~date_time~debug~exception~fiber+filesystem~graph~graph_parallel~icu~iostreams~json~locale+log~math+mpi+multithreaded~nowide~numpy~pic+program_options~python~random~regex~serialization+shared~signals~singlethreaded~stacktrace+system~taggedlayout+test+thread~timer~type_erasure~versionedlayout~wave build_system=generic cxxstd=98 patches=a440f96 visibility=hidden
oaykapp         cray-mpich@8.1.17+wrappers build_system=generic
c6gpjyk     cmake@3.24.2~doc+ncurses+ownlibs~qt build_system=generic build_type=Release
igbrz2c         ncurses@6.3~symlinks+termlib abi=none build_system=autotools
savxweu             pkgconf@1.8.0 build_system=autotools
kq7i44v         openssl@1.1.1s~docs~shared build_system=generic certs=mozilla
6ki4n47             ca-certificates-mozilla@2022-10-11 build_system=generic
ucjrwtm             perl@5.36.0+cpanm+shared+threads build_system=generic
gqdvawb                 berkeley-db@18.1.40+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc
g2bpsoz                 bzip2@1.0.8~debug~pic+shared build_system=generic
rnafwos                     diffutils@3.8 build_system=autotools
xfogkcu                         libiconv@1.16 build_system=autotools libs=shared,static
otqsxvg                 gdbm@1.23 build_system=autotools
6mvf2em                     readline@8.1.2 build_system=autotools
76b2zrq                 zlib@1.2.13+optimize+pic+shared build_system=makefile
3oefhug     eigen@3.4.0~ipo build_system=cmake build_type=RelWithDebInfo
jbbwlo5     libxml2@2.10.1~python build_system=autotools
yucs7bj         xz@5.2.7+pic build_system=autotools libs=shared,static
hn5xr53     petsc@3.18.1~X+batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre~int64~jpeg~knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi~mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws~scalapack+shared~strumpack~suite-sparse+superlu-dist~tetgen~trilinos~valgrind build_system=generic clanguage=C
dc5jfan         hdf5@1.12.2~cxx+fortran+hl~ipo~java+mpi+shared~szip~threadsafe+tools api=default build_system=cmake build_type=RelWithDebInfo
e5s4iy7         hypre@2.26.0~complex~cuda~debug+fortran~gptune~int64~internal-superlu~mixedint+mpi~openmp~rocm+shared~superlu-dist~umpire~unified-memory build_system=autotools
bgpvt5g             openblas@0.3.21~bignuma~consistent_fpcsr+fortran~ilp64+locking+pic+shared build_system=makefile patches=d3d9b15 symbol_suffix=none threads=openmp
jfxbkfk         metis@5.1.0~gdb~int64~ipo~real64+shared build_system=cmake build_type=RelWithDebInfo patches=4991da9,93a7903,b1225da
f3ztx6d         parmetis@4.0.3~gdb~int64~ipo+shared build_system=cmake build_type=RelWithDebInfo patches=4f89253,50ed208,704b84f
du4hbnl         python@3.7.15+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib build_system=generic patches=0d98e93,f2fd060
jfoyxbd             expat@2.4.8+libbsd build_system=autotools
uo7vnpu                 libbsd@0.11.5 build_system=autotools
bcya2vp                     libmd@1.0.4 build_system=autotools
s3iopwe             gettext@0.21.1+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools
a35zenx                 tar@1.34 build_system=autotools zip=pigz
dmtmfzy                     pigz@2.7 build_system=makefile
crilnoq                     zstd@1.5.2+programs build_system=makefile compression=none libs=shared,static
kd4a5vc             libffi@3.4.2 build_system=autotools
czakzn2             sqlite@3.39.4+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools
4qyr4mp             util-linux-uuid@2.38.1 build_system=autotools
kzjsqlm         superlu-dist@8.1.1~cuda~int64~ipo~openmp~rocm+shared build_system=cmake build_type=RelWithDebInfo
fsimonis commented 1 year ago

We test MPICH in our CI using Fedora, which is still at version 34 (MPICH 3.4.1). I'll upgrade to Fedora 37 (MPICH 4.0.2) and see if this succeeds. In the meantime, I'll build preCICE 2.5.0 using the newest Spack with MPICH to see if that succeeds on my workstation. Then I'll get back to you.

Have you tried launching multiple other MPI programs simultaneously to see if the system can handle this? We experienced problems on SuperMUC(-NG) when multiple MPI programs ran simultaneously on the same slots while spanning multiple nodes. This could be another symptom of the same problem. (Of course, this is more of a guess, as you don't actually run the solverdummies with mpirun.)
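One way to try this is a small launcher that starts several programs at once and reports which ones exit, fail, or hang. The following is a minimal sketch, not part of the test suite: `run_concurrently` is a hypothetical helper, and the commands passed to it are placeholders for any small MPI program available on the system. Output goes to temporary files rather than pipes, so coupled processes cannot block each other on a full pipe.

```python
import subprocess
import sys
import tempfile


def run_concurrently(commands, timeout=60):
    """Start every command at once; return a list of (returncode, output).

    A returncode of None marks a process that was killed after `timeout`
    seconds. The timeout is applied to each process in turn.
    """
    jobs = []
    for cmd in commands:
        log = tempfile.TemporaryFile(mode="w+")
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        jobs.append((proc, log))

    results = []
    for proc, log in jobs:
        try:
            returncode = proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            returncode = None  # treated as a hang
        log.seek(0)
        results.append((returncode, log.read()))
        log.close()
    return results


if __name__ == "__main__":
    # Placeholder command: substitute a small MPI executable here.
    placeholder = [sys.executable, "-c", "print('hello')"]
    for returncode, output in run_concurrently([placeholder, placeholder]):
        print(returncode, output.strip())
```

If two instances of a trivial MPI program started this way fail in the same manner as the solverdummies, that would point at the environment rather than at preCICE.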

fsimonis commented 1 year ago

Your test runs fine locally with:

spack --version
0.20.0.dev0 (7056a4bffd8f37615bc5efee8f02a400dceaec5c)

Using the spec:

-- linux-archrolling-zen3 / gcc@12.2.0 --------------------------
t4mqo7z precice@2.5.0~ipo+mpi+petsc~python+shared build_system=cmake build_type=RelWithDebInfo
us4udt5     boost@1.80.0~atomic~chrono~clanglibcpp~container~context~contract~coroutine~date_time~debug~exception~fiber+filesystem~graph~graph_parallel~icu~iostreams~json~locale+log~math~mpi+multithreaded~nowide~numpy~pic+program_options~python~random~regex~serialization+shared~signals~singlethreaded~stacktrace+system~taggedlayout+test+thread~timer~type_erasure~versionedlayout~wave build_system=generic cxxstd=98 patches=a440f96 visibility=hidden
7xgan6m     cmake@3.24.1~doc+ncurses+ownlibs~qt build_system=generic build_type=Release
2tmrrpw     eigen@3.4.0~ipo build_system=cmake build_type=RelWithDebInfo
vy67cbo     libxml2@2.10.3~python build_system=autotools
6ltr5dl         libiconv@1.16 build_system=autotools libs=shared,static
5ggmxkn         xz@5.2.7~pic build_system=autotools libs=shared,static
dpj4bms         zlib@1.2.13+optimize+pic+shared build_system=makefile
zq4eoyj     mpich@4.0.2~argobots~cuda+fortran+hwloc+hydra+libxml2+pci~rocm+romio~slurm~two_level_namespace~vci~verbs+wrapperrpath build_system=autotools datatype-engine=auto device=ch4 netmod=ofi patches=d4c0e99 pmi=pmi
53lepuv         findutils@4.9.0 build_system=autotools patches=440b954
wnrcksl         hwloc@2.8.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml~oneapi-level-zero~opencl+pci~rocm build_system=autotools libs=shared,static
tehwqeo             ncurses@6.3~symlinks+termlib abi=none build_system=autotools
kq4iabz         libfabric@1.16.1~debug~kdreg build_system=autotools fabrics=sockets,tcp,udp
jidochn         libpciaccess@0.16 build_system=autotools
ehr3efd             libtool@2.4.7-dirty build_system=autotools
53b3qec             util-macros@1.19.3 build_system=autotools
yy5vpjv         yaksa@0.2~cuda~rocm build_system=autotools
bd6cvfl             autoconf@2.71 build_system=autotools
s3bwkg4             automake@1.16.5 build_system=autotools
ztsqh6m             m4@1.4.19+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7
cehm5ed     petsc@3.18.2~X~batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre~int64~jpeg~knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi~mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws~scalapack+shared~strumpack~suite-sparse+superlu-dist~tetgen~trilinos~valgrind build_system=generic clanguage=C
4ivuxig         diffutils@3.8 build_system=autotools
jyct3ow         hdf5@1.12.2~cxx~fortran~hl~ipo~java+mpi+shared~szip~threadsafe+tools api=default build_system=cmake build_type=RelWithDebInfo
lamojl4         hypre@2.26.0~complex~cuda~debug+fortran~gptune~int64~internal-superlu~mixedint+mpi~openmp~rocm+shared~superlu-dist~umpire~unified-memory build_system=autotools
gkgqte5         metis@5.1.0~gdb~int64~ipo~real64+shared build_system=cmake build_type=RelWithDebInfo patches=4991da9,93a7903,b1225da
23ihaez         openblas@0.3.21~bignuma~consistent_fpcsr+fortran~ilp64+locking+pic+shared build_system=makefile patches=d3d9b15 symbol_suffix=none threads=none
3wvtlf6             perl@5.36.0+cpanm+shared+threads build_system=generic
fogt6mt                 berkeley-db@18.1.40+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc
skhcew2         parmetis@4.0.3~gdb~int64~ipo+shared build_system=cmake build_type=RelWithDebInfo patches=4f89253,50ed208,704b84f
tdmgiza         python@3.10.8+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic patches=0d98e93,7d40923,f2fd060
pzibomt             bzip2@1.0.8~debug~pic+shared build_system=generic
qp2r7iz             expat@2.5.0+libbsd build_system=autotools
eb55tgs                 libbsd@0.11.5 build_system=autotools
h643yv4                     libmd@1.0.4 build_system=autotools
aap5vzx             gdbm@1.23 build_system=autotools
q7goc63             gettext@0.21.1+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools
mlrmz6k                 tar@1.34 build_system=autotools zip=pigz
jndjnxn             libffi@3.4.2 build_system=autotools
ojhzllf             libxcrypt@4.4.33~obsolete_api build_system=autotools
xu5sfij             openssl@1.1.1s~docs~shared build_system=generic certs=mozilla
sqdghw3                 ca-certificates-mozilla@2022-10-11 build_system=generic
crr6ch5             readline@8.2 build_system=autotools patches=bbf97f1
46hrmmf             sqlite@3.40.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools
6fveg3y             util-linux-uuid@2.38.1 build_system=autotools
wc4fllt         superlu-dist@8.1.2~cuda~int64~ipo~openmp~rocm+shared build_system=cmake build_type=RelWithDebInfo
vgr5oe6     pkgconf@1.8.0 build_system=autotools