easybuilders / easybuild

EasyBuild - building software with ease
http://easybuild.io
GNU General Public License v2.0

UCX error in OpenMPI-4.1.1 foss-2021a build #756

Open connorourke opened 2 years ago

connorourke commented 2 years ago

During the build of OpenMPI-4.1.1 with the foss-2021a toolchain I get the following error:

[1636483897.276373] [ip-AC125812:109544:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109544] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Shared memory error
[1636483897.280964] [ip-AC125812:109542:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/109541/fd/33 flags=0x0) failed: No such file or directory
[1636483897.281006] [ip-AC125812:109542:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109542] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Shared memory error
[1636483897.281576] [ip-AC125812:109543:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/109541/fd/33 flags=0x0) failed: No such file or directory
[1636483897.281602] [ip-AC125812:109543:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109543] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Shared memory error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[ip-AC125812:109544] *** An error occurred in MPI_Init
[ip-AC125812:109544] *** reported by process [2187460609,3]
[ip-AC125812:109544] *** on a NULL communicator
[ip-AC125812:109544] *** Unknown error
[ip-AC125812:109544] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-AC125812:109544] ***    and potentially your MPI job)
[ip-AC125812:109519] 2 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[ip-AC125812:109519] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-AC125812:109519] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
) (at easybuild/framework/easyblock.py:3311 in _sanity_check_step)
== 2021-11-09 18:51:42,522 build_log.py:265 INFO ... (took 17 secs)
== 2021-11-09 18:51:42,522 filetools.py:1971 INFO Removing lock /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock...
== 2021-11-09 18:51:42,530 filetools.py:380 INFO Path /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock successfully removed.
== 2021-11-09 18:51:42,530 filetools.py:1975 INFO Lock removed: /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock
== 2021-11-09 18:51:42,530 easyblock.py:3915 WARNING build failed (first 300 chars): Sanity check failed: sanity check command mpirun -n 4 /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/build/OpenMPI/4.1.1/GCC-10.3.0/mpi_test_hello_usempi exited with code 1 (output: [1636483897.276312] [ip-AC125812:109544:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/109
== 2021-11-09 18:51:42,531 easyblock.py:307 INFO Closing log for application name OpenMPI version 4.1.1

It looks like UCX is trying to open a file that no longer exists (`/proc/<pid>/fd/33` of another rank), which then makes its shared-memory transport fail.

Has anyone seen this error and know of a fix?
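The failing call in the log is UCX's posix shared-memory transport opening a file descriptor of a peer process via `/proc/<pid>/fd/<n>`, which requires that processes can see each other in `/proc`. A quick sanity check is to see whether one process can list another's fd table at all. This is only a diagnostic sketch for a Linux `/proc` layout, not part of EasyBuild or UCX:

```shell
#!/bin/sh
# Start a throwaway background process and try to read its /proc fd table,
# which is the same kind of cross-process access UCX attempts in the log.
sleep 30 &
pid=$!
if ls "/proc/$pid/fd" >/dev/null 2>&1; then
  echo "proc-link OK"
else
  echo "proc-link blocked"
fi
kill "$pid" 2>/dev/null
```

If this reports "proc-link blocked" (e.g. due to hidepid mount options, containers, or restrictive ptrace settings), that would explain the `open(file_name=/proc/.../fd/33) failed: No such file or directory` errors. Upstream UCX discussions suggest `UCX_POSIX_USE_PROC_LINK=n` as a workaround to avoid the `/proc`-based path; treat that variable as something to verify against your UCX version's `ucx_info -c -f` output.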

boegel commented 2 years ago

@connorourke This looks a lot like the problem reported upstream at https://github.com/openucx/ucx/issues/4224 .

Are you running in a user namespace?
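For anyone unsure how to answer this: on Linux, the initial user namespace maps the full uid range, so `/proc/self/uid_map` reads `0 0 4294967295` there. A different mapping indicates the shell is inside a user namespace. A minimal check, assuming a standard Linux `/proc`:

```shell
#!/bin/sh
# In the initial user namespace, /proc/self/uid_map has a single line
# "0 0 4294967295"; any other mapping means a user namespace is active.
read inside outside count < /proc/self/uid_map
if [ "$inside" = 0 ] && [ "$outside" = 0 ] && [ "$count" = 4294967295 ]; then
  echo "initial user namespace"
else
  echo "user namespace detected"
fi
```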

connorourke commented 2 years ago

Nope - not running in a user namespace @boegel.

vanzod commented 2 years ago

@boegel I just hit the exact same issue with the foss-2021b toolchain. The strange thing is that it happens on certain machines while on others it builds smoothly.

@connorourke On which hardware were you trying to build it?

connorourke commented 2 years ago

It was on an AMD Milan EPYC 7V13.

hezhiqiang8909 commented 2 years ago

== FAILED: Installation ended unsuccessfully (build directory: /public/software/.local/easybuild/build/OpenMPI/4.1.1/GCC-10.3.0): build failed (first 300 chars): Sanity check failed: sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 4 /public/software/.local/easybuild/build/OpenMPI/4.1.1/GCC-10.3.0/mpi_test_hello_c exited with code 1