geodynamics / aspect

A parallel, extensible finite element code to simulate convection in both 2D and 3D models.
https://aspect.geodynamics.org/

TACC Singularity container in RHEL getting Fatal error in PMPI_Init_thread: Other MPI error, error stack #5137

Open SomePersonSomeWhereInTheWorld opened 1 year ago

SomePersonSomeWhereInTheWorld commented 1 year ago

Using geodynamics/aspect:latest-tacc, with Singularity version 3.7.1 on RHEL 8 with OpenMPI 4.1.5a1

$ singularity -v run aspect_latest-tacc.sif aspect-release slab_detachment.prm 
VERBOSE: Not forwarding SINGULARITY_TMPDIR environment variable
VERBOSE: Not forwarding SINGULARITY_BINDPATH environment variable
VERBOSE: Setting HOME=/path/to/me
VERBOSE: Setting PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
VERBOSE: Set messagelevel to: 4
VERBOSE: Starter initialization
VERBOSE: Check if we are running as setuid
VERBOSE: Drop root privileges
VERBOSE: Drop root privileges permanently
VERBOSE: Spawn stage 1
VERBOSE: Execute stage 1
VERBOSE: stage 1 exited with status 0
VERBOSE: Get root privileges
VERBOSE: Change filesystem uid to 547289
VERBOSE: Spawn master process
VERBOSE: Create mount namespace
VERBOSE: Entering in mount namespace
VERBOSE: Create mount namespace
VERBOSE: Spawn RPC server
VERBOSE: Execute master process
VERBOSE: Serve RPC requests
VERBOSE: Default mount: /proc:/proc
VERBOSE: Default mount: /sys:/sys
VERBOSE: Default mount: /dev:/dev
VERBOSE: Found 'bind path' = /etc/localtime, /etc/localtime
VERBOSE: Found 'bind path' = /etc/hosts, /etc/hosts
VERBOSE: Default mount: /tmp:/tmp
VERBOSE: Default mount: /var/tmp:/var/tmp
VERBOSE: Default mount: /etc/resolv.conf:/etc/resolv.conf
VERBOSE: Checking for template passwd file: /burg/opt/singularity-3.7/var/singularity/mnt/session/rootfs/etc/passwd
VERBOSE: Creating passwd content
VERBOSE: Creating template passwd file and appending user data: /burg/opt/singularity-3.7/var/singularity/mnt/session/rootfs/etc/passwd
VERBOSE: Default mount: /etc/passwd:/etc/passwd
VERBOSE: Checking for template group file: /path/to/singularity-3.7/var/singularity/mnt/session/rootfs/etc/group
VERBOSE: Creating group content
VERBOSE: Default mount: /etc/group:/etc/group
VERBOSE: /path/to/me found within container
VERBOSE: rpc server exited with status 0
VERBOSE: Execute stage 2
Abort(2664079) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1421): 
MPIDU_bc_table_create(311)...: 
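A common workaround for PMI initialization failures with containerized MPI is the so-called hybrid model: launch the ranks with the host's mpirun and let each rank exec the container, so the process-manager environment the container's MPI expects comes from the host. This is a sketch, not tested on this system, and it assumes the MPI inside the image is ABI-compatible with the host MPI:

```shell
# Hybrid-model launch (sketch): the host's mpirun sets up the PMI/PMIx
# environment, and each rank runs the solver inside the container.
# Assumes host and container MPI implementations are ABI-compatible.
mpirun -np 4 singularity run aspect_latest-tacc.sif aspect-release slab_detachment.prm
```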

I also took a shot at using the Docker image and pulling it into Singularity:

singularity run aspect.sif aspect-release slab_detachment.prm 
[g241:3607246] OPAL ERROR: Unreachable in file ext3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[g241:3607246] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

That's likely from not having the same version of OpenMPI, am I correct?
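One way to check for such a mismatch is to compare the MPI version on the host with the one baked into the image (a sketch; assumes mpirun is on PATH both on the host and inside the container):

```shell
# A major-version or implementation mismatch (e.g. Open MPI on the host
# vs. MPICH in the image) would explain the init failure.
mpirun --version
singularity exec aspect.sif mpirun --version
```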

gassmoeller commented 1 year ago

Hi @RobbieTheK, sorry for being slow to respond. Yes, both issues seem to be related to the interaction between the images and the system you are running them on. To say more I would need to know more about the system you are using, but here is some general information about the images:

SomePersonSomeWhereInTheWorld commented 1 year ago

Re: the 1st error: correct, this is not a TACC system; I was just trying to see if it would work.

This is a Bright Computing 9.1 cluster running RHEL 8 with Slurm 20, openmpi/gcc/64/4.1.5a1

I'm using an interactive srun job with -c4 -n4 as options.

$ mpirun -np 4 echo hello
hello
hello
hello
hello

Same error:

singularity run aspect.sif aspect-release slab_detachment.prm
[g225:3682368] OPAL ERROR: Unreachable in file ext3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[g225:3682368] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
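The two options from the error message can be sketched as commands. These are illustrations only: the paths, rank counts, and version directory are assumptions, not taken from this cluster.

```shell
# Option A (Slurm >= 16.05): launch through Slurm's PMIx plugin,
# which requires Slurm built --with-pmix.
srun --mpi=pmix -n 4 singularity run aspect.sif aspect-release slab_detachment.prm

# Option B (older Slurm): rebuild Open MPI against Slurm's PMI library,
# pointing --with-pmi at wherever the PMI headers/libs live on the cluster.
./configure --with-pmi=/usr --prefix=$HOME/openmpi-slurm
make -j4 && make install
```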
gassmoeller commented 1 year ago

"I'm using an interactive srun job with -c4 -n4 as options."

I have not tried using srun directly on this image. Can you instead set up a batch script that you run with sbatch, or run the command interactively on a development node? The message seems to say that you would need to compile MPI in a specific way to use srun, and our Docker image was not set up that way.
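A minimal batch script along those lines might look like this (job name, rank count, and time limit are assumptions for illustration):

```shell
#!/bin/bash
#SBATCH --job-name=aspect-test
#SBATCH --ntasks=4
#SBATCH --time=00:30:00

# Let the host's mpirun launch the ranks instead of srun's direct
# PMI launch, which the image's Open MPI was apparently not built for.
mpirun -np 4 singularity run aspect.sif aspect-release slab_detachment.prm
```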

bangerth commented 1 year ago

@RobbieTheK Can I assume that using a batch script solved the problem?

SomePersonSomeWhereInTheWorld commented 1 year ago

No, I ended up building and compiling deal.II via candi. I'd be happy to try again with sbatch, but as I mentioned, I did try with srun to no avail.

bangerth commented 1 year ago

What I meant to ask is whether you found a way to make it work for you?

SomePersonSomeWhereInTheWorld commented 1 year ago

Yes, by installing deal.II, but it would be nice for future users to have a container option. Happy to test suggestions.


bangerth commented 1 year ago

I don't know that I have anything to offer. I don't know much about singularity (or containers in general) and I don't work on the TACC machines. I'm also not sure we have the resources as a project to really figure this out.

@gassmoeller @tjhei Do you have anything to offer? Or should we just say "We'd love to provide this, but we can't" and close the issue?