E4S-Project / e4s-cl

Container manager for E4S
https://e4s-cl.readthedocs.io
MIT License

Bug report using gcc and impi on NOAA hera system #123

Open thomas-robinson opened 2 months ago

thomas-robinson commented 2 months ago

While running e4s-cl init, I received an error saying that it was an e4s-cl bug and asking me to report the contents of a debug file on GitHub. The contents of that file are pasted below:

$ cat /home/Thomas.Robinson/.local/e4s_cl/logs/debug_log
 [Debug root:483] 
########################################################################################################################################################
E4S CONTAINER LAUNCHER LOGGING INITIALIZED

Timestamp         : 2024-09-03 13:17:23.794833
Hostname          : hfe05
Platform          : Linux-4.18.0-477.27.1.el8_8.88ciq_lts.0.1.x86_64-x86_64-with-glibc2.28
Version           : 1.0.5.dev1+g35e5e6a
Python Version    : 3.12.4
Working Directory : /scratch2/GFDL/e4s/Thomas.Robinson/containers
Terminal Size     : 152x32
Frozen            : False
Log ID            : 0ad97a938a1a609713e75e9db3edb9d99134fa24084f2412fc43ef7cb0037359
########################################################################################################################################################

[Debug e4s_cl.cli.commands.__main__:77] e4s-cl args: Namespace(command='init', options=['--profile', 'gfdl2024.01', '--launcher', 'srun', '--backend', 'singularity', '--image', '/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif'], dry_run=None)
[Debug e4s_cl.cli.commands.init:77] e4s-cl init args: Namespace(profile_name='gfdl2024.01', launcher='/apps/slurm/default/bin/srun', backend='singularity', image='/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif', cmd=[])
[Debug e4s_cl.cf.storage.local_file:50] '/home/Thomas.Robinson/.local/e4s_cl/user.json' opened read-write
[Debug e4s_cl.cf.storage.local_file:170] Initialized user database '/home/Thomas.Robinson/.local/e4s_cl/user.json'
[+] Tracing MPI execution using:
[+] '/apps/slurm/default/bin/srun /scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'
[Debug e4s_cl.cli.commands.profile.detect:77] e4s-cl profile detect args: Namespace(profile_name=None, cmd=['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'])
[Debug e4s_cl.util:211] Running with parent status: ['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/python', '/scratch2/GFDL/e4s/bin/bin/e4s-cl', 'profile', 'detect', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester']
Failed to determine necessary libraries: program exited with code 1
[+] Attach <PtraceProcess #2397590> to debugger
[+] Set <PtraceProcess #2397590> options to 1
[+] Created profile gfdl2024.01
[Debug root:483] 
########################################################################################################################################################
E4S CONTAINER LAUNCHER LOGGING INITIALIZED

Timestamp         : 2024-09-03 13:18:38.549017
Hostname          : h11c53
Platform          : Linux-4.18.0-477.27.1.el8_8.88ciq_lts.0.1.x86_64-x86_64-with-glibc2.28
Version           : 1.0.5.dev1+g35e5e6a
Python Version    : 3.12.4
Working Directory : /scratch2/GFDL/e4s/Thomas.Robinson/containers
Terminal Size     : 152x32
Frozen            : False
Log ID            : befd6bd2404fe811dc2d9f4e42d7d451d4c2ba762934fc55b94067328a6687f0
########################################################################################################################################################

[Debug e4s_cl.cli.commands.__main__:77] e4s-cl args: Namespace(command='init', options=['--profile', 'gfdl2024.01', '--launcher', 'srun', '--backend', 'singularity', '--image', '/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif'], dry_run=None)
[Debug e4s_cl.cli.commands.init:77] e4s-cl init args: Namespace(profile_name='gfdl2024.01', launcher='/apps/slurm/default/bin/srun', backend='singularity', image='/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif', cmd=[])
[Debug e4s_cl.cf.storage.local_file:50] '/home/Thomas.Robinson/.local/e4s_cl/user.json' opened read-write
[Debug e4s_cl.cf.storage.local_file:170] Initialized user database '/home/Thomas.Robinson/.local/e4s_cl/user.json'
[+] Tracing MPI execution using:
[+] '/apps/slurm/default/bin/srun /scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'
[Debug e4s_cl.cli.commands.profile.detect:77] e4s-cl profile detect args: Namespace(profile_name=None, cmd=['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'])
[Debug e4s_cl.util:211] Running with parent status: ['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/python', '/scratch2/GFDL/e4s/bin/bin/e4s-cl', 'profile', 'detect', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester']
[+] Attach <PtraceProcess #1470539> to debugger
[+] Set <PtraceProcess #1470539> options to 1
[+] Created profile gfdl2024.01
[Debug root:483] 
########################################################################################################################################################
E4S CONTAINER LAUNCHER LOGGING INITIALIZED

Timestamp         : 2024-09-03 13:20:20.583304
Hostname          : h11c53
Platform          : Linux-4.18.0-477.27.1.el8_8.88ciq_lts.0.1.x86_64-x86_64-with-glibc2.28
Version           : 1.0.5.dev1+g35e5e6a
Python Version    : 3.12.4
Working Directory : /scratch2/GFDL/e4s/Thomas.Robinson/containers
Terminal Size     : 152x32
Frozen            : False
Log ID            : 0762531179c2b4d0051837a22b8316642f05373133dfabc70dca9d4f1093cea8
########################################################################################################################################################

[Debug e4s_cl.cli.commands.__main__:77] e4s-cl args: Namespace(command='init', options=['--profile', 'gfdl2024.01', '--launcher', 'srun', '--backend', 'singularity', '--image', '/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif'], dry_run=None)
[Debug e4s_cl.cli.commands.init:77] e4s-cl init args: Namespace(profile_name='gfdl2024.01', launcher='/apps/slurm/default/bin/srun', backend='singularity', image='/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif', cmd=[])
[Debug e4s_cl.cf.storage.local_file:50] '/home/Thomas.Robinson/.local/e4s_cl/user.json' opened read-write
[Debug e4s_cl.cf.storage.local_file:170] Initialized user database '/home/Thomas.Robinson/.local/e4s_cl/user.json'
[+] Tracing MPI execution using:
[+] '/apps/slurm/default/bin/srun /scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'
[Debug e4s_cl.cli.commands.profile.detect:77] e4s-cl profile detect args: Namespace(profile_name=None, cmd=['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'])
[Debug e4s_cl.util:211] Running with parent status: ['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/python', '/scratch2/GFDL/e4s/bin/bin/e4s-cl', 'profile', 'detect', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester']
Failed to determine necessary libraries: program exited with code 156
[+] Attach <PtraceProcess #1470679> to debugger
[+] Set <PtraceProcess #1470679> options to 1
[+] Created profile gfdl2024.01

Here are the modules I have loaded:

$ module list
Currently Loaded Modules:
  1) gnu/9.2.0   2) impi/2020

My container uses GCC 13 and MPICH, both installed with Spack.

FrederickDeny commented 2 months ago

Hi Thomas, the fact that the created profile's name doesn't specify an MPI vendor ("[+] Created profile gfdl2024.01") indicates that e4s-cl failed to find any of libmpi.so.12, libmpi_cray.so.12, or libmpi.so.40. e4s-cl tries to locate one of these three and names the newly created profile accordingly.

Could you check if the correct libmpi.so is in your LD_LIBRARY_PATH?
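One way to run that check is sketched below in Python. It is not e4s-cl code: the function name find_mpi_libraries and the vendor labels are assumptions made for illustration, and only the three sonames come from the comment above. It walks the directories listed in LD_LIBRARY_PATH and reports which of the sonames exist there. Empty path entries are skipped for simplicity, and locations the dynamic linker also consults (RPATH, the ld.so cache) are not covered.

```python
import os
from pathlib import Path

# Sonames named in the comment above. The vendor labels are an assumption
# based on common packaging, not taken from e4s-cl itself.
MPI_SONAMES = {
    "libmpi.so.12": "MPICH-compatible (MPICH, Intel MPI)",
    "libmpi_cray.so.12": "Cray MPICH",
    "libmpi.so.40": "Open MPI",
}


def find_mpi_libraries(ld_library_path=None):
    """Return (soname, vendor, full path) for each soname found on the path."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    matches = []
    for entry in ld_library_path.split(":"):
        if not entry:
            # Simplification: empty entries are ignored here.
            continue
        directory = Path(entry)
        for soname, vendor in MPI_SONAMES.items():
            candidate = directory / soname
            if candidate.exists():
                matches.append((soname, vendor, str(candidate)))
    return matches
```

Calling find_mpi_libraries() with no argument reads the current environment. An empty list would be consistent with the unnamed profile in the log.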

spoutn1k commented 2 months ago

The idea behind the init command is to understand what the MPI environment is and save the detected configuration, to avoid computing it every time.

This is done with a Python script that accesses an MPI library from the environment, then loads and calls well-known symbols to run basic operations. This confirms that the library works properly and forces it to load every library it needs to function, since some are lazy-loaded.
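The general technique can be sketched with ctypes. This is an illustration of the approach described, not the actual tester script: the helper names (load_first_available, missing_symbols, smoke_test) and the choice of symbols are assumptions.

```python
import ctypes

# A few symbols any MPI library is expected to export; the selection is
# illustrative and not taken from e4s-cl.
REQUIRED_SYMBOLS = ("MPI_Init", "MPI_Comm_rank", "MPI_Finalize")


def load_first_available(sonames):
    """Try each soname in order; return (soname, handle), or None if all fail."""
    for soname in sonames:
        try:
            return soname, ctypes.CDLL(soname, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            # The dynamic linker could not find or load this soname.
            continue
    return None


def missing_symbols(handle, symbols=REQUIRED_SYMBOLS):
    """List the symbols that cannot be resolved in the loaded library."""
    return [name for name in symbols if not hasattr(handle, name)]


def smoke_test(handle):
    """Call MPI_Init(NULL, NULL), then MPI_Finalize(); return both status codes."""
    init_status = handle.MPI_Init(None, None)
    finalize_status = handle.MPI_Finalize()
    return init_status, finalize_status
```

A run would pass the candidate sonames to load_first_available, check missing_symbols on the returned handle, and then call smoke_test under the launcher. Running the initialization is what triggers any lazily loaded dependencies, which a tracer can then record.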

You can see this in action here:

[Debug e4s_cl.util:211] Running with parent status: ['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/python', '/scratch2/GFDL/e4s/bin/bin/e4s-cl', 'profile', 'detect', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester']
Failed to determine necessary libraries: program exited with code 156

You can see how this is done here. Intel MPI is treated as MPICH as they share ABI and sonames.

As Frederick suggested, something is preventing the proper analysis of your MPI environment. Please share the contents of the created profile and, if possible, compile a sample MPI program in this environment and share the output of ldd on it. A common cause is an RPATH or a non-standard soname that departs from usual MPI packaging practice, which e4s-cl cannot adjust for.
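When reading the ldd output, the lines of interest are the MPI entries and anything marked "not found". A small Python sketch that extracts them follows. The function names are hypothetical, and the parser assumes the usual "soname => path (address)" layout of glibc's ldd.

```python
import re

# Matches lines such as "libmpi.so.12 => /path/libmpi.so.12 (0x...)"
# and "libfoo.so.1 => not found". Lines without "=>" are ignored.
LDD_LINE = re.compile(r"^\s*(?P<soname>\S+)\s+=>\s+(?P<target>not found|\S+)")


def parse_ldd(output):
    """Map each soname in ldd output to its resolved path, or None if not found."""
    resolved = {}
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            target = match.group("target")
            resolved[match.group("soname")] = None if target == "not found" else target
    return resolved


def mpi_entries(resolved):
    """Keep only the entries whose soname mentions MPI."""
    return {name: path for name, path in resolved.items() if "mpi" in name.lower()}
```

The input could come from subprocess.run(["ldd", "./a.out"], capture_output=True, text=True).stdout, where ./a.out stands in for the compiled sample program. An MPI soname other than the three listed earlier, or one resolved from an unexpected directory, would point to the RPATH or soname problem described above.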

If you can, try running the tester script, /scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester, directly in your desired MPI environment and see whether it gives any information about what is failing.