Open thomas-robinson opened 3 years ago
Hi Tom, Did you use e4s-cl init before this? What command did you use?
Thanks,
On Aug 10, 2021, at 6:05 AM, Tom Robinson @.***> wrote:
I am trying to use profile detect to set up my profile. Here is my sample Fortran program that sums the ranks (run with 11 ranks, isum = 55):
    PROGRAM hello_world_mpi
    include 'mpif.h'

    integer process_Rank, size_Of_Cluster, ierror
    integer root_rank, isum

    call MPI_INIT(ierror)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror)
    call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror)

    root_rank = 0
    call MPI_Reduce(process_rank, isum, 1, MPI_INT, MPI_SUM, root_rank, MPI_COMM_WORLD, ierror);
    call MPI_bcast (isum, 1, MPI_INTEGER, root_rank, MPI_COMM_WORLD, ierror)

    print *, 'Hello World from process: ', process_Rank, 'of ', size_Of_Cluster, 'sum = ', isum
    end program

I compiled it with mpiifort
    $ mpiifort -v
    mpiifort for the Intel(R) MPI Library 2019 Update 9 for Linux*
    Copyright 2003-2020, Intel Corporation.
    ifort version 19.1.3.304

This program runs with mpirun

    $ mpirun --version
    Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923 (id: abd58e492)
    Copyright 2003-2020, Intel Corporation.
    $ mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov ./test.x

Here is my e4s-cl command
    $ e4s-cl profile detect -p am4Run mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov ./test.x
    Failed to determine necessary libraries.

The advice in the documentation is to specify multiple hosts (https://e4s-project.github.io/e4s-cl/reference/profiles/detect.html#profile-detect), but this is a single node system with 128 cores. How can I get all of the libraries needed to run on my system?
I am seeing a similar issue with Intel 21. This is peculiar, since the same library is detected without issue with C programs.
Can you provide the output of e4s-cl -v profile detect -p am4Run mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov ./test.x? I am seeing MPI errors reporting that the program has been killed.
The warning in the documentation is meant to prevent false positives. Some libraries lazy-load dependencies depending on the hosts to run on, and on multi-node systems this can result in incomplete profiles. You do not need to worry about this here.
On further testing, it seems like the issue does not originate in e4s-cl. The library detection is done by leveraging the ptrace capabilities and recording all open and openat syscalls.
This seems to fail with the binaries created by mpiifort. The exact same error happens when using strace:
$ mpirun -np 2 strace -e open,openat ./issue-28
[...]
ofi_mlx_hcoll.dat", O_RDONLY) = 67
) = 67
Hello World from process: 0 of 2 sum = 1
Hello World from process: 1 of 2 sum = 1
+++ exited with 0 +++
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 1345869 RUNNING AT illyad
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
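The open/openat recording described above can be approximated by filtering strace output. This is only a minimal sketch, not how e4s-cl implements detection: extract_opened_libs is a hypothetical helper, and it assumes the default strace line format (optionally prefixed with [pid N] when following child processes).

```shell
# Hypothetical helper (not part of e4s-cl): read strace-style lines on stdin
# and print the shared libraries that were opened successfully.
extract_opened_libs() {
    # 1. keep open/openat lines, with or without a "[pid N] " prefix
    # 2. keep calls that returned a file descriptor (failed calls end with
    #    "= -1 ERRNO (...)" and are dropped)
    # 3. extract the first quoted argument, which is the path
    # 4. keep names that look like shared objects (libfoo.so, libfoo.so.1.2)
    # 5. de-duplicate
    grep -E '^(\[pid +[0-9]+\] )?(open|openat)\(' \
        | grep -E '\) = [0-9]+$' \
        | sed -E 's/^[^"]*"([^"]*)".*$/\1/' \
        | grep -E '\.so(\.[0-9]+)*$' \
        | sort -u
}
```

Since strace writes its trace to stderr, a run could be piped through the filter with something like `mpirun -np 2 strace -e open,openat ./issue-28 2>&1 | extract_opened_libs`. Files loaded through other means, such as the .dat tuning files, are deliberately ignored by this filter.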
I will look into this. I am positive profile detection worked with Fortran binaries compiled with other MPI flavours, so this must be an Intel quirk.
A profile created with a C program can usually also be used after adding the Fortran (libmpifort.so) libraries from the install directory.
e4s-cl init --profile am4Run
e4s-cl profile edit --add-libraries </path/to/libmpifort.so> ...
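To fill in the library path for the command above, the Fortran bindings can be located in the MPI install directory. A minimal sketch, where find_mpifort_libs is a hypothetical helper and the install directory is whatever your site uses (Intel MPI's environment scripts typically export it as I_MPI_ROOT, but treat that as an assumption for your system):

```shell
# Hypothetical helper: list libmpifort shared objects (files or symlinks)
# found under an MPI install directory given as the first argument.
find_mpifort_libs() {
    find "$1" \( -type f -o -type l \) -name 'libmpifort.so*' 2>/dev/null | sort
}
```

The paths it prints, for example from `find_mpifort_libs "$I_MPI_ROOT"`, can then be passed to `e4s-cl profile edit --add-libraries`.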
Sorry for the delay and long post. I couldn't post before for some reason.
Yes, I ran e4s-cl init. I forgot to mention that
$ e4s-cl init
The target launcher /opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/bin/mpirun uses a single host by default, which may tamper with the library discovery. Consider running `e4s-cl profile detect` using mpirun specifying multiple hosts.
$ e4s-cl profile detect -p am4Run mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov ./test.x
Failed to determine necessary libraries.
I ran the debug and got this:
$ e4s-cl -v profile detect -p am4Run mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov ./test.x
[Debug] Arguments: Namespace(command='profile', options=['detect', '-p', 'am4Run', 'mpirun', '-np', '11', '-hosts', 'lscamd50-d.gfdl.noaa.gov', './test.x'], dry_run=None, slave=None, verbose='DEBUG')
[Debug] Verbosity level: DEBUG
[Debug] e4s-cl profile args: Namespace(subcommand='detect', options=['-p', 'am4Run', 'mpirun', '-np', '11', '-hosts', 'lscamd50-d.gfdl.noaa.gov', './test.x'])
[Debug] e4s-cl profile detect args: Namespace(profile_name='am4Run', cmd=['mpirun', '-np', '11', '-hosts', 'lscamd50-d.gfdl.noaa.gov', './test.x'])
[Debug] Creating subprocess: mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov /home/Thomas.Robinson/e4s-cl/bin/e4s-cl --slave profile detect ./test.x
[Debug] Hello World from process: 5 of 11 sum = 55
Hello World from process: 3 of 11 sum = 55
Hello World from process: 0 of 11 sum = 55
Hello World from process: 1 of 11 sum = 55
Hello World from process: 4 of 11 sum = 55
Hello World from process: 6 of 11 sum = 55
Hello World from process: 7 of 11 sum = 55
Hello World from process: 8 of 11 sum = 55
Hello World from process: 2 of 11 sum = 55
Hello World from process: 10 of 11 sum = 55
Hello World from process: 9 of 11 sum = 55
{"files": {"__type": "set", "__list": ["/opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/lib/release/libmpi.so.12", "/etc/libnl/classid", "/opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/etc/tuning_generic_shm-ofi.dat"]}, "libraries": {"__type": "set", "__list": ["/lib64/libpsm2.so.2", "/lib64/libnl-route-3.so.200", "/lib64/libc.so.6", "/lib64/libnl-3.so.200", "/lib64/libfabric.so.1", "/lib64/libgcc_s.so.1", "/lib64/libnuma.so.1", "/lib64/libm.so.6", "/lib64/librt.so.1", "/opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/lib/libmpifort.so.12", "/lib64/libdl.so.2", "/lib64/libpthread.so.0", "/lib64/libibverbs.so.1", "/lib64/librdmacm.so.1", "/lib64/libefa.so.1"]}}
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 2127227 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 2127229 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 2127230 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 2127231 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 2127232 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 2127233 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 2127234 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 8 PID 2127235 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 9 PID 2127236 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 2127237 RUNNING AT lscamd50-d.gfdl.noaa.gov
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
[Debug] ['mpirun', '-np', '11', '-hosts', 'lscamd50-d.gfdl.noaa.gov', '/home/Thomas.Robinson/e4s-cl/bin/e4s-cl', '--slave', 'profile', 'detect', './test.x'] returned 255
Failed to determine necessary libraries.
If I change to a C program, do you think it will work then? What libraries do I need to link in?
I tried to launch with the default profile created, but I get a different error
$ e4s-cl launch --backend singularity --image am4_2021.03_ubuntu_intel.sif mpirun -n 48 ./2021.03_run.sh
Using selected profile default-137215bba819ae9d045d5b51c339b35e38c270bdafcf5d6a9181ae2e3640502d
2137479 on lscamd50-d.gfdl.noaa.gov: ./2021.03_run.sh: error while loading shared libraries: ./2021.03_run.sh: invalid ELF header
Maybe this is a different issue.
GitHub's servers were down for a little while; I wasn't able to edit either!
I tested the profile detection with C programs and it should work. This step is only needed to detect the libraries; you can then run Fortran binaries with the tool and it should work, as ptrace is not invoked during execution.
This is another issue, unfortunately. e4s-cl is trying to override the container's dynamic linker, but because the final command is a shell script, the linker gets confused: it expects a binary. Depending on the contents of 2021.03_run.sh, you can create a shell script with all the setup steps and pass the final binary call on the command line. e4s-cl can source scripts before execution in the container.
Here I add all but the last line to a setup script, and pass it to e4s-cl by editing the profile:
head -n -1 2021.03_run.sh > setup.sh
e4s-cl profile edit --source $PWD/setup.sh --backend singularity --image am4_2021.03_ubuntu_intel.sif
e4s-cl launch mpirun -n 48 `tail -n 1 2021.03_run.sh`
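The split performed by the head and tail calls above can be wrapped in a small function to check what ends up where before editing the profile. This is a sketch under the assumption that the run script ends with a single-line command and no trailing blank line; split_run_script is a hypothetical name, and `sed '$d'` is used as a portable equivalent of `head -n -1`, which is a GNU extension.

```shell
# Hypothetical helper: write every line but the last of the run script ($1)
# to a setup script ($2), and print the last line (the final command).
split_run_script() {
    sed '$d' "$1" > "$2"
    tail -n 1 "$1"
}
```

Calling `split_run_script 2021.03_run.sh setup.sh` prints the command that would be passed to `e4s-cl launch mpirun -n 48`, while setup.sh receives the preceding setup steps.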