Closed: zekemorton closed this 2 years ago
Thanks Zeke!
I just built this and swift-t worked for me:
$ docker run -it centos7-ompi-swift-2021-09-03_11-33 bash
[root@f449f0e99ea4 /]# which pip
/ve_exaworks/bin/pip
[root@f449f0e99ea4 /]# export TURBINE_LAUNCH_OPTIONS=--allow-run-as-root
[root@f449f0e99ea4 /]# swift-t -v
STC: Swift-Turbine Compiler 0.9.0
for Turbine: 1.3.0
Using Java VM: /usr/bin/java
Using Turbine in: /opt/swift-t/turbine
Turbine 1.3.0
installed: /opt/swift-t/turbine
source: /tmp/build-swift-t/swift-t/turbine/code
using CC: /usr/local/bin/mpicc
using MPI: /usr/local/lib mpi "OpenMPI"
using Tcl: /opt/tcl-8.6.11/bin/tclsh8.6
[root@f449f0e99ea4 /]# swift-t -E 'trace(42);'
[f449f0e99ea4:00138] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_ess_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/usb3/docker/overlay2/l/3WU5MFEP4FNEMNZRUAMOKDFP7S:/usb3/docker/overlay2/l/I2OZJVC7RKPLJTPIMNJ6VM63ZC:/usb3/docker/overlay2/l/VTCSSM3CAUZDT344WAZ2K4JY32:/usb3/docker/overlay2/l/MWJZV5MX3FF5MHF7H62WSX53B5:/usb3/docker/overlay2/l/NLSW4Y3TIEIM23MOVPAB2CHJRG:/usb3/docker/overlay2/l/7VHBJSN4CTBRZTZOOXL3UPEKTW:/usb3/docker/overlay2/l/2HBTE2G4OG5XLZD5N5RKI6M5QU:/usb3/docker/overlay2/l/ZC4KSZELO55NQ2KODHGEJZEWXN:/usb3/docker/overlay2/l/QQJWIIJ4K2XJE4V7ONESNDYYRG:/usb3/docker/'
[f449f0e99ea4:00138] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_db_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[f449f0e99ea4:00138] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_grpcomm_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
trace: 42
$ mpiexec --allow-run-as-root echo
[f449f0e99ea4:00161] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_ess_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/usb3/docker/overlay2/l/3WU5MFEP4FNEMNZRUAMOKDFP7S:/usb3/docker/overlay2/l/I2OZJVC7RKPLJTPIMNJ6VM63ZC:/usb3/docker/overlay2/l/VTCSSM3CAUZDT344WAZ2K4JY32:/usb3/docker/overlay2/l/MWJZV5MX3FF5MHF7H62WSX53B5:/usb3/docker/overlay2/l/NLSW4Y3TIEIM23MOVPAB2CHJRG:/usb3/docker/overlay2/l/7VHBJSN4CTBRZTZOOXL3UPEKTW:/usb3/docker/overlay2/l/2HBTE2G4OG5XLZD5N5RKI6M5QU:/usb3/docker/overlay2/l/ZC4KSZELO55NQ2KODHGEJZEWXN:/usb3/docker/overlay2/l/QQJWIIJ4K2XJE4V7ONESNDYYRG:/usb3/docker/'
[f449f0e99ea4:00161] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_db_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[f449f0e99ea4:00161] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_grpcomm_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
OK, we discussed this after the call: part of this is just a warning from OpenMPI when `PACKAGE_MANAGER=pip`. It looks like there is a more serious issue when `PACKAGE_MANAGER=conda`; I will try that next...
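For reference, if we only want to quiet the cosmetic PMI warnings (separate from the conda issue), something along these lines should do it; the MCA parameter is standard Open MPI, but the plugin path and the cleanup step are an untested sketch based on the log above:

```bash
# Tell Open MPI's MCA layer not to report components it cannot load,
# which hides the "unable to open ... (ignored)" warnings.
export OMPI_MCA_mca_base_component_show_load_errors=0

# Alternatively, if PMI support is never needed inside the container,
# removing the stale PMI plugins avoids the warnings entirely
# (path taken from the log above).
rm -f /usr/local/lib/openmpi/mca_*_pmi.so
```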
@mtitov I made the changes to drop support for centos8 + conda and moved the code from the Dockerfile into a new script. I believe this is ready to merge pending your approval.
@zekemorton Zeke, thank you! I will go through it soon.
Let's discuss this today to see if we can quickly merge this in. It would be wise to close out past items and start focusing on the current thrust: Spack and ECP HPC CI work. :-)
Follow-up comments on latest updates:
- `gcc` and `g++` from conda wouldn't work out, so the ones from the system repo were used instead;
- (1) with the system gcc, the `flux-sched` build kept failing; (2) with a manually built gcc, all packages build, but the tests for `flux` and `swift-t` failed.

CentOS7 - building `flux-sched` - gcc 4.8.5
#6 11.35 flux-sched version 0.17.0
#6 11.35 Prefix...........: /usr
#6 11.35 Debug Build......:
#6 11.35 C Compiler.......: gcc -std=gnu99
#6 11.35 C++ Compiler.....: g++ -std=c++11
#6 11.35 CFLAGS...........: -g -O2
#6 11.35 CPPFLAGS..........
#6 11.35 CXXFLAGS.......... -g -O2
#6 11.35 FLUX.............: /usr/bin/flux
#6 11.35 FLUX_VERSION.....: 0.28.0
#6 11.35 FLUX_CORE_CFLAGS.:
#6 11.35 FLUX_CORE_LIBS...: -lflux-core
#6 11.35 LIBFLUX_VERSION..: 0.28.0
#6 11.35 FLUX_PREFIX......: /usr
#6 11.35 LDFLAGS..........: -Wl,-rpath,/ve_exaworks/lib -L/ve_exaworks/lib
#6 11.35 LIBS.............:
#6 11.35 Linker...........: /usr/bin/ld -m elf_x86_64
.....
#6 19.81 make[3]: Entering directory '/flux-sched-0.17.0/resource/libjobspec'
#6 19.81 CXX libjobspec_conv_la-jobspec.lo
#6 19.82 CXX flux-jobspec-validate.o
#6 24.12 CXXLD libjobspec_conv.la
#6 24.23 CXXLD flux-jobspec-validate
#6 24.39 ./.libs/libjobspec_conv.a(libjobspec_conv_la-jobspec.o): In function `Flux::Jobspec::Jobspec::Jobspec(std::string const&)':
#6 24.39 /flux-sched-0.17.0/resource/libjobspec/jobspec.cpp:411: undefined reference to `YAML::Load(std::string const&)'
#6 24.39 ./.libs/libjobspec_conv.a(libjobspec_conv_la-jobspec.o): In function `YAML::Node::Scalar() const':
#6 24.39 /ve_exaworks/include/yaml-cpp/node/impl.h:169: undefined reference to `YAML::detail::node_data::empty_scalar()'
#6 24.39 collect2: error: ld returned 1 exit status
#6 24.39 make[3]: *** [Makefile:492: flux-jobspec-validate] Error 1
#6 24.39 make[3]: Leaving directory '/flux-sched-0.17.0/resource/libjobspec'
#6 24.39 make[2]: *** [Makefile:1054: all-recursive] Error 1
#6 24.39 make[2]: Leaving directory '/flux-sched-0.17.0/resource'
#6 24.39 make[1]: Leaving directory '/flux-sched-0.17.0'
#6 24.39 make[1]: *** [Makefile:512: all-recursive] Error 1
#6 24.39 make: *** [Makefile:444: all] Error 2
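A plausible cause (an assumption on my part, not confirmed from this log) is a libstdc++ ABI mismatch: the yaml-cpp under `/ve_exaworks` was presumably built by a newer compiler with the C++11 `std::string` ABI, which gcc 4.8.5 cannot link against, hence the undefined `YAML::Load(std::string const&)`. A quick check, assuming the library path from the LDFLAGS above:

```bash
# Count symbols tagged with the new C++11 ABI in the yaml-cpp library;
# a non-zero count would mean gcc 4.8.5 cannot resolve them.
nm -D --defined-only /ve_exaworks/lib/libyaml-cpp.so* | grep -c '__cxx11'

# Show how YAML::Load is actually exported, for comparison with the
# symbol the linker is looking for.
nm -D --defined-only /ve_exaworks/lib/libyaml-cpp.so* | c++filt | grep 'YAML::Load'
```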
CentOS7 - gcc 8.5.0
- most likely I didn't set some paths correctly, but couldn't figure out what and where
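In case it helps, when a hand-built gcc is used on CentOS 7, the environment that usually has to be set looks roughly like the sketch below; the `/opt/gcc-8.5.0` prefix is hypothetical and would need to match wherever the compiler was actually installed:

```bash
# Hypothetical install prefix for the manually built gcc 8.5.0.
GCC_PREFIX=/opt/gcc-8.5.0

export PATH="$GCC_PREFIX/bin:$PATH"
export CC="$GCC_PREFIX/bin/gcc"
export CXX="$GCC_PREFIX/bin/g++"

# Make the matching libstdc++ visible at link and run time; otherwise
# binaries built with gcc 8 pick up the old libstdc++ from gcc 4.8.5.
export LD_LIBRARY_PATH="$GCC_PREFIX/lib64:${LD_LIBRARY_PATH:-}"
export LDFLAGS="-Wl,-rpath,$GCC_PREFIX/lib64 ${LDFLAGS:-}"
```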
flux testing:
========================================
flux-core 0.28.0: t/test-suite.log
========================================
# TOTAL: 8
# PASS: 0
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 8
.. contents:: :depth: 2
ERROR: t2610-job-shell-mpir
===========================
lua: (command line):1: module 'posix' not found:
no field package.preload['posix']
no file './posix.lua'
no file './posix.lua'
no file '/usr/share/lua/5.1/posix.lua'
no file '/usr/share/lua/5.1/posix/init.lua'
no file '/usr/lib64/lua/5.1/posix.lua'
no file '/usr/lib64/lua/5.1/posix/init.lua'
no file './posix.lua'
no file '/usr/share/lua/5.1/posix.lua'
no file '/usr/share/lua/5.1/posix/init.lua'
no file '/usr/lib64/lua/5.1/posix.lua'
no file '/usr/lib64/lua/5.1/posix/init.lua'
no file '/tmp/flux-core/src/bindings/lua/.libs/posix.so'
no file './posix.so'
no file '/usr/lib64/lua/5.1/posix.so'
no file '/usr/lib64/lua/5.1/loadall.so'
stack traceback:
[C]: in function 'require'
(command line):1: in main chunk
[C]: ?
error: failed to find lua posix module in path
ERROR: t2610-job-shell-mpir.t - missing test plan
ERROR: t2610-job-shell-mpir.t - exited with status 1
ERROR: t3000-mpi-basic
======================
lua: (command line):1: module 'posix' not found:
no field package.preload['posix']
no file './posix.lua'
no file './posix.lua'
no file '/usr/share/lua/5.1/posix.lua'
no file '/usr/share/lua/5.1/posix/init.lua'
no file '/usr/lib64/lua/5.1/posix.lua'
no file '/usr/lib64/lua/5.1/posix/init.lua'
no file './posix.lua'
no file '/usr/share/lua/5.1/posix.lua'
no file '/usr/share/lua/5.1/posix/init.lua'
no file '/usr/lib64/lua/5.1/posix.lua'
no file '/usr/lib64/lua/5.1/posix/init.lua'
no file '/tmp/flux-core/src/bindings/lua/.libs/posix.so'
no file './posix.so'
no file '/usr/lib64/lua/5.1/posix.so'
no file '/usr/lib64/lua/5.1/loadall.so'
stack traceback:
[C]: in function 'require'
(command line):1: in main chunk
[C]: ?
error: failed to find lua posix module in path
ERROR: t3000-mpi-basic.t - missing test plan
ERROR: t3000-mpi-basic.t - exited with status 1
ERROR: t3001-mpi-personalities
==============================
lua: (command line):1: module 'posix' not found:
no field package.preload['posix']
no file './posix.lua'
no file './posix.lua'
no file '/usr/share/lua/5.1/posix.lua'
no file '/usr/share/lua/5.1/posix/init.lua'
no file '/usr/lib64/lua/5.1/posix.lua'
no file '/usr/lib64/lua/5.1/posix/init.lua'
no file './posix.lua'
no file '/usr/share/lua/5.1/posix.lua'
no file '/usr/share/lua/5.1/posix/init.lua'
no file '/usr/lib64/lua/5.1/posix.lua'
no file '/usr/lib64/lua/5.1/posix/init.lua'
no file '/tmp/flux-core/src/bindings/lua/.libs/posix.so'
no file './posix.so'
no file '/usr/lib64/lua/5.1/posix.so'
no file '/usr/lib64/lua/5.1/loadall.so'
stack traceback:
[C]: in function 'require'
(command line):1: in main chunk
[C]: ?
error: failed to find lua posix module in path
ERROR: t3001-mpi-personalities.t - missing test plan
ERROR: t3001-mpi-personalities.t - exited with status 1
ERROR: t3003-mpi-abort
======================
lua: (command line):1: module 'posix' not found:
no field package.preload['posix']
no file './posix.lua'
no file './posix.lua'
no file '/usr/share/lua/5.1/posix.lua'
no file '/usr/share/lua/5.1/posix/init.lua'
no file '/usr/lib64/lua/5.1/posix.lua'
no file '/usr/lib64/lua/5.1/posix/init.lua'
no file './posix.lua'
no file '/usr/share/lua/5.1/posix.lua'
no file '/usr/share/lua/5.1/posix/init.lua'
no file '/usr/lib64/lua/5.1/posix.lua'
no file '/usr/lib64/lua/5.1/posix/init.lua'
no file '/tmp/flux-core/src/bindings/lua/.libs/posix.so'
no file './posix.so'
no file '/usr/lib64/lua/5.1/posix.so'
no file '/usr/lib64/lua/5.1/loadall.so'
stack traceback:
[C]: in function 'require'
(command line):1: in main chunk
[C]: ?
error: failed to find lua posix module in path
ERROR: t3003-mpi-abort.t - missing test plan
ERROR: t3003-mpi-abort.t - exited with status 1
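All of these errors come from the same missing Lua `posix` module, so this looks like a test-environment gap rather than a flux-core problem. A possible fix, with package names being my assumption for this image rather than something verified inside it:

```bash
# The MPI test scripts need the Lua "posix" module; on CentOS 7 it is
# packaged as lua-posix (may require EPEL, depending on the repo set).
yum install -y lua-posix

# Or, if the image uses luarocks instead of system packages:
# luarocks install luaposix
```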
swift-t testing:
+ export TURBINE_LAUNCH_OPTIONS=--allow-run-as-root
+ TURBINE_LAUNCH_OPTIONS=--allow-run-as-root
+ swift-t -v
STC: Swift-Turbine Compiler 0.9.0
for Turbine: 1.3.0
Using Java VM: /ve_exaworks/bin/java
Using Turbine in: /opt/swift-t/turbine
Turbine 1.3.0
installed: /opt/swift-t/turbine
source: /tmp/build-swift-t/swift-t/turbine/code
using CC: /usr/local/bin/mpicc
using MPI: /usr/local/lib mpi "OpenMPI"
using Tcl: /ve_exaworks/bin/tclsh8.6
+ swift-t -E 'trace(42);'
[eabc83c3bd63:00172] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_ess_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/46R5UDPO4OP366UUGPROPGMZMY:/var/lib/docker/overlay2/l/5VEQ4HPIC7BKM365SDNWV5MDJS:/var/lib/docker/overlay2/l/UWQLT5PCSGIGF3RIRQZMSQ72X5:/var/lib/docker/overlay2/l/RLIN7OCX6BC6UXW2QMWCOTPTGP:/var/lib/docker/overlay2/l/EFXDWJ33VYUFUJ7IA7LCTPJZOH:/var/lib/docker/overlay2/l/VK3IHHBXSO6AEO72FMTCR2HM5F:/var/lib/docker/overlay2/l/P7HXKD4CIH3C7G32Q76JU55E6B:/var/lib/docker/overlay2/l/CDQQ2SO7CDESCUTHFN43JG5TM7:/var/lib/docker/overlay2/l/O4E2WHKZ23PRP'
Unexpected end of /proc/mounts line `HPM6UUIBGNROE:/var/lib/docker/overlay2/l/CQKPRDDS7FF2DIMM7CAFSE4HL2:/var/lib/docker/overlay2/l/O5AEUSZMP4K4H26VB7EPUW3FO4:/var/lib/docker/overlay2/l/PGIYWXTGVSPCPHIAYU5CDPTNQU:/var/lib/docker/overlay2/l/24L7H5XJ6G5ESILEKZAL72OVA3:/var/lib/docker/overlay2/l/67IJ7EV2FO622ED6IDE6KIYIQY:/var/lib/docker/overlay2/l/JWCFV2TVNV7HCXGUM2JYUFTVZJ:/var/lib/docker/overlay2/l/2WC4PTEMMYYVY4IGWOHRJXZ4AL:/var/lib/docker/overlay2/l/6TGSCLHEOQ67MBSRUJL77TZDIY:/var/lib/docker/overlay2/l/RZHZXEDKLM44LEF3MUQGCIZ3DP:/var/lib/do'
[eabc83c3bd63:00172] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_db_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[eabc83c3bd63:00172] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_grpcomm_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[eabc83c3bd63:00175] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_posix: /usr/local/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[eabc83c3bd63:00175] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_mmap: /usr/local/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[eabc83c3bd63:00175] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_sysv: /usr/local/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
[eabc83c3bd63:00174] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_posix: /usr/local/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[eabc83c3bd63:00174] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_mmap: /usr/local/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[eabc83c3bd63:00174] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_sysv: /usr/local/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[eabc83c3bd63:174] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[eabc83c3bd63:175] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[6590,1],0]
Exit code: 1
--------------------------------------------------------------------------
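One working hypothesis (not verified) for the `opal_shmem_base_framework` / `opal_show_help` failures is that two Open MPI installations are getting mixed: the plugins under `/usr/local/lib/openmpi` are being loaded against a different `libopen-pal` (for example one pulled in from the conda environment via `/ve_exaworks/lib`), so their symbols do not resolve. A few checks that could confirm or rule that out:

```bash
# Which MPI front-ends are actually on PATH, and which runtime
# libraries they resolve to.
which mpiexec mpicc
ldd "$(which mpiexec)" | grep -E 'open-pal|open-rte|mpi'

# Search order that decides which libopen-pal gets loaded.
echo "$LD_LIBRARY_PATH"
ompi_info | head -n 5

# If a second MPI came in via conda, keeping /usr/local/lib ahead of
# /ve_exaworks/lib (or removing the conda MPI packages) may resolve it.
export LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH:-}"
```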
@zekemorton and @mtitov: thank you again for adding this into our CI mix. Please see my comments -- mainly software engineering practice feedback. We want to keep the commit history in `main` such that someone other than us can understand where each commit came from and what problem it solves. That practice will come in really handy for debugging once our software is used in production.
As an example, I recently added a PR to our Fluxion scheduler with that in mind, for your reference (not that it is the best PR, just a reference): https://github.com/flux-framework/flux-sched/pull/895
Let me know if you have any questions.
FYI -- the Patch Requirements section referred to in our contribution guide documentation should be useful.
@dongahn I went ahead and adjusted commit history, squashing a few commits and rewriting commit messages. I can't seem to find your feedback after pushing the changes, so I am not sure if I addressed all of your comments. Would you mind taking another look?
The tests seem to be failing because of an issue between RP and the new version of `pymongo`, so we will likely need to wait for a fix before the tests pass again.
RP v1.10.1 has been released with a `pymongo`-related hotfix.
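If the CI images need a stopgap until that release is picked up, pinning is one option; the version bounds below are assumptions (based on the `pymongo` 4.x API break), not something taken from RP's requirements:

```bash
# Either hold pymongo back on the 3.x series...
pip install 'pymongo<4'
# ...or require the RP release that carries the hotfix.
pip install 'radical.pilot>=1.10.1'
```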
@dongahn Dong, can we still ask for your review/approval, or should we proceed with the `merge-when-passing` label on our own?
Let's discuss this one last time this Friday before merging this in.
This could be a good lesson on best practice for how the SDK can make progress through the churn of its dependencies (e.g., the RP version and its own dependencies, such as `pymongo`).
I have edited both the centos7 and centos8 base images to use a `PACKAGE_MANAGER` build argument that tells the build to use either pip or conda to create the virtual environment and install the Python dependencies. When building the Docker images from the other Dockerfiles, they still use pip to install packages into the virtual environment, but in that case it may be the virtual environment created by conda.
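For illustration, the build argument would be passed roughly as follows; the image tag and Dockerfile path here are placeholders rather than the exact names in this repo:

```bash
# Build the CentOS 7 base image with the conda-managed virtual environment;
# PACKAGE_MANAGER=pip selects the pip-based path described above.
docker build \
  --build-arg PACKAGE_MANAGER=conda \
  -t exaworks-centos7-conda \
  -f docker/centos7/Dockerfile .
```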
As of right now, it looks like it is passing tests for Flux, RP, and Parsl. I am running into some errors with swift-t around the MPI libraries that I am not sure how to resolve.
I also updated the GitHub Actions workflows to include tests for both conda and pip as part of the matrix.
Work In Progress [WIP]