ExaWorks SDK

Add conda to base image and include in automated testing #70

Closed by zekemorton 2 years ago

zekemorton commented 3 years ago

I have edited both the centos7 and centos8 base images to use a PACKAGE_MANAGER build variable that selects whether pip or conda is used to create the virtual environment and install the Python dependencies. The images built from the other Dockerfiles still use pip to install packages into the virtual environment, but that environment may now be the one created by conda.
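A minimal sketch of that switch, assuming a hypothetical helper script name (the environment path /ve_exaworks matches the images below; everything else is illustrative):

#!/bin/bash
# install-python-env.sh (hypothetical name): create the virtual environment
# with the selected package manager; PACKAGE_MANAGER is passed as a Docker
# build arg and defaults to pip.
set -e
PACKAGE_MANAGER=${PACKAGE_MANAGER:-pip}
if [ "$PACKAGE_MANAGER" = "conda" ]; then
    # conda creates and owns the environment at /ve_exaworks
    conda create -y -p /ve_exaworks python=3 pip
else
    # plain venv; later Dockerfiles pip-install into it either way
    python3 -m venv /ve_exaworks
fi
/ve_exaworks/bin/pip install --upgrade pip setuptools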

As of right now, it looks like the tests pass for Flux, RP, and Parsl. I am running into some errors with swift-t around the MPI libraries that I am not sure how to resolve.

I also updated the GitHub Actions workflow to include both conda and pip in the test matrix.
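Locally, the same matrix can be exercised with a small loop over the two package managers; a rough sketch, assuming the base-image Dockerfiles take PACKAGE_MANAGER as a build arg (paths and tags are illustrative):

#!/bin/bash
# build both centos7 base-image variants, mirroring the pip/conda CI matrix
for pm in pip conda; do
    docker build \
        --build-arg PACKAGE_MANAGER="$pm" \
        -t exaworks-base:centos7-"$pm" \
        -f docker/base/centos7/Dockerfile .
done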

Work In Progress [WIP]

dongahn commented 3 years ago

Thanks Zeke!

j-woz commented 3 years ago

I just built this and swift-t worked for me.

$ docker run -it centos7-ompi-swift-2021-09-03_11-33 bash

[root@f449f0e99ea4 /]# which pip 
/ve_exaworks/bin/pip
[root@f449f0e99ea4 /]# export TURBINE_LAUNCH_OPTIONS=--allow-run-as-root
[root@f449f0e99ea4 /]# swift-t -v
STC: Swift-Turbine Compiler 0.9.0
         for Turbine: 1.3.0
Using Java VM:    /usr/bin/java
Using Turbine in: /opt/swift-t/turbine

Turbine 1.3.0
 installed:    /opt/swift-t/turbine
 source:       /tmp/build-swift-t/swift-t/turbine/code
 using CC:     /usr/local/bin/mpicc
 using MPI:    /usr/local/lib mpi "OpenMPI"
 using Tcl:    /opt/tcl-8.6.11/bin/tclsh8.6
[root@f449f0e99ea4 /]# swift-t -E 'trace(42);' 
[f449f0e99ea4:00138] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_ess_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/usb3/docker/overlay2/l/3WU5MFEP4FNEMNZRUAMOKDFP7S:/usb3/docker/overlay2/l/I2OZJVC7RKPLJTPIMNJ6VM63ZC:/usb3/docker/overlay2/l/VTCSSM3CAUZDT344WAZ2K4JY32:/usb3/docker/overlay2/l/MWJZV5MX3FF5MHF7H62WSX53B5:/usb3/docker/overlay2/l/NLSW4Y3TIEIM23MOVPAB2CHJRG:/usb3/docker/overlay2/l/7VHBJSN4CTBRZTZOOXL3UPEKTW:/usb3/docker/overlay2/l/2HBTE2G4OG5XLZD5N5RKI6M5QU:/usb3/docker/overlay2/l/ZC4KSZELO55NQ2KODHGEJZEWXN:/usb3/docker/overlay2/l/QQJWIIJ4K2XJE4V7ONESNDYYRG:/usb3/docker/'
[f449f0e99ea4:00138] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_db_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[f449f0e99ea4:00138] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_grpcomm_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
trace: 42
j-woz commented 3 years ago
$ mpiexec --allow-run-as-root  echo 
[f449f0e99ea4:00161] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_ess_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/usb3/docker/overlay2/l/3WU5MFEP4FNEMNZRUAMOKDFP7S:/usb3/docker/overlay2/l/I2OZJVC7RKPLJTPIMNJ6VM63ZC:/usb3/docker/overlay2/l/VTCSSM3CAUZDT344WAZ2K4JY32:/usb3/docker/overlay2/l/MWJZV5MX3FF5MHF7H62WSX53B5:/usb3/docker/overlay2/l/NLSW4Y3TIEIM23MOVPAB2CHJRG:/usb3/docker/overlay2/l/7VHBJSN4CTBRZTZOOXL3UPEKTW:/usb3/docker/overlay2/l/2HBTE2G4OG5XLZD5N5RKI6M5QU:/usb3/docker/overlay2/l/ZC4KSZELO55NQ2KODHGEJZEWXN:/usb3/docker/overlay2/l/QQJWIIJ4K2XJE4V7ONESNDYYRG:/usb3/docker/'
[f449f0e99ea4:00161] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_db_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[f449f0e99ea4:00161] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_grpcomm_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
j-woz commented 3 years ago

OK, we discussed this after the call: part of this is just a warning from OpenMPI when PACKAGE_MANAGER=pip. There appears to be a more serious issue when PACKAGE_MANAGER=conda; I will try that next...
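(For reference, the mca_*_pmi messages are component-load warnings that Open MPI can be asked to silence while the conda case is investigated; a quick check, using only stock Open MPI tools:)

# confirm which Open MPI install is being picked up
which mpicc mpiexec
ompi_info | head
# silence the harmless component-load warnings
mpiexec --allow-run-as-root --mca mca_base_component_show_load_errors 0 echo ok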

zekemorton commented 3 years ago

@mtitov I made the changes to not support centos8 + conda and moved the code from the Dockerfile to a new script. I believe this is ready to merge pending your approval.

mtitov commented 3 years ago

@zekemorton Zeke, thank you! I will go through it soon.

dongahn commented 3 years ago

Let's discuss this today to see if we can quickly merge this in. It would be wise to close the door on past items and start to focus on the current thrust: Spack and ECP HPC CI work. :-)

mtitov commented 3 years ago

Follow-up comments on latest updates:


CentOS7 (gcc 4.8.5), building flux-sched:

#6 11.35   flux-sched version 0.17.0
#6 11.35   Prefix...........: /usr
#6 11.35   Debug Build......: 
#6 11.35   C Compiler.......: gcc -std=gnu99
#6 11.35   C++ Compiler.....: g++ -std=c++11
#6 11.35   CFLAGS...........: -g -O2
#6 11.35   CPPFLAGS.......... 
#6 11.35   CXXFLAGS.......... -g -O2
#6 11.35   FLUX.............: /usr/bin/flux
#6 11.35   FLUX_VERSION.....: 0.28.0
#6 11.35   FLUX_CORE_CFLAGS.: 
#6 11.35   FLUX_CORE_LIBS...: -lflux-core
#6 11.35   LIBFLUX_VERSION..: 0.28.0
#6 11.35   FLUX_PREFIX......: /usr
#6 11.35   LDFLAGS..........: -Wl,-rpath,/ve_exaworks/lib -L/ve_exaworks/lib
#6 11.35   LIBS.............: 
#6 11.35   Linker...........: /usr/bin/ld -m elf_x86_64
.....
#6 19.81 make[3]: Entering directory '/flux-sched-0.17.0/resource/libjobspec'
#6 19.81   CXX      libjobspec_conv_la-jobspec.lo
#6 19.82   CXX      flux-jobspec-validate.o
#6 24.12   CXXLD    libjobspec_conv.la
#6 24.23   CXXLD    flux-jobspec-validate
#6 24.39 ./.libs/libjobspec_conv.a(libjobspec_conv_la-jobspec.o): In function `Flux::Jobspec::Jobspec::Jobspec(std::string const&)':
#6 24.39 /flux-sched-0.17.0/resource/libjobspec/jobspec.cpp:411: undefined reference to `YAML::Load(std::string const&)'
#6 24.39 ./.libs/libjobspec_conv.a(libjobspec_conv_la-jobspec.o): In function `YAML::Node::Scalar() const':
#6 24.39 /ve_exaworks/include/yaml-cpp/node/impl.h:169: undefined reference to `YAML::detail::node_data::empty_scalar()'
#6 24.39 collect2: error: ld returned 1 exit status
#6 24.39 make[3]: *** [Makefile:492: flux-jobspec-validate] Error 1
#6 24.39 make[3]: Leaving directory '/flux-sched-0.17.0/resource/libjobspec'
#6 24.39 make[2]: *** [Makefile:1054: all-recursive] Error 1
#6 24.39 make[2]: Leaving directory '/flux-sched-0.17.0/resource'
#6 24.39 make[1]: Leaving directory '/flux-sched-0.17.0'
#6 24.39 make[1]: *** [Makefile:512: all-recursive] Error 1
#6 24.39 make: *** [Makefile:444: all] Error 2
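The undefined YAML:: references with gcc 4.8.5 look like a C++ ABI mismatch: the yaml-cpp picked up from the conda environment (/ve_exaworks/include, per the LDFLAGS above) was presumably built with a newer GCC and the std::__cxx11 string ABI, which gcc 4.8.5 cannot emit. A quick way to check, library path assumed from the log:

# does the conda-provided yaml-cpp export new-ABI (std::__cxx11) symbols?
nm -DC /ve_exaworks/lib/libyaml-cpp.so | grep 'YAML::Load'
# gcc 4.8.5 produces old-ABI std::string symbols, so the link cannot resolve;
# building flux-sched against a system yaml-cpp or with a newer compiler avoids this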

CentOS7 (gcc 8.5.0): most likely I didn't set some paths correctly, but I couldn't figure out what or where.

Flux testing:

========================================
   flux-core 0.28.0: t/test-suite.log
========================================

# TOTAL: 8
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 8

.. contents:: :depth: 2

ERROR: t2610-job-shell-mpir
===========================

lua: (command line):1: module 'posix' not found:
    no field package.preload['posix']
    no file './posix.lua'
    no file './posix.lua'
    no file '/usr/share/lua/5.1/posix.lua'
    no file '/usr/share/lua/5.1/posix/init.lua'
    no file '/usr/lib64/lua/5.1/posix.lua'
    no file '/usr/lib64/lua/5.1/posix/init.lua'
    no file './posix.lua'
    no file '/usr/share/lua/5.1/posix.lua'
    no file '/usr/share/lua/5.1/posix/init.lua'
    no file '/usr/lib64/lua/5.1/posix.lua'
    no file '/usr/lib64/lua/5.1/posix/init.lua'
    no file '/tmp/flux-core/src/bindings/lua/.libs/posix.so'
    no file './posix.so'
    no file '/usr/lib64/lua/5.1/posix.so'
    no file '/usr/lib64/lua/5.1/loadall.so'
stack traceback:
    [C]: in function 'require'
    (command line):1: in main chunk
    [C]: ?
error: failed to find lua posix module in path
ERROR: t2610-job-shell-mpir.t - missing test plan
ERROR: t2610-job-shell-mpir.t - exited with status 1

ERROR: t3000-mpi-basic
======================

lua: (command line):1: module 'posix' not found:
    no field package.preload['posix']
    no file './posix.lua'
    no file './posix.lua'
    no file '/usr/share/lua/5.1/posix.lua'
    no file '/usr/share/lua/5.1/posix/init.lua'
    no file '/usr/lib64/lua/5.1/posix.lua'
    no file '/usr/lib64/lua/5.1/posix/init.lua'
    no file './posix.lua'
    no file '/usr/share/lua/5.1/posix.lua'
    no file '/usr/share/lua/5.1/posix/init.lua'
    no file '/usr/lib64/lua/5.1/posix.lua'
    no file '/usr/lib64/lua/5.1/posix/init.lua'
    no file '/tmp/flux-core/src/bindings/lua/.libs/posix.so'
    no file './posix.so'
    no file '/usr/lib64/lua/5.1/posix.so'
    no file '/usr/lib64/lua/5.1/loadall.so'
stack traceback:
    [C]: in function 'require'
    (command line):1: in main chunk
    [C]: ?
error: failed to find lua posix module in path
ERROR: t3000-mpi-basic.t - missing test plan
ERROR: t3000-mpi-basic.t - exited with status 1

ERROR: t3001-mpi-personalities
==============================

lua: (command line):1: module 'posix' not found:
    no field package.preload['posix']
    no file './posix.lua'
    no file './posix.lua'
    no file '/usr/share/lua/5.1/posix.lua'
    no file '/usr/share/lua/5.1/posix/init.lua'
    no file '/usr/lib64/lua/5.1/posix.lua'
    no file '/usr/lib64/lua/5.1/posix/init.lua'
    no file './posix.lua'
    no file '/usr/share/lua/5.1/posix.lua'
    no file '/usr/share/lua/5.1/posix/init.lua'
    no file '/usr/lib64/lua/5.1/posix.lua'
    no file '/usr/lib64/lua/5.1/posix/init.lua'
    no file '/tmp/flux-core/src/bindings/lua/.libs/posix.so'
    no file './posix.so'
    no file '/usr/lib64/lua/5.1/posix.so'
    no file '/usr/lib64/lua/5.1/loadall.so'
stack traceback:
    [C]: in function 'require'
    (command line):1: in main chunk
    [C]: ?
error: failed to find lua posix module in path
ERROR: t3001-mpi-personalities.t - missing test plan
ERROR: t3001-mpi-personalities.t - exited with status 1

ERROR: t3003-mpi-abort
======================

lua: (command line):1: module 'posix' not found:
    no field package.preload['posix']
    no file './posix.lua'
    no file './posix.lua'
    no file '/usr/share/lua/5.1/posix.lua'
    no file '/usr/share/lua/5.1/posix/init.lua'
    no file '/usr/lib64/lua/5.1/posix.lua'
    no file '/usr/lib64/lua/5.1/posix/init.lua'
    no file './posix.lua'
    no file '/usr/share/lua/5.1/posix.lua'
    no file '/usr/share/lua/5.1/posix/init.lua'
    no file '/usr/lib64/lua/5.1/posix.lua'
    no file '/usr/lib64/lua/5.1/posix/init.lua'
    no file '/tmp/flux-core/src/bindings/lua/.libs/posix.so'
    no file './posix.so'
    no file '/usr/lib64/lua/5.1/posix.so'
    no file '/usr/lib64/lua/5.1/loadall.so'
stack traceback:
    [C]: in function 'require'
    (command line):1: in main chunk
    [C]: ?
error: failed to find lua posix module in path
ERROR: t3003-mpi-abort.t - missing test plan
ERROR: t3003-mpi-abort.t - exited with status 1
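All four MPI test errors above reduce to the same missing dependency: the Lua posix module is not installed in the image. A hedged fix, assuming the usual CentOS packaging (lua-posix from the distro/EPEL repos, or the luaposix rock):

# CentOS: install the Lua posix bindings
yum install -y lua-posix
# or, via luarocks if that is preferred in the image
# luarocks install luaposix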

swift-t testing:

+ export TURBINE_LAUNCH_OPTIONS=--allow-run-as-root
+ TURBINE_LAUNCH_OPTIONS=--allow-run-as-root
+ swift-t -v
STC: Swift-Turbine Compiler 0.9.0
     for Turbine: 1.3.0
Using Java VM:    /ve_exaworks/bin/java
Using Turbine in: /opt/swift-t/turbine

Turbine 1.3.0
 installed:    /opt/swift-t/turbine
 source:       /tmp/build-swift-t/swift-t/turbine/code
 using CC:     /usr/local/bin/mpicc
 using MPI:    /usr/local/lib mpi "OpenMPI"
 using Tcl:    /ve_exaworks/bin/tclsh8.6
+ swift-t -E 'trace(42);'
[eabc83c3bd63:00172] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_ess_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/46R5UDPO4OP366UUGPROPGMZMY:/var/lib/docker/overlay2/l/5VEQ4HPIC7BKM365SDNWV5MDJS:/var/lib/docker/overlay2/l/UWQLT5PCSGIGF3RIRQZMSQ72X5:/var/lib/docker/overlay2/l/RLIN7OCX6BC6UXW2QMWCOTPTGP:/var/lib/docker/overlay2/l/EFXDWJ33VYUFUJ7IA7LCTPJZOH:/var/lib/docker/overlay2/l/VK3IHHBXSO6AEO72FMTCR2HM5F:/var/lib/docker/overlay2/l/P7HXKD4CIH3C7G32Q76JU55E6B:/var/lib/docker/overlay2/l/CDQQ2SO7CDESCUTHFN43JG5TM7:/var/lib/docker/overlay2/l/O4E2WHKZ23PRP'
Unexpected end of /proc/mounts line `HPM6UUIBGNROE:/var/lib/docker/overlay2/l/CQKPRDDS7FF2DIMM7CAFSE4HL2:/var/lib/docker/overlay2/l/O5AEUSZMP4K4H26VB7EPUW3FO4:/var/lib/docker/overlay2/l/PGIYWXTGVSPCPHIAYU5CDPTNQU:/var/lib/docker/overlay2/l/24L7H5XJ6G5ESILEKZAL72OVA3:/var/lib/docker/overlay2/l/67IJ7EV2FO622ED6IDE6KIYIQY:/var/lib/docker/overlay2/l/JWCFV2TVNV7HCXGUM2JYUFTVZJ:/var/lib/docker/overlay2/l/2WC4PTEMMYYVY4IGWOHRJXZ4AL:/var/lib/docker/overlay2/l/6TGSCLHEOQ67MBSRUJL77TZDIY:/var/lib/docker/overlay2/l/RZHZXEDKLM44LEF3MUQGCIZ3DP:/var/lib/do'
[eabc83c3bd63:00172] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_db_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[eabc83c3bd63:00172] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_grpcomm_pmi: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
[eabc83c3bd63:00175] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_posix: /usr/local/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[eabc83c3bd63:00175] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_mmap: /usr/local/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[eabc83c3bd63:00175] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_sysv: /usr/local/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
[eabc83c3bd63:00174] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_posix: /usr/local/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[eabc83c3bd63:00174] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_mmap: /usr/local/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[eabc83c3bd63:00174] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_shmem_sysv: /usr/local/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[eabc83c3bd63:174] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[eabc83c3bd63:175] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[6590,1],0]
  Exit code:    1
--------------------------------------------------------------------------
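The undefined opal_* symbols in the shmem components suggest two different Open MPI builds are being mixed in the conda image: the components under /usr/local/lib/openmpi appear to resolve against a different libopen-pal than the one they were built for (possibly one pulled in by the conda environment). A hedged way to confirm, using only paths that appear in the log:

# which core library do the failing components actually resolve against?
ldd /usr/local/lib/openmpi/mca_shmem_posix.so | grep -i pal
ldd "$(which mpiexec)" | grep -i pal
# does the conda environment ship its own MPI / OPAL libraries?
ls /ve_exaworks/lib | grep -iE 'mpi|pal'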
dongahn commented 2 years ago

@zekemorton and @mtitov: thank you again for adding this into our CI mix. Please see my comments -- mainly software engineering practice feedback. We want to keep our commit history in main such that someone other than us can understand where each commit came from and what problem it solves. Such practice will come in really handy for debugging when our software is used in production.

As an example, I recently added a PR to our Fluxion scheduler with that in mind, for your reference (not that it is the best PR, just a reference): https://github.com/flux-framework/flux-sched/pull/895

Let me know if you have any questions.

dongahn commented 2 years ago

FYI -- the Patch Requirements section referenced in our contribution guide documentation should be useful.

zekemorton commented 2 years ago

@dongahn I went ahead and adjusted the commit history, squashing a few commits and rewriting commit messages. I can't seem to find your feedback after pushing the changes, so I am not sure whether I addressed all of your comments. Would you mind taking another look?

The tests seem to be failing because of an issue between RP and the new version of pymongo, so we will likely need to wait for a fix before they pass again.

mtitov commented 2 years ago

RP has been released (v1.10.1) with a pymongo-related hotfix.
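For the CI images that should just mean picking up the fixed release, or pinning pymongo until it lands everywhere; for example:

# pick up the RP hotfix release
pip install 'radical.pilot>=1.10.1'
# or, as a temporary workaround, keep pymongo below the breaking major version
# pip install 'pymongo<4'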

@dongahn Dong, can we still ask for your review/approval? Or should we proceed with the merge-when-passing label on our own?

dongahn commented 2 years ago

Let's discuss this one last time this Friday before merging this in.

This could be a good lesson on best practices for how the SDK can make progress through the churn of its dependencies (e.g., the RP version and its own dependencies, such as pymongo).