> Not sure if we want to ensure the same for spack.
I don't know spack well enough yet to have an informed opinion, but that sounds like a reasonable request.
From our perspective, platform portability is also (beyond what is listed already) defined by the batch system and launch methods available on the target system. I am not sure if we want to capture that here though?
> From our perspective, platform portability is also (beyond what is listed already) defined by the batch system and launch methods available on the target system. I am not sure if we want to capture that here though?
That's a good suggestion. Suggested RMs/launchers to test against: Slurm/srun, LSF/jsrun, PBS/mpirun, PMIx/prrte, Cobalt/aprun. Any others? For any of those that are open source, it might be possible to test in GH Actions, but for the vendor-specific RMs, we will probably need to wait for ECP CI integration.
@SteVwonder and I chatted about this today. We will break this down to:
1) Coverage table for Cloud CI
2) Coverage table for HPC CI
Also, we will have a notion of "base testing coverage" that we should do first, with highest priority (e.g., one Linux distro out of the N possibilities as the first platform, one Python version out of 3.6, 3.7, 3.8, and 3.9 as the first Python version, one version of each of the 4 technologies, etc.).
Before our all-hands meeting, it would be good to incorporate your feedback from each of the 4 technologies into those tables.
Another distinction that we made is that the Cloud CI should be targeted at weeding out the generic things like incompatible dependencies between components or incompatible interfaces between components/dependencies. This will free up the HPC CI for things you can't (easily) test in the cloud, like performance, IB/RDMA support, and vendor-specific software (e.g., launchers, MPIs, and compilers).
We want to start with a minimal working example of the CI and then gradually expand the coverage. I tried to capture this in the table below.
Test Dimension | Tier 1 | Tier 2 | Tier 3 |
---|---|---|---|
OS | Linux | Mac | Windows |
Python | 3.6 | 3.9 | 3.7, 3.8 |
Architecture | x86 | PPC | Arm |
Linux Distro | CentOS 7 | CentOS 8 (or equivalent) | Ubuntu 20.04 |
Compiler | gcc@4.8.5 | clang@3.4.2 | gcc@10, clang@11 |
MPI | OpenMPI | MPICH | MVAPICH2 |
So a first minimal example could be an x86 docker image based on CentOS 7 with gcc@4.8.5 (which should be the default), Python 3.6, and OpenMPI (whatever version comes via yum).
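As a rough sketch of that starting point (the package names below are assumptions about the stock CentOS 7 repositories, not a tested recipe), the base toolchain could be sanity-checked with something like:

```sh
# Hypothetical check of the Tier-1 base environment inside a stock centos:7
# container; package names and versions are assumptions, not a tested recipe.
docker run --rm centos:7 /bin/bash -c '
  set -e
  yum install -y gcc gcc-c++ make python3 python3-devel openmpi openmpi-devel
  gcc --version                        # CentOS 7 default should be 4.8.5
  python3 --version                    # CentOS 7 default should be 3.6.x
  rpm -ql openmpi-devel | grep mpicc   # confirm the yum-provided OpenMPI wrappers exist
'
```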
That gives us a base image to start with; from there we should install the various components in the SDK: Flux, Radical, Parsl, and Swift/T. Of course, all of these have versions of their own. To start, my recommendation would be that we test all of the latest release tags of the components, then expand to include all of the latest `master`/`main`/`devel` branches for each (2 different combinations, not 2⁸).
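For the pure-Python components, a rough sketch of those two combinations might look like the following (the PyPI names, GitHub URLs, and branch names are illustrative assumptions, and the compiled components such as Flux and Swift/T would need their own build steps):

```sh
# Sketch only: the two version combinations, not the full cross-product.
# Package, repository, and branch names are illustrative assumptions.

# Combination 1: latest release tags (e.g., from PyPI)
pip install parsl radical.pilot

# Combination 2: latest default development branches
pip install "git+https://github.com/Parsl/parsl.git@master"
pip install "git+https://github.com/radical-cybertools/radical.pilot.git@devel"
```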
Once all of the components are installed, we should execute a simple "hello world" test for each individual component and for each of the working pair-wise integrations. I believe the only pairwise integration ready for testing is Radical + Flux, but there may be additional integrations working that I'm unaware of. The Parsl + Flux integration is implemented but hasn't been upstreamed yet.
So from a GitHub CI build matrix perspective, I think we can start with just a single runner based on a single docker image (described above). Within that we can execute a single script that tests the various pair-wise integrations and reports back the results of each.
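A minimal sketch of such a driver script is below; the per-component and pair-wise "hello world" commands are placeholders for each team to fill in, and the only assumption is that each test can be expressed as a command that exits non-zero on failure.

```sh
#!/bin/bash
# Sketch of a single CI driver: run every hello-world test, record pass/fail,
# and report all results rather than stopping at the first failure.
# The ./tests/*.sh scripts are placeholders, not real invocations.

declare -A results

run_test() {
  local name="$1"; shift
  if "$@" > "logs/${name}.log" 2>&1; then
    results["$name"]=PASS
  else
    results["$name"]=FAIL
  fi
}

mkdir -p logs

# Individual components
run_test flux    ./tests/flux_hello.sh
run_test parsl   ./tests/parsl_hello.sh
run_test radical ./tests/radical_hello.sh
run_test swift-t ./tests/swift_t_hello.sh

# Pair-wise integrations that are ready for testing
run_test radical+flux ./tests/radical_flux_hello.sh

# Report everything, then fail the job if any test failed
rc=0
for name in "${!results[@]}"; do
  echo "${name}: ${results[$name]}"
  if [[ ${results[$name]} == FAIL ]]; then rc=1; fi
done
exit $rc
```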
That matrix approach works from our end and makes sense IMHO, as does the CentOS starting point. Some minor comments for Radical:
- `master` and `devel` are identical for us.

Per our discussion on the weekly call, once CentOS 8 dies, we plan to switch to both Rocky Linux (or equivalent) and CentOS Stream, but we need to determine a plan for what we do if CentOS Stream breaks our CI builds (and it is a distro issue, not an issue on our end).
As a driver for our CI setup, we wanted to start with a build matrix we would like to see for each component.
Before we get to the build matrix, I thought it would be helpful to list out the things that could cause incompatibilities or issues with our components on various systems:
For any of the pure Python components, the main things are:
For the compilable components, the other things are:
- Compiler and version: gcc (4.8.5+) and clang (3.4.2+)
In terms of packaging, we will want to ensure that every component has its latest version in Spack and conda-forge. For Spack, we will want to make sure that doing `spack install <package-name>` works out of the box - i.e., the package is specified sufficiently that the concretizer doesn't select a broken combination of dependencies. Since Spack packages are constantly changing and Spack prefers the latest versions of dependencies, an update to a dependency package can break the building of a component with the default specification. For conda, we will want to ensure that all of the components can be installed within the same conda environment without dependency conflicts. Not sure if we want to ensure the same for spack.

Any thoughts, feedback, or omissions? Once we have converged on these lists, we can start working on a build matrix.
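For concreteness, those packaging checks could eventually become CI steps along these lines (a sketch only; the Spack and conda-forge package names shown are assumptions and would need to be confirmed for each component):

```sh
# Sketch only: verify "out of the box" installs for the SDK components.
# Package names are assumptions and may differ from the real Spack/conda-forge names.

# Spack: each component should concretize and build with no extra spec
spack install flux-core
spack install py-parsl

# Conda: all components should be co-installable in a single environment
conda create -y -n sdk-packaging-test -c conda-forge flux-core parsl
```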