ExaWorks SDK

Idea: Build Matrix showing the status of each component on various platforms #2

Closed SteVwonder closed 1 year ago

SteVwonder commented 3 years ago

As a driver for our CI setup, we wanted to start with a build matrix we would like to see for each component.

Before we get to the build matrix, I thought it would be helpful to list out the things that could cause incompatibilities or issues with our components on various systems:

For any of the pure python components, the main things are:

For the compilable components, the other things are:

Any thoughts, feedback, omissions? Once we have converged on these lists, we can start working on a build matrix.

andre-merzky commented 3 years ago

Not sure if we want to ensure the same for spack.

I do not know Spack well enough yet to have an informed opinion, but that sounds like a reasonable request.

From our perspective, platform portability is also (beyond what is listed already) defined by the batch system and launch methods available on the target system. I am not sure if we want to capture that here though?

SteVwonder commented 3 years ago

From our perspective, platform portability is also (beyond what is listed already) defined by the batch system and launch methods available on the target system. I am not sure if we want to capture that here though?

That's a good suggestion. Suggested RMs/launchers to test against? Slurm/srun, LSF/jsrun, PBS/mpirun, PMIx/prrte, Cobalt/aprun. Any others? For any of those that are open source, it might be possible to test in GH Actions, but for the vendor-specific RMs, we will probably need to wait for ECP CI integration.

dongahn commented 3 years ago

@SteVwonder and I chatted about this today. We will break this down to:

1) Coverage table for Cloud CI
2) Coverage table for HPC CI

We will also have a notion of "base testing coverage" that we should tackle first with the highest priority (e.g., one Linux distro out of the N possibilities as the first platform, one Python version out of 3.6, 3.7, 3.8, and 3.9 as the first Python version, one version of each of the 4 technologies, etc.).

Before our all-hands meeting, it would be good to incorporate feedback from each of the 4 technology teams into that.

SteVwonder commented 3 years ago

Another distinction that we made is that the Cloud CI should be targeted at weeding out the generic things like incompatible dependencies between components or incompatible interfaces between components/dependencies. This will free up the HPC CI for things you can't (easily) test in the cloud, like performance, IB/RDMA support, and vendor-specific software (e.g., launchers, MPIs, and compilers).

We want to start with the minimal working example of the CI and then gradually expand the coverage. I tried to capture this in the table.

| Test Dimension | Tier 1 | Tier 2 | Tier 3 |
| --- | --- | --- | --- |
| OS | Linux | Mac | Windows |
| Python | 3.6 | 3.9 | 3.7, 3.8 |
| Architecture | x86 | PPC | Arm |
| Linux Distro (or equivalent) | CentOS 7 | CentOS 8 | Ubuntu 20.04 |
| Compiler | gcc@4.8.5 | clang@3.4.2 | gcc@10, clang@11 |
| MPI | OpenMPI | MPICH | MVAPICH2 |

So a first minimal example could be an x86 Docker image based on CentOS 7 with gcc@4.8.5 (which should be the default), Python 3.6, and OpenMPI (whatever version comes via yum).
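
A minimal sketch of what that Tier 1 base image could look like (the exact yum package names and the OpenMPI install paths are assumptions about CentOS 7 defaults, not a tested recipe):

```dockerfile
# Sketch of a Tier 1 base image: CentOS 7, the distro-default gcc (4.8.5),
# Python 3.6, and whatever OpenMPI version the distro repos provide.
FROM centos:7

RUN yum install -y \
        gcc gcc-c++ make \
        python3 python3-devel python3-pip \
        openmpi openmpi-devel \
    && yum clean all

# The distro OpenMPI packages install under /usr/lib64/openmpi (assumed);
# put the compilers/launchers on PATH so mpicc/mpirun work without modules.
ENV PATH=/usr/lib64/openmpi/bin:$PATH \
    LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
```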

That gives us a base image to start with; from there, we should install the various components in the SDK: Flux, Radical, Parsl, Swift/T. Of course, all of these have versions of their own. To start, my recommendation would be that we test the latest release tags of all components, then expand to include the latest master/main/devel branches for each (2 different combinations, not 28).

Once all of the components are installed, we should execute a simple "hello world" test for each individual component and for each of the working pair-wise integrations. I believe the only pairwise integration ready for testing is Radical + Flux, but there may be additional integrations working that I'm unaware of. The Parsl + Flux integration is implemented but hasn't been upstreamed yet.

So from a GitHub CI build matrix perspective, I think we can start with just a single runner based on a single Docker image (described above). Within that, we can execute a single script that tests the various pair-wise integrations and reports back the results of each.
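
A sketch of what that single-runner workflow might look like; the container image name and the test-script paths below are hypothetical placeholders, not existing repo contents:

```yaml
# .github/workflows/ci.yml -- single-runner starting point (names are placeholders)
name: sdk-ci

on: [push, pull_request]

jobs:
  tier1:
    runs-on: ubuntu-latest
    # Hypothetical image built from the CentOS 7 Dockerfile sketched above,
    # with the SDK components pre-installed.
    container: exaworks/sdk-base:centos7
    steps:
      - uses: actions/checkout@v2
      - name: Component hello-world tests
        run: ./ci/test_components.sh    # hypothetical script, one check per component
      - name: Pair-wise integration tests
        run: ./ci/test_integrations.sh  # hypothetical script, e.g. Radical + Flux
```

Expanding coverage later would then just mean growing this into a `strategy.matrix` over the Tier 2/3 dimensions from the table above.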

andre-merzky commented 3 years ago

That matrix approach works from our end and makes sense IMHO, as does the CentOS starting point. Some minor comments for Radical:

SteVwonder commented 3 years ago

Per our discussion on the weekly call: once CentOS 8 reaches end of life, we plan to switch to both Rocky Linux (or equivalent) and CentOS Stream, but we need to determine a plan for what to do if CentOS Stream breaks our CI builds (and it is a distro issue, not an issue on our end).