ExaWorks SDK

Idea: Build Matrix showing the status of each component on various platforms #2

Closed SteVwonder closed 1 year ago

SteVwonder commented 3 years ago

As a driver for our CI setup, we wanted to start with a build matrix we would like to see for each component.

Before we get to the build matrix, I thought it would be helpful to list out the things that could cause incompatibilities or issues with our components on various systems:

For any of the pure python components, the main things are:

For the compilable components, the other things are:

Any thoughts, feedback, omissions? Once we have converged on these lists, we can start working on a build matrix.

andre-merzky commented 3 years ago

Not sure if we want to ensure the same for spack.

I do not know Spack well enough yet to have an informed opinion, but that sounds like a reasonable request.

From our perspective, platform portability is also (beyond what is listed already) defined by the batch system and launch methods available on the target system. I am not sure if we want to capture that here though?

SteVwonder commented 3 years ago

From our perspective, platform portability is also (beyond what is listed already) defined by the batch system and launch methods available on the target system. I am not sure if we want to capture that here though?

That's a good suggestion. Suggested RMs/launchers to test against? Slurm/srun, LSF/jsrun, PBS/mpirun, PMIx/prrte, Cobalt/aprun. Any others? For any of those that are open source, it might be possible to test in GH Actions, but for the vendor-specific RMs, we will probably need to wait for ECP CI integration.

dongahn commented 3 years ago

@SteVwonder and I chatted about this today. We will break this down to:

1) Coverage table for Cloud CI
2) Coverage table for HPC CI

We will also have a notion of "base testing coverage" that we should tackle first with the highest priority (e.g., one Linux distro out of the N possibilities as the first platform, one Python version out of 3.6, 3.7, 3.8, and 3.9 as the first Python version, one version of each of the 4 technologies, etc.).

Before our all-hands meeting, it would be good to incorporate feedback from each of the 4 technology teams into that.

SteVwonder commented 3 years ago

Another distinction that we made is that the Cloud CI should be targeted at weeding out the generic things like incompatible dependencies between components or incompatible interfaces between components/dependencies. This will free up the HPC CI for things you can't (easily) test in the cloud, like performance, IB/RDMA support, and vendor-specific software (e.g., launchers, MPIs, and compilers).

We want to start with the minimal working example of the CI and then gradually expand the coverage. I tried to capture this in the table.

| Test Dimension | Tier 1 | Tier 2 | Tier 3 |
| --- | --- | --- | --- |
| OS | Linux | Mac | Windows |
| Python | 3.6 | 3.9 | 3.7, 3.8 |
| Architecture | x86 | PPC | Arm |
| Linux Distro (or equivalent) | CentOS 7 | CentOS 8 | Ubuntu 20.04 |
| Compiler | gcc@4.8.5 | clang@3.4.2 | gcc@10, clang@11 |
| MPI | OpenMPI | MPICH | MVAPICH2 |

So a first minimal example could be an x86 Docker image based on CentOS 7 with gcc@4.8.5 (which should be the default), Python 3.6, and OpenMPI (whatever version comes via yum).
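
A minimal sketch of what that Tier 1 base image could look like (the exact yum package names and the OpenMPI install paths are assumptions about CentOS 7 defaults, not a tested recipe):

```dockerfile
# Sketch of a Tier 1 base image: CentOS 7, the distro-default gcc (4.8.5),
# Python 3.6, and whatever OpenMPI version the distro repos provide.
FROM centos:7

RUN yum install -y \
        gcc gcc-c++ make \
        python3 python3-devel python3-pip \
        openmpi openmpi-devel \
    && yum clean all

# The distro OpenMPI packages install under /usr/lib64/openmpi (assumed);
# put the compilers/launchers on PATH so mpicc/mpirun work without modules.
ENV PATH=/usr/lib64/openmpi/bin:$PATH \
    LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
```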

That gives us a base image to start with; from there, we should install the various components in the SDK: Flux, Radical, Parsl, Swift/T. Of course, all of these have versions of their own. To start, my recommendation would be that we test the latest release tags of all components, then expand to include the latest master/main/devel branches for each (2 different combinations, not 28).

Once all of the components are installed, we should execute a simple "hello world" test for each individual component and for each of the working pair-wise integrations. I believe the only pairwise integration ready for testing is Radical + Flux, but there may be additional integrations working that I'm unaware of. The Parsl + Flux integration is implemented but hasn't been upstreamed yet.

So from a GitHub CI build matrix perspective, I think we can start with just a single runner based on a single Docker image (described above). Within that, we can execute a single script that tests the various pair-wise integrations and reports back the results of each.
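
A sketch of what that single-runner workflow might look like; the container image name and the test-script paths below are hypothetical placeholders, not existing repo contents:

```yaml
# .github/workflows/ci.yml -- single-runner starting point (names are placeholders)
name: sdk-ci

on: [push, pull_request]

jobs:
  tier1:
    runs-on: ubuntu-latest
    # Hypothetical image built from the CentOS 7 Dockerfile sketched above,
    # with the SDK components pre-installed.
    container: exaworks/sdk-base:centos7
    steps:
      - uses: actions/checkout@v2
      - name: Component hello-world tests
        run: ./ci/test_components.sh    # hypothetical script, one check per component
      - name: Pair-wise integration tests
        run: ./ci/test_integrations.sh  # hypothetical script, e.g. Radical + Flux
```

Expanding coverage later would then just mean growing this into a `strategy.matrix` over the Tier 2/3 dimensions from the table above.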

andre-merzky commented 3 years ago

That matrix approach works from our end and makes sense IMHO, as does the CentOS starting point. Some minor comments for Radical:

SteVwonder commented 3 years ago

Per our discussion on the weekly call: once CentOS 8 reaches end of life, we plan to switch to both Rocky Linux (or equivalent) and CentOS Stream, but we need to determine a plan for what to do if CentOS Stream breaks our CI builds (and it is a distro issue, not an issue on our end).