JCSDA / spack-stack

Run package tests with every deployment #1184

Open AlexanderRichert-NOAA opened 1 month ago

AlexanderRichert-NOAA commented 1 month ago

I suggest we add unit testing to our standard installation steps on supported platforms. This will help us catch issues early, including platform config issues, portability problems, and compiler issues. I'd love to have something in place for the 1.8.0 release, even if it's just for a handful of packages (NCEPLIBS would be a good place to start). As far as I know, none of our packages have run-time (as opposed to install-time) tests implemented, so I think it makes sense to focus on install-time tests, though that's certainly up for discussion.

I'm thinking of changing our installation procedures to something to the effect of:

# Set up environment as usual
spack concretize 2>&1 | tee log.concretize
# New step that will test packages specified in, say, setup.sh:
spack install --verbose --test root $SPACK_STACK_PACKAGES_TO_TEST 2>&1 | tee log.install_withtesting
# Install the rest of the stack as usual:
spack install --verbose 2>&1 | tee log.install
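One detail worth noting about the steps above: when a command is piped through `tee`, the shell reports `tee`'s exit status by default, so a failed test run would not stop a wrapper script. A minimal bash sketch of one way to handle that is below. Only `SPACK_STACK_PACKAGES_TO_TEST` and the log file name come from the steps above; the helper names `run_logged` and `install_with_tests`, the `SPACK` override, and the placeholder package names are hypothetical and used for illustration only.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and placeholder packages are assumptions, not spack-stack code.
set -o pipefail   # make a pipeline fail if any command in it fails, not just the last one

# Assumed to be exported by setup.sh; the default here is a placeholder, not a real list.
SPACK_STACK_PACKAGES_TO_TEST="${SPACK_STACK_PACKAGES_TO_TEST:-example-pkg-a example-pkg-b}"

# Allow a dry run by overriding the command, e.g. SPACK=echo.
SPACK="${SPACK:-spack}"

run_logged() {
  # Usage: run_logged <logfile> <command> [args...]
  # Runs the command, copies stdout and stderr to the log, and (thanks to
  # pipefail) returns non-zero when the command itself fails.
  local log="$1"
  shift
  "$@" 2>&1 | tee "$log"
}

install_with_tests() {
  # The variable is deliberately unquoted so that each package name
  # becomes a separate spec on the command line.
  run_logged log.install_withtesting \
    "$SPACK" install --verbose --test root $SPACK_STACK_PACKAGES_TO_TEST
}
```

A wrapper could then call `install_with_tests || exit 1` before the final `spack install`, so the rest of the stack is only built once the tested packages pass.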

Here is a list of packages that can easily be tested right off the bat; I'll create a PR with setup.sh and documentation updates:

Here are some packages that have testing but need some kind of fixing:

climbfuji commented 1 month ago

Let's talk about this at tomorrow's spack-stack meeting. I agree with you 100% that regular testing on tier-1 platforms is needed. We did have similar ideas and efforts in the past, and there are even open (and closed) issues for that.

edwardhartnett commented 1 month ago

This is essential.

A recent example was the parallel I/O problems that absorbed a lot of debugging time, and ended up being a problem with HDF5-1.14.0. However, running the HDF5 tests showed the problem right away.

If the tests had been run at install, an expensive debugging effort would have been saved. Countless hours were spent checking various components before this was resolved.

The entire IO stack needs to be tested on every single install.

NetCDF and HDF5 need to have their parallel tests run as well. I can work with a Slurm expert to get these tests to run out of the box on future releases of netcdf-c, and potentially HDF5.

Running the tests is the best way to find WCOSS2 portability issues. If the tests pass on my machine but fail on WCOSS2, that is a problem that needs to be solved before anything else happens. If the netCDF tests fail, just stop the bus and get off. You're not going anywhere until that's fixed.

So let's put the process in place to discover that problem as soon as we can.
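As a rough illustration of the "stop the bus" idea, the bash sketch below tests each I/O package on its own and aborts at the first failure, so the failing package is named before anything else is installed. It reuses the `spack install --test root` form proposed earlier in this thread. The function name `test_io_stack`, the `IO_PACKAGES` variable, the `SPACK` override, and the per-package log names are all hypothetical, and the default package list is just an example based on the libraries mentioned here.

```shell
#!/usr/bin/env bash
# Sketch only: names and the package list are assumptions, not an agreed procedure.
set -o pipefail   # a failing install/test must not be masked by tee

SPACK="${SPACK:-spack}"                      # override (e.g. SPACK=echo) for a dry run
IO_PACKAGES="${IO_PACKAGES:-hdf5 netcdf-c}"  # illustrative list only

test_io_stack() {
  # Install and test one package at a time; stop at the first failure.
  for pkg in $IO_PACKAGES; do
    if ! "$SPACK" install --verbose --test root "$pkg" 2>&1 | tee "log.test.$pkg"; then
      echo "Tests failed for $pkg; not installing the rest of the stack." >&2
      return 1
    fi
  done
}
```

Calling `test_io_stack || exit 1` ahead of the full install would keep a broken HDF5 or netCDF build from propagating into the rest of the stack.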