Open AlexanderRichert-NOAA opened 4 months ago
Let's talk about this at tomorrow's spack-stack meeting. I agree with you 100% that regular testing on tier-1 platforms is needed. We did have similar ideas and efforts in the past, and there are even open (and closed) issues for that.
This is essential.
A recent example was the parallel I/O problems that absorbed a lot of debugging time, and ended up being a problem with HDF5-1.14.0. However, running the HDF5 tests showed the problem right away.
If the tests had been run at install, an expensive debugging effort would have been saved. Countless hours were spent checking various components before this was resolved.
The entire IO stack needs to be tested on every single install.
NetCDF and HDF5 need to have their parallel tests run as well. I can work with a Slurm expert to get these tests to run out of the box on future releases of netcdf-c, and potentially HDF5.
Running the tests is the best way to find WCOSS2 portability issues. If the tests pass on my machine, but fail on WCOSS2, that is a problem that needs to be solved, before anything else happens. If the netCDF tests fail, just stop the bus and get off. You're not going anywhere until that's fixed.
So let's put the process in place to discover that problem as soon as we can.
I suggest we add unit testing to our standard installation steps on supported platforms. This will help us catch various issues early, including platform config issues, portability problems, compiler issues, and so on. I'd love to have something in place for the 1.8.0 release, even if it's just for a handful of packages (NCEPLIBS would be a good place to start). As far as I know, we don't have run-time (as opposed to install-time) tests implemented for any of our packages, so I think it makes sense to focus on install-time tests, though that's certainly up for discussion.
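To make the install-time testing idea concrete, here is a rough sketch of what a wrapper might look like. It assumes the tests are triggered through Spack's `--test=root` option to `spack install`, and that a failing package should halt everything after it. The function names and the example specs (`hdf5`, `netcdf-c`) are illustrative assumptions, not anything that exists in spack-stack today.

```python
import subprocess
from typing import Callable, Iterable, List


def build_install_command(spec: str, test_mode: str = "root") -> List[str]:
    """Return the argv for installing one spec with install-time tests enabled.

    `--test=root` runs tests only for the named package; `--test=all` also
    runs them for its dependencies.
    """
    if test_mode not in ("root", "all"):
        raise ValueError(f"unsupported test mode: {test_mode}")
    return ["spack", "install", f"--test={test_mode}", spec]


def install_with_tests(
    specs: Iterable[str],
    run: Callable = subprocess.run,
) -> List[str]:
    """Install each spec in order and stop at the first failure.

    `run` is injectable so the logic can be exercised without Spack present.
    Returns the specs that installed (and tested) successfully.
    """
    installed: List[str] = []
    for spec in specs:
        result = run(build_install_command(spec))
        if result.returncode != 0:
            # A failed build or failed test blocks everything downstream.
            raise RuntimeError(f"install-time tests failed for {spec}")
        installed.append(spec)
    return installed
```

Calling `install_with_tests(["hdf5", "netcdf-c"])` on a platform would install the I/O stack bottom-up and raise as soon as one package's tests fail, so nothing gets built on top of a broken library. Whether we'd want `root` or `all` as the default is an open question.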
I'm thinking of changing our installation procedures to something to the effect of:
High-priority packages for EMC:
Here is a list of packages that can easily be tested right off the bat; I'll create a PR with setup.sh and documentation updates:
Here are some packages that have testing but need some kind of fixing: