NOAA-GSL / ExascaleWorkflowSandbox


Add CI test for Docker container #20

Closed christopherwharrop-noaa closed 1 year ago

christopherwharrop-noaa commented 1 year ago

This PR adds a CI test for ExaWorks installation scripts using experimental Docker actions that should allow caching of container layers at GH.

As an aside, this PR also adds python linters, formatters, and type checkers to the Spack install script to facilitate better testing when writing example scripts in the container.

Example Python scripts can, of course, still be linted outside the container, but it's helpful to allow for it to be done inside as well.

One challenging aspect of this PR was splitting the ExaWorks installation into smaller pieces. Normally, ExaWorks is installed with a single Spack package install in a Spack environment. Unfortunately, that package takes so long to install that it exceeds the time limit for GitHub Actions jobs. This required splitting the installation of ExaWorks into its constituent pieces: flux, radical, stc, and parsl.

A prerequisite to doing that is installing a newer version of gcc, which has to be built against the default compiler. To make matters more complicated, to avoid inducing many duplicate rebuilds of the gcc compiler for each of the constituent pieces, gcc needs to be built twice in a base environment -- once using the default system compiler, and then again using itself. This removes the environment's Spack package dependencies on the system default compiler.

The end result is a series of installation scripts. A base installation script installs the gcc compiler, python, and some python utilities. Then, the flux, radical, stc, and parsl installation scripts use that base environment to install each of their respective parts. A final script installs all of the parts into a final exaworkssdk Spack environment at the end. Docker containers are used to test each install script, and a final multi-stage Docker container uses Spack mirrors created from each of the constituent Docker containers to build up the final environment.
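To illustrate the shape of that approach, here is a minimal hypothetical sketch of a multi-stage Dockerfile in which each stage installs one constituent package and exports a Spack build cache for the final stage to consume. All stage names, paths, and package specs here are illustrative, not the repository's actual files, and the exact Spack buildcache commands vary by Spack version:

```dockerfile
# Base stage: Spack plus a bootstrapped gcc (built once with the system
# compiler, then rebuilt with itself) and python utilities.
FROM ubuntu:20.04 AS base
# ... install Spack, bootstrap gcc, install python and utilities ...

# One stage per constituent piece; each pushes its binaries to a local
# build cache directory that doubles as a Spack mirror.
FROM base AS flux
RUN spack install flux-sched && \
    spack buildcache push --unsigned /mirrors/flux

# Final stage: pull pre-built binaries from each stage's mirror
# instead of rebuilding everything from source.
FROM base AS final
COPY --from=flux /mirrors/flux /mirrors/flux
RUN spack mirror add flux /mirrors/flux && \
    spack install --cache-only exaworks
```

The key property is that no single stage has to rebuild what an earlier stage already compiled, which is what brings each CI job under the time limit.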

christopherwharrop-noaa commented 1 year ago

Very annoyingly, it looks like the build time of the CentOS7 container is longer than the 6 hour maximum job length Github Actions allows. I can't think of a good way to break this down into smaller CI jobs since the build is really not divisible. The only way to get this working is to spin up a self-hosted runner in the cloud and use a more powerful machine. But I'm not sure how to do that given those resources require funds and a project, etc. @christinaholtNOAA / @venitahagerty have any ideas? The Ubuntu build also just BARELY made it with only 27 minutes to spare.

kirklholub commented 1 year ago

I have posted this question to GH support to see if the time limit can be increased.

--Kirk


christopherwharrop-noaa commented 1 year ago

Thanks @kirklholub. It's my understanding this is a hard limit, but if we can get an increase, that would be great. The actions I used supposedly cache the container layers, so in theory we would only need to run this again when the Docker image layers change. I haven't tried that in practice yet, though. The basic problem is that I'm doing a Spack install of exaworks inside the container, and two of the dependencies -- rust and gcc -- take an eternity to build. I think the rust build alone is ~3 hours and gcc might be 1 hour. It's pretty ridiculous how long it takes to build this thing.
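For reference, layer caching of this kind is typically wired up with the `docker/build-push-action` and its GitHub Actions cache backend. The fragment below is a hypothetical sketch (the Dockerfile path and action versions are assumptions, not taken from this repository):

```yaml
# Hypothetical workflow fragment: reuse unchanged image layers between
# CI runs via the GitHub Actions cache (type=gha).
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v4
        with:
          context: .
          file: docker/Dockerfile.ubuntu20   # assumed path
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

With `mode=max`, intermediate layers of a multi-stage build are also cached, which matters here since most of the expensive work happens in intermediate stages.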

christopherwharrop-noaa commented 1 year ago

@kirklholub - I have an idea for splitting the containers into pieces that might work. I'm going to try it out locally first. If it works, I'll push an update to this PR and see how much the time is reduced.

christopherwharrop-noaa commented 1 year ago

@kirklholub - My attempt to split up the installation into smaller pieces and use a multi-stage build that creates and passes Spack build caches between stages seems to have worked for Ubuntu. See checks above. The longest piece takes a little over 3 hours.

The Ubuntu piece did (barely) succeed last time as well. Now I need to repeat this for CentOS7. Given that the longest piece has almost 3 hours of breathing room, I'm reasonably confident it will work. Another push to update the workflow with CentOS7 containers in addition to the Ubuntu20 ones will be coming soon.

christopherwharrop-noaa commented 1 year ago

@kirklholub - It took a considerable amount of work and time, but I was able to split the installation of ExaWorks into multiple pieces and use a multi-stage Docker container, along with Spack build caches and mirrors, to get the final container built correctly. All told, it took ~387 minutes to run the CI for CentOS. The maximum single GHA job time was 196 minutes, which is well below the GHA job limit of 360 minutes. The vast majority of that time was spent building gcc (twice) and rust.

christopherwharrop-noaa commented 1 year ago

@venitahagerty - Thank you for your review. I can see how the usage of Ubuntu for the CentOS CI jobs would be confusing. When you specify a GitHub Actions job, you have to tell it what OS you want that GHA job to run on. Since this CI is just building containers, it doesn't really matter what OS is used to build them. To keep things simple, I just went with vanilla Ubuntu for the OS of all the CI jobs. What that's saying is: create a GitHub runner with an Ubuntu OS, and then use it to run the docker build commands to build the CentOS containers. Does that help?
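In workflow terms, the distinction looks like this hypothetical fragment, where the job and file names are illustrative only. `runs-on` selects the runner's OS, while the OS inside the image being built is determined entirely by the Dockerfile:

```yaml
jobs:
  build-centos7:
    runs-on: ubuntu-latest   # OS of the GitHub-hosted runner
    steps:
      - uses: actions/checkout@v3
      - name: Build CentOS 7 container on the Ubuntu runner
        run: docker build -f docker/Dockerfile.centos7 -t sandbox:centos7 .
```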

venitahagerty commented 1 year ago

Yes, thank you