NWChemEx / ParallelZone

You're travelling through another dimension, a dimension not only of CPUs and threads but of GPUs; a journey into a wondrous land whose boundaries are bandwidth limited. That's the signpost up ahead - your next stop, the ParallelZone!
https://nwchemex.github.io/ParallelZone/
Apache License 2.0

Release Image #109

Open ryanmrichard opened 1 year ago

ryanmrichard commented 1 year ago

This is the next issue after #107 (or #108 if we want to go that route). The idea is that when a PR gets merged to master, it should trigger building a release-quality image. The image needs to:

The image which results from this build will become the "base image" for downstream repositories, such as PluginPlay. From PluginPlay's perspective this new base image will serve the same role as the .github base image did for ParallelZone.
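
For concreteness, a workflow along these lines might look something like the sketch below; the registry, image name, and Dockerfile path are placeholders, not the project's actual setup.

```yaml
# Hypothetical sketch: build and push a release-quality image whenever master is updated.
name: Build release image
on:
  push:
    branches: [master]          # a merged PR shows up as a push to master
jobs:
  release-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          file: docker/release.Dockerfile    # placeholder path
          push: true
          tags: ghcr.io/nwchemex-project/release_parallelzone:latest
```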

yzhang-23 commented 1 year ago

I am thinking about how to balance image consistency (one image update triggers all image updates) against CI efficiency (keeping images fixed). I now think a good approach may be to build every repo only against the STABLE release images of the other repos, and to update all stable images only when necessary (e.g., for a major release). With this design the daily CI is very simple and fast: load the building image containing everything needed, including the compiled libraries of the other repos, then compile and test the repo. We need only one workflow to update all the images, and that workflow can even be manually triggered since it will not run very often. A failed run of the image-update workflow will reveal a code integration issue (some repo breaking other repos), but it will NOT affect the daily CI. Any comments?
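
A minimal sketch of what such a manually triggered update-all workflow could look like; the image names, registry, and Dockerfile paths are made up for illustration (registry login omitted for brevity).

```yaml
# Hypothetical sketch: one manually triggered workflow rebuilds all stable images.
name: Update stable images
on:
  workflow_dispatch:            # run on demand, e.g. around a major release
jobs:
  update-stable-images:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Rebuild in dependency order; a failure here flags an integration problem
      # without touching the daily CI.
      - name: Rebuild ParallelZone stable image
        run: |
          docker build -f docker/parallelzone.Dockerfile -t ghcr.io/nwchemex-project/stable_parallelzone:latest .
          docker push ghcr.io/nwchemex-project/stable_parallelzone:latest
      - name: Rebuild PluginPlay stable image
        run: |
          docker build -f docker/pluginplay.Dockerfile -t ghcr.io/nwchemex-project/stable_pluginplay:latest .
          docker push ghcr.io/nwchemex-project/stable_pluginplay:latest
```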

ryanmrichard commented 1 year ago

I feel like we need to break this problem down and try to build up to a solution. So at this point I am going to suggest:

  1. Add two workflows to .github which build the base images for ParallelZone.
    • One workflow should be GCC based, the other clang based.
    • The images should only contain the dependencies for ParallelZone, don't worry about other repos.
  2. Add two workflows (one for GCC, the other for clang) to ParallelZone which use the base images from the previous step to build ParallelZone and run the tests.
  3. Add two different workflows to ParallelZone which build release images (one GCC-based, one clang-based) which trigger when master is updated.
  4. Repeat step 2, but for PluginPlay, using ParallelZone's release images as the base images.
  5. Repeat step 3, but for PluginPlay.

Each of the above steps can be a separate PR and it should be possible to accomplish those PRs without breaking the current CI pipeline. That's because repos which depend on PZ or PluginPlay should still be able to build since their CI includes steps for building PZ/PluginPlay (implicitly via the CMakeLists.txt). I would recommend copy/pasting the workflows into the various repos and then modifying them as needed; don't try to have a master workflow at this stage.
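
As a concrete starting point for step 2, the per-repo build-and-test workflow could be as simple as the sketch below; the base image tag and the exact CMake invocation are assumptions, and the clang variant would differ only in which base image it pulls.

```yaml
# Hypothetical sketch of step 2: build and test ParallelZone inside the GCC base image.
name: Build and test (gcc)
on:
  pull_request:
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    container: ghcr.io/nwchemex-project/base_parallelzone_gcc:latest   # made-up tag
    steps:
      - uses: actions/checkout@v4
      - name: Configure, build, and test
        run: |
          cmake -S . -B build -DBUILD_TESTING=ON
          cmake --build build --parallel
          ctest --test-dir build --output-on-failure
```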

yzhang-23 commented 1 year ago


Several questions:

  1. For the base images with GCC and clang, do you mean the dependencies are compiled using GCC and clang, respectively?
  2. I take it the release images do not contain the dependencies necessary to build the repo, so they are different from the base (building) images, right?

I'm also planning to update the CI repo by repo with a leap-frog strategy.

ryanmrichard commented 1 year ago
  1. If you build the dependency then yes it should be built with the respective compiler. If you can just pull the dependency from a package manager it doesn't matter. The GCC vs. clang distinction applies more to what compiler will be used to build our repos.
  2. ParallelZone's release image needs to include the shared libraries it linked to, otherwise when PluginPlay tries to link to ParallelZone you're going to get linker errors. I don't fully follow your second point. The base image for ParallelZone will not be the same as the base image for PluginPlay. ParallelZone's base image will have spdlog, MPI, and MADNESS. Keeping things simple for now, PluginPlay's release image will just be ParallelZone's release image (let PluginPlay build utilities and libfort).

I don't understand what updating by a leap-frog means in this case. The update process should be a breadth-first traversal of a tree.

yzhang-23 commented 1 year ago

Yes, the release image of ParallelZone would have the shared libraries it linked to, such as MADNESS, spdlog, cereal, etc., but it won't have the build tools such as gcc and cmake. In this sense I can use the image used to build ParallelZone (with other tools/packages added when necessary) as a base to build other repos, not the release image of ParallelZone. The release image of ParallelZone, which contains the shared libraries (ParallelZone, MADNESS, spdlog, cereal, etc.), can be added as a layer to the building image of PluginPlay, since PluginPlay depends on ParallelZone. It looks like we have different definitions for the building and release images, which is fine in my eyes; I just want to get a clear picture. When I said "leap-frog" I just meant I will try to update the CI workflows repo by repo, not all in one run.

ryanmrichard commented 1 year ago

Now that Docker allows image composition here's how I would do this.

[Diagram: nested composition of the Docker images described below]

Here each box is an independent image we need to maintain. Diamonds are literal libraries which are included in the image (libraries are not shown in all cases). The nesting shows the images they derive from. So:

* the MADNESS image derives from the MPI image,

* the ParallelZone base image derives from the MADNESS and spdlog images,

* ParallelZone release derives from ParallelZone base,

* the PluginPlay base image derives from the ParallelZone base, utilities, and libfort images, and

* PluginPlay release derives from PluginPlay base.

The innermost images are parameterized on the compiler type (GCC or clang), the compiler version, the CMake version, and the version of the namesake library. Derived images inherit the parameters of their inner images. When building, say, PluginPlay base, you must then specify:

* compiler type
* compiler version
* cmake version
* utilities version
* libfort version
* parallelzone version
* spdlog version
* MADNESS version
* MPI version

Then:

* the compiler type, compiler version, cmake version, parallelzone version, MADNESS version, MPI version, and spdlog version will be used to select the ParallelZone release image,

* the compiler type, compiler version, cmake version, and utilities version will be used to select the appropriate utilities image, and

* the compiler type, compiler version, cmake version, and libfort version will be used to select the appropriate libfort image.

In practice, we have pinned the versions of all the dependencies (the versions live in the NWXCMake repo), so we do not have a combinatorial explosion. The only parameters we'll change with any frequency are the compiler type, compiler version, and cmake version.
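
To make the parameter flow concrete, here is roughly how building the PluginPlay base image could consume those parameters; the input names, tag scheme, registry, and build arguments are all illustrative.

```yaml
# Illustrative sketch: workflow inputs select the inner images and are forwarded to
# docker build as build arguments. All names and tags are placeholders.
name: Build PluginPlay base image
on:
  workflow_dispatch:
    inputs:
      compiler:             { default: gcc }
      compiler_version:     { default: '11' }
      cmake_version:        { default: '3.24' }
      parallelzone_version: { default: '1.0.0' }
jobs:
  pluginplay-base:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build on top of the matching ParallelZone image
        run: |
          # The parameters pick out which ParallelZone image to derive from.
          PARENT=ghcr.io/nwchemex-project/parallelzone:${{ inputs.compiler }}${{ inputs.compiler_version }}-cmake${{ inputs.cmake_version }}-${{ inputs.parallelzone_version }}
          docker build \
            --build-arg PARENT_IMAGE="${PARENT}" \
            -t ghcr.io/nwchemex-project/base_pluginplay:latest .
```

On the Dockerfile side, an `ARG PARENT_IMAGE` followed by `FROM ${PARENT_IMAGE}` would consume the selected parent, and the utilities and libfort versions would select their images the same way.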

yzhang-23 commented 1 year ago

Ok, now I understand. In your design, base image = dependencies, and release image = base + compiled repo.

yzhang-23 commented 1 year ago

@ryanmrichard What gcc/g++ and clang versions should we cover in the building images? The building images are currently based on Ubuntu 20.04, so the available gcc/g++ versions range from 8 to 11 (correct?).

ryanmrichard commented 1 year ago

@yzhang-23 just worry about a single version for now, but parameterize the CI on the version so we can change it later.
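
For example, the pinned version could live in a single place in the workflow (the names and values below are hypothetical) so bumping it later is a one-line change:

```yaml
# Hypothetical sketch: pin one version per toolchain, as a parameter rather than a
# hard-coded string, so it can be bumped later without touching the rest of the workflow.
name: Build base image (gcc)
on: workflow_dispatch
env:
  GCC_VERSION: '11'      # placeholder value
  CLANG_VERSION: '14'    # placeholder value
jobs:
  base-image-gcc:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the base image against the pinned compiler version
        run: docker build --build-arg GCC_VERSION=${{ env.GCC_VERSION }} -t base_parallelzone_gcc .
```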

yzhang-23 commented 1 year ago

I have redesigned the Docker image building strategy. The building image of a repo used to contain everything, including the released dependent repos. However, since we want to build and test the repo with both gcc and clang, with the original setup we would have to store two different building images for each repo: one for gcc and one for clang. That could become a storage space issue. I think Ryan's suggestion of constructing a base image containing only the build tools, but not the dependent repos, is good. With such a base image, I can load the release images compiled with gcc or clang during the process of building the repo, so we don't need two different building images for gcc and clang. I'm trying to apply this new design to the workflows. Hopefully the testing won't take too much time; I will post updates.
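
A rough sketch of that flow, with made-up image names and install prefix: the tools-only image serves both compilers, and the prebuilt dependencies are copied out of whichever release image matches.

```yaml
# Hypothetical sketch of the redesigned build: one tools-only image for both compilers,
# with the prebuilt dependencies copied out of the matching release image.
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Copy prebuilt dependencies out of the (gcc) release image
        run: |
          docker create --name deps ghcr.io/nwchemex-project/release_parallelzone:gcc
          docker cp deps:/install ./deps       # assumed install prefix inside the image
          docker rm deps
      - name: Build and test inside the tools-only image
        run: |
          docker run --rm -v "$PWD:/work" -w /work ghcr.io/nwchemex-project/buildtools:latest \
            bash -c "cmake -S . -B build -DCMAKE_PREFIX_PATH=/work/deps && cmake --build build && ctest --test-dir build"
```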

yzhang-23 commented 1 year ago

I have a workflow in hand to parse the file in NWXCMake to obtain the package version info, but I will postpone checking it in until I find a good way to conveniently handle all the versioned packages. There are still some minor issues, e.g., I cannot separate the installation of gcc and clang, and the exact version numbers of some packages are missing.
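
A parsing step along those lines might look like the snippet below; it assumes, purely for illustration, that the versions are recorded as `set(NWX_<PKG>_VERSION <value>)` lines in a known file, which may not match the actual NWXCMake layout.

```yaml
# Hypothetical sketch: read pinned versions out of a CMake file and expose them to
# later steps. The file path and variable names are guesses.
- name: Read pinned dependency versions from NWXCMake
  run: |
    get_version() {
      grep -Eo "set\($1 +[^) ]+" cmake/nwx_versions.cmake | awk '{print $2}'
    }
    echo "SPDLOG_VERSION=$(get_version NWX_SPDLOG_VERSION)"   >> "$GITHUB_ENV"
    echo "MADNESS_VERSION=$(get_version NWX_MADNESS_VERSION)" >> "$GITHUB_ENV"
```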

yzhang-23 commented 1 year ago

I wanted to run jobs within containers from images built on the fly (not pushed to any registry), but it turned out to be impossible; please see the discussions on the GitHub Actions forum. I think what I need here is a Docker container action, so we can build/test our repos with temporary Docker images that are never pushed to a registry. I have written a toy container action to test this idea and it worked. I will convert our existing build/test action into a container action and see how it works.
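
For reference, the metadata of such a container action has roughly the shape below; the input and the Dockerfile it points at are placeholders.

```yaml
# action.yml -- hypothetical sketch of a Docker container action. The image is built
# from the local Dockerfile at run time and never pushed to a registry.
name: Build and test in a temporary image
description: Builds the image on the fly and runs the repo's build/tests inside it
inputs:
  compiler:
    description: Compiler to build with (gcc or clang)
    default: gcc
runs:
  using: docker
  image: Dockerfile
  args:
    - ${{ inputs.compiler }}
```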