earth-system-radiation / rte-rrtmgp

RTE+RRTMGP is a set of codes for computing radiative fluxes in planetary atmospheres.
BSD 3-Clause "New" or "Revised" License
74 stars 67 forks source link

Update CI #200

Closed skosukhin closed 1 year ago

skosukhin commented 1 year ago
  1. Fixes caching of RFMIP files. Currently, the files are cached but not used: each job still downloads the files.
  2. Installs zstd on top of the container images, which makes CI and Containerized-CI use the same cache (see here why).
  3. Introduces caching of the Conda environment. This saves ~500MB of incoming traffic per job (~4x500MB per workflow run).
  4. Removes entries from environment.yml that are not needed to run the CI jobs. This reduces the cache file from ~500MB to ~280MB. I guess we can reduce it further if we switch to apt-get install and pip install.
  5. Replaces custom installation of HDF5, NetCDF-C and NetCDF-Fortran in the CI jobs with Ubuntu package libnetcdff-dev, which works because Fortran module files for Gfortran 9 and Gfortran 10 are compatible.
  6. Drops installation of Gfortran in doc-deployment.yml because it seems redundant to me but I might be wrong: I haven't thoroughly compared the results before and after the change.
  7. Makes the environment module manipulations on Daint (self-hosted-ci.yml) easier (hopefully). The exact mechanism is used in ICON development.
  8. Switches to a newly installed Python environment on Daint, which saves cumbersome environment manipulations before we can start using it.
  9. Introduces general refactoring of the CI configurations: they look similar to each other as much as possible, which is supposed to make it easier to understand what we test.

Tip: it might happen at some point that the cached files (i.e. RFMIP or Conda) do not match changes in the code and the workflows will start failing because of that. Some time ago, it was not possible to update/remove the cache files. The only way to tell the jobs that they should not use the existing cache but download the files on their own was to change the key. Now, we can easily remove invalid/obsolete caches here.

Recommendations on the containers:

  1. Intel container should source /opt/intel/oneapi/setvars.sh (e.g. in ~/.profile) so that the respective CI jobs don't have to do that over and over again.
  2. The containers should have zstd installed for the same reason.
  3. ~It makes sense to reduce the size of the containers by appending && rm -rf /var/lib/apt/lists/* to the last apt-get install command of a container.~ (the reduction in size is negligible)
  4. ~It's a good general practice to create a non-root user in the container and switch to it. In my opinion, it makes sense to adopt that.~ (see https://docs.github.com/en/actions/creating-actions/dockerfile-support-for-github-actions#user)
RobertPincus commented 1 year ago

@skosukhin This is wonderful, thanks very much. On the containers, if you're able to open a pull request against https://github.com/earth-system-radiation/containers I'm happy to consider it. I can make a branch for you to target if you want.

RobertPincus commented 1 year ago

@skosukhin You'll see that everything now runs on Ubunutu 22.04 and that I've followed your lead and use libnetcdff-dev installed via apt-get in the gfortran testing via Github actions

skosukhin commented 1 year ago

This now depends on https://github.com/earth-system-radiation/containers/pull/9.

skosukhin commented 1 year ago

This also adds testing with ifx. The compiler flags might require some adjustments though.

skosukhin commented 1 year ago

Several more changes:

  1. I have to take https://github.com/earth-system-radiation/rte-rrtmgp/pull/200#discussion_r1064809323 back: it's a bad idea to set variables in the matrix if it's possible not to do that. I have stumbled upon several unwanted side effects, which were not so straightforward to workaround. So, I suggest the following policy for matrix and environment variables: don't use them if possible. For example, if an environment variable is used in one step only, it should not be set. Instead, the value should be used in the step literally. If an environment variable has to be set but all jobs need the same value, it should be set directly, without extra matrix variables.
  2. The Self-hosted CI workflow is now marked green in the Actions although cce-gpu-openmp expectedly fails. The workflow is supposed to be red if any other job fails. Note that commits and merge requests are still marked red.
  3. A couple of Python scripts now make several attempts (instead of just one) to download files before failing. It looks like the server rejects connections when there are too many of them, which sometimes happens when multiple CI jobs are started at once.
RobertPincus commented 1 year ago

Several more changes:

  1. A couple of Python scripts now make several attempts (instead of just one) to download files before failing. It looks like the server rejects connections when there are too many of them, which sometimes happens when multiple CI jobs are started at once.

This is great and addresses #107.