ACCESS-NRI / dev_coupling


Running CIME DATM-MOM6-CICE6-WW3 compset in Docker container #1

Closed: dougiesquire closed this issue 1 year ago

dougiesquire commented 1 year ago

Mariana Vertenstein (CESM Software Engineering Group Head, NCAR) provided the following instructions for running CESM with CMEPS/CDEPS within a container:


So when you want to use CMEPS and CDEPS you will need a workflow. I am working to create a container based application using CESM and the CIME Case Control System that will demonstrate what you need. But in the meantime - it would be really helpful if we could have a quick chat so that I can clarify some issues. Denise and I are experts in the caps and Brian is the expert in terms of containers. So given that I think we can help get you familiar with the type of examples you want to see.

Brian (cc'd here) has provided the container application and the directions for using the container. To see the code base illustrating the CESM usage of CMEPS/CDEPS - you can simply do the following:

1) Download & Install Docker desktop (Mac/Win) (If you're using Linux, there are some gotchas depending on version -- so let us know before you go on)

2) Once Docker is running, get the image we'll use ('cesm_common_preview'). The name is a bit odd because this version is still new; for CESM 2.3+: docker pull escomp/cesm_common_preview

3) Launch the container - I'm showing the terminal / command line version here, but you can use a GUI if needed: docker run -it --rm -v ~/my_directory:/home/user escomp/cesm_common_preview

What this command does is it runs docker ('docker run') in an 'interactive, TTY' mode ('-it'), with removal of the container ('--rm') afterwards -- more on that in a second -- and then it mounts a directory from your local system ('~/my_directory') at /home/user ('-v ~/my_directory:/home/user'), and then the rest is just the name of the container ('escomp/cesm_common_preview'). There is one curiosity - the '--rm'. Basically, when you launch an image of a container, it creates an instance of that container, including a writeable file system - e.g., you can modify the stuff inside that image. We don't care to hold on to any accidental changes, since we're doing everything via a mount point, so we're just going to remove this instance of the container afterwards. It doesn't delete the image, and won't change your files, since they're mapped in. In short, don't stress about it - it just saves you some disk space.

4) Getting the code: You should have a command prompt inside the container, so you can just do:

git clone -b cesm2_3_beta09 https://github.com/ESCOMP/CESM
cd CESM
./manage_externals/checkout_externals -v -o

This should give you a copy of CESM. You'll need, of course, to accept an SVN key for some of the code, but this is a one-time thing (unless you mount in a different directory and check out CESM again). From here, you can set up a case and build it, and it all should just work.

5) Looking at the code

To see the CESM NUOPC driver: cd components/cmeps/cesm/driver/ and look at esmApp.F90 and esm.F90 for the CESM driver - there are two layers since we support multi-instance functionality.

To see CMEPS: cd components/cmeps/mediator. Note that to configure CMEPS for CESM, the CIME/CCS scripting is in cmeps/cime_config (that is what I'd love to go over in a chat so you can understand how we configure CMEPS).

To see the prognostic atm cap for CAM: cd components/atm/cam/src/cpl/nuopc. There are only two files here - atm_comp_nuopc is for the general NUOPC phase definitions, and atm_import_export is the code that does the translation between CAM and NUOPC data structures.

To see the CDEPS code base for DATM: cd components/cdeps/datm. atm_comp_nuopc is the general cap code, but there are also separate modules in place for the different data modes that are used. For JRA forcing you would use datm_datamode_jra_mod.F90.

6) Creating cases:

A full data model configuration (DATM, DICE, DOCN, DROF), with each data model interpolating forcing data from different streams to the model grid:

cd cime/scripts
./create_newcase --case atest --compset A --res f19_g17_rx1 --machine container
cd atest
./case.setup
./case.build
./case.submit

This is the simple test I think you want to look at.

A case using DATM, MOM6, CICE6 and WW3:

cd cime/scripts
./create_newcase --case gtest --compset GMOM_JRA_WD --res T62_g16 --machine container --run-unsupported
cd gtest

Step (6) is the use of the CIME Case Control System (see http://esmci.github.io/cime/versions/master/html/index.html).

The following is also from Brian: One thing to be aware of is that if they're on a larger system (more cores), then you CAN still run... but by default, Docker only gives 64 MB to the /dev/shm partition where MPI often tracks its ranks. With MPICH I tested 8 ranks fine out of the box, and seemingly that's fine for OpenMPI too, but with Intel MPI I needed more. It's just an additional command-line argument that needs to be specified if you're running on a workstation with a lot of cores, but it is a known thing to be aware of.
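The extra command-line argument Brian refers to is presumably Docker's --shm-size option (an assumption; --shm-size is a standard docker run flag, but the value you need depends on the MPI library and rank count). A minimal sketch, reusing the run command from step 3 with an arbitrary 1 GB value:

# Enlarge /dev/shm (64 MB by default) when running many MPI ranks; 1g is an example value
docker run -it --rm --shm-size=1g -v ~/my_directory:/home/user escomp/cesm_common_preview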


The full data model configuration runs out-of-the-box. However, the DATM-MOM6-CICE6-WW3 case fails during the build (for me at least - and presumably for everyone given that this is running in a container...).
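For reference, the build for the gtest case presumably follows the same steps as the atest example in the quoted instructions (an assumption; only create_newcase and cd are shown above for this compset):

cd gtest
./case.setup
./case.build    # the failure described in the next comment appears here, while building FMS
./case.submit   # not reached until the build succeeds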

dougiesquire commented 1 year ago

There are two issues that arise when building FMS. Quick and dirty fixes to get the case running are given below, with a shell sketch of applying them after the list:

  1. The container uses a recent version of glibc (>= 2.30), which provides a wrapper for the gettid() system call. However, the FMS version used defines a static gettid() function in FMS/src/affinity/affinity.c, which causes a compatibility error (see https://github.com/NOAA-GFDL/FMS/issues/426). This can be solved by building FMS with autotools (see https://github.com/NOAA-GFDL/FMS/issues/276), which conditionally sets the -DHAVE_GETTID directive, but that is not done within the container.
    Dirty fix: Remove the static keyword in front of gettid in FMS/src/affinity/affinity.c (line 46)

  2. The Fortran compiler seems to require that symbols in namelists are explicitly declared prior to the namelist statement. This is not done in FMS/src/topography/topography.F90. Some general discussion here: https://community.intel.com/t5/Intel-Fortran-Compiler/Intel-compiler-Funny-bug-with-namelist-non-standard-code-and/m-p/1266664#M155165.
    Dirty fix: Move the declaration of use_mpp_io (line 110) in FMS/src/topography/topography.F90 to before the namelist statement (line 104)
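
For what it's worth, the first fix can be scripted from a shell inside the container. This is a minimal sketch, assuming the declaration in affinity.c reads 'static pid_t gettid' as in the FMS issue linked above, and that the paths are relative to the FMS checkout; check with grep first, since line numbers and wording can differ between FMS tags. The second fix is easiest done by hand in an editor.

# Fix 1: drop the 'static' qualifier so gettid() no longer clashes with the glibc wrapper
grep -n "static pid_t gettid" FMS/src/affinity/affinity.c      # confirm the declaration first
sed -i 's/static pid_t gettid/pid_t gettid/' FMS/src/affinity/affinity.c

# Fix 2: locate the use_mpp_io declaration and the namelist statement, then move the
# declaration above the namelist by hand (around lines 104-110 in this FMS version)
grep -n -E "use_mpp_io|namelist" FMS/src/topography/topography.F90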