ESMCI / cime

Common Infrastructure for Modeling the Earth
http://esmci.github.io/cime
Other

Add ability to have CIME read an external-to-code site-specific configuration #3981

Closed briandobbins closed 1 year ago

briandobbins commented 3 years ago

Right now, I believe CIME pulls configuration files (e.g., config_machines.xml) from the CESM/E3SM source directory, with an override (or append?) via the user's ~/.cime directory. I think it'd be beneficial to also allow an environment variable to point to a site-level override.

The primary benefit of this is for the container/cloud configurations. In the cloud, the problems are fairly minor, since the entire platform is standardized, with only minor differences in hardware - eg, we need some way to tell the 'cloud' machine entry that this cloud system has 24 cores per node, whereas another has 48 cores per node. This may not actually impact anything other than utilities like 'preview_run', but it seems worth fixing.

The harder situation is with the HPC containers - the underlying platforms (eg, university clusters, new supercomputers, etc) are all very different in terms of things like paths, queuing systems, MAX_TASKS_PER_NODE, etc, which need some level of customization. If places are running the container as an image (vs as a sandbox), they can't modify the internal CIME configuration files, and having every user need to set up their ~/.cime directory seems heavy-handed.

So, for discussion, I'm suggesting something akin to (if it doesn't already exist?) a CIME_CONFIG_DIR environment variable that, if set, overrides or supplements settings in the 'config' directory within the code itself. This obviously still requires a minor amount of work, but it's much simpler - no software installation, just a few settings, and can be done system-wide, vs per user.

In a nutshell, I view it as another tier of CIME configuration, but one allowing out-of-source-tree site customization. I originally felt it should go at the top of the hierarchy, so it's CIME_CONFIG_DIR -> CIME's 'config' directory -> user's ~/.cime directory, but that could cause some chaos if people have a local copy of the code and have that variable set, so the safer option seems to be: CIME's 'config' dir -> CIME_CONFIG_DIR -> ~/.cime directory.
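To make the "safer" ordering concrete, here is a minimal sketch of how the three tiers might be assembled. It is an illustration only, not CIME's actual code: the function name `config_search_path`, the `config` subdirectory layout, and the assumption that later directories override earlier ones are all hypothetical, and `CIME_CONFIG_DIR` is the variable proposed above, not an existing one.

```python
import os
from pathlib import Path


def config_search_path(cime_root, environ=None, home=None):
    """Return config directories in read order; later entries are assumed to win.

    Hypothetical helper: the order is the one proposed in this issue
    (code defaults -> site directory -> per-user directory).
    """
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    dirs = [Path(cime_root) / "config"]        # defaults shipped with the code
    site_dir = environ.get("CIME_CONFIG_DIR")  # proposed variable; may be unset
    if site_dir:
        dirs.append(Path(site_dir))            # site-wide settings
    dirs.append(home / ".cime")                # per-user settings, read last
    return dirs
```

With the variable unset, the list collapses to the existing two tiers, so nothing would change for current users.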

One other minor wrinkle: if it doesn't already, CIME should do hostname matching only when CIME_MACHINE is not set. This would avoid problems where you're running the Singularity + CESM container on Cheyenne and the hostname matches, but you want the 'container' machine (via CIME_MACHINE=container) rather than the native 'cheyenne' entry matched by hostname.
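The precedence described here could look roughly like the sketch below. It is a hedged illustration rather than CIME's real machine-probing logic: `select_machine` and the `host_patterns` mapping (machine name to hostname regex) are hypothetical names introduced for the example.

```python
import os
import re
import socket


def select_machine(host_patterns, environ=None, hostname=None):
    """Pick a machine name, letting CIME_MACHINE bypass hostname matching.

    host_patterns is a hypothetical dict of {machine_name: hostname_regex}.
    """
    environ = os.environ if environ is None else environ

    forced = environ.get("CIME_MACHINE")
    if forced:
        # An explicit choice wins; hostname regexes are never consulted.
        return forced

    hostname = socket.gethostname() if hostname is None else hostname
    for machine, pattern in host_patterns.items():
        if re.search(pattern, hostname):
            return machine
    return None  # no match; the caller decides how to report this
```

For example, with `{"cheyenne": r"cheyenne\d*"}` as the pattern table, a host named `cheyenne3` resolves to `cheyenne` unless `CIME_MACHINE=container` is set, in which case `container` is returned.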

Thoughts?

billsacks commented 3 years ago

For a slightly different use case last year, I implemented an --extra-machines-dir argument to create_newcase that I think does something similar to what you want (https://github.com/ESMCI/cime/pull/3508; see also some rationale in https://github.com/ESMCI/cime/issues/3493). This gives an extra location where you can put config_machines.xml, config_compilers.xml and config_batch.xml files (but not the full set of files allowed in the ~/.cime directory).

My impression from your comment is that it could help to have this location set via an environment variable, maybe so that users can remain blissfully unaware of this location – or maybe so they can set it once and then forget about it? As much as we try to avoid the use of environment variables, I personally think this could be a reasonable use of one if there is a practical benefit: Since this environment variable applies before cime has done anything – and is needed for cime to know how to do anything substantive on the machine – it feels like there aren't good alternatives to an environment variable, unless it's reasonable to require the user to specify the --extra-machines-dir argument manually.

If you want to basically use the implementation of --extra-machines-dir but add an environment variable, I think it would be relatively simple to add code in create_newcase that uses the given environment variable by default for that flag.
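A minimal sketch of that idea, assuming an argparse-based command line: the environment variable supplies the default for the flag, and an explicit flag still takes priority. The parser below is a stand-in, not the real create_newcase parser, and `CIME_CONFIG_DIR` is only the name proposed earlier in this thread.

```python
import argparse
import os


def build_parser(environ=None):
    """Build a stand-in parser whose --extra-machines-dir defaults to an env var."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="create_newcase")
    parser.add_argument(
        "--extra-machines-dir",
        # Hypothetical variable name; None when it is unset.
        default=environ.get("CIME_CONFIG_DIR"),
        help="Extra directory holding machine, compiler and batch config files",
    )
    return parser


# The command-line flag overrides the environment-provided default.
args = build_parser({"CIME_CONFIG_DIR": "/opt/site/cime"}).parse_args(
    ["--extra-machines-dir", "/tmp/override"]
)
```

Reading the variable at parser-construction time keeps the rest of the code unaware of where the value came from.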

If I remember correctly, the current implementation – both of the ~/.cime directory and of the directory specified by --extra-machines-dir – involves appending to the existing machine files rather than replacing them. It's important to do it that way because the user-specific machines files typically leverage some general settings in the default files (such as settings that always apply for the intel compiler, regardless of what machine you're on).

briandobbins commented 3 years ago

Thanks for your thoughts, Bill. I'll look into your '--extra-machines-dir' implementation a bit just to get ideas and a better understanding of how it works. I do think having an environment variable makes sense, mostly so users don't even have to set it - e.g., sysadmins can install CESM, customize their environment via the directory specified in the environment variable, and set that variable in a module. Then, something like 'module load cesm' would set the variable, and users would be none the wiser unless they cared to dig. The plus here is that, unless we make big changes to the CIME variables, one location should suffice even if they check out newer versions of CIME in their home directory; it's not actually tied to a single release, I think.

> If I remember correctly, the current implementation – both of the ~/.cime directory and of the directory specified by --extra-machines-dir – involves appending to the existing machine files rather than replacing them. It's important to do it that way because the user-specific machines files typically leverage some general settings in the default files (such as settings that always apply for the intel compiler, regardless of what machine you're on).

That makes sense, and I assume, then, it prefers the last match of a variable? That way if I set something in ~/.cime that's already set elsewhere, it overrides it, but doesn't require everything to be set there, since by the time I'm reading ~/.cime, I have default values from the 'cime/config' directory? If so, that should work - we obviously want to keep things like the compiler flags, yes, but override things like MAX_TASKS_PER_NODE, and likely everything in config_batch.xml, among others.

Anyway, thanks! It seems this is worth pursuing, and as I get into the implementation, I'll surely run into new questions. I'll update once I have a working version, but it's a bit low on my priority list right now, so it might be a while.

billsacks commented 3 years ago

> I assume, then, it prefers the last match of a variable?

I'm pretty sure that's how it works.
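As a toy illustration of "last match wins" (plain dictionaries standing in for the parsed XML files; this is not how CIME actually stores or merges its settings, and the values are made up):

```python
def merge_settings(*layers):
    """Merge setting layers in read order; a later layer overrides an earlier one."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical values: defaults from the code, then a site layer, then the user layer.
defaults = {"MAX_TASKS_PER_NODE": 48, "COMPILER": "intel"}
site = {"MAX_TASKS_PER_NODE": 24}  # only the setting that differs on this system
user = {}                          # nothing overridden in ~/.cime

settings = merge_settings(defaults, site, user)
```

Here `settings` keeps the default compiler entry while picking up the site's tasks-per-node value, which is the behavior the question above is asking about.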

One other consideration here is how to ensure stability and reproducibility of a given machine configuration. For example, if a sysadmin wants to change the default compiler version, this will need to be done in a way that (1) doesn't impact currently-running cases, and (2) any given configuration can be documented and restored for the sake of reproducibility. This isn't really an issue with your proposed cime implementation as much as it is something that the sysadmins and/or users would need to keep in mind. But I wanted to mention it because these are some of the nice things that cime gives you that you can lose by maintaining machine ports outside of the cime config area.

briandobbins commented 3 years ago

Thanks again - in terms of stability and reproducibility, I guess this is something to think about, but it shouldn't impact the containerized versions of the model, since they embed the compilers, libraries, etc. It'll only impact source versions where someone is using the CIME_CONFIG_DIR external setup and that's been changed by a sysadmin. I think this is a very uncommon scenario; if they wanted to change something, they could make a second directory, a new module, and have that one point to the new, changed directory, leaving everyone using the old directory unaffected.

That said, one way to do this would be to just allow the 'logistical' variables to be set in this site-specific directory - eg, things like the number of tasks per node, the queues, the directories (which, really, for containers might be a new variable of 'mount points' vs trying to parse paths), etc. Nothing that should be answer-changing. This seems like a lot of work to 'curate' which variables can be there, and a lot of effort to handle questions about it, for a scenario I feel is pretty unlikely, so I think for now it's not something to dwell on too much.

I'll work on getting a version of the first approach set up, and maybe partner with a university or two on testing it.

Thanks again!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.