E3SM-Project / E3SM


setting PCOLS at runtime (discussion) #3538

Closed worleyph closed 4 years ago

worleyph commented 4 years ago

Recent experiments indicate that the performance of the physics in the current version of EAM can be improved in certain circumstances by optimizing the choice of the declared size of the first dimension (pcols) in the basic atmospheric physics data structure (chunk). (See Confluence page https://acme-climate.atlassian.net/wiki/x/wAZ_QQ .) This was also the case when the design was first implemented in the model in the early 2000s, but the analysis at the time was that the optimum had more to do with processor architecture (cache size, vector vs. RISC) than with problem granularity (mesh resolution and degree of parallelism), and that a default value of pcols=16 was pretty good for most non-vector processors. The recent experiments instead indicate that setting pcols to minimize wasted space and, perhaps, the number of chunks, especially for the PG2 meshes and at scale, can improve on the performance of the default enough to make it worthwhile to start optimizing over pcols again, and to do so on the basis of problem granularity as well as processor architecture.
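(A hypothetical illustration of the wasted-space point: if a process owns 40 physics columns, the default pcols=16 packs them into 3 chunks with 3*16 = 48 column slots, so 8 slots are allocated but never used; pcols=20 gives 2 fully packed chunks and no waste.)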

The other result from the original design experiments was that setting pcols at compile-time allowed the compilers (at the time) to generate more efficient code, leading to better performance than otherwise. This is the current design, and changing pcols requires a clean build of EAM. However, the code base has evolved significantly since the original experiments were conducted, and some data structures (still chunk-based) now use a first-dimension size that is calculated at runtime, one example being psetcols. psetcols (and the other examples) is still determined by compile-time parameters (pcols and psubcols), but this may not be exploitable by the compiler. Compilers and the target processor architectures have also evolved over the last twenty years, so it is unclear whether the original empirical results are even relevant to current compilers and target processors.

With pcols being a compile-time parameter, optimization studies must include a rebuild in every experiment to determine the optimal setting for a given compset, mesh resolution, PE layout, and target system. This increases the effort (both turnaround time and amount of manual interaction) required for these studies compared to what would be required if pcols could be specified at runtime. There is also a desire to develop a better default value than 16, taking into account the mesh resolution and PE layout, for example. See GitHub issue #3485 ("Automated selection of pcols in EAM").

With the current compile-time dependency, any "smart" setting of pcols must occur in the CIME layer, which is more restricted than what can take place at runtime, when more is knowable about the particular case. With a runtime setting of pcols, we could also allow pcols to vary between processes. I am not sure that this would be all that useful, but when load balancing is restricted, e.g. phys_loadbalance=0, or when running in a heterogeneous environment, it might be.

Using compset FC5AV1C-L, I successfully modified EAM to use a runtime-specified pcols. (This is available as branch worleyph/cam/runtime_pcols.) I was not optimistic that this was even feasible when I began the effort, and was surprised at how little needed to be modified. However, this is not a definitive result as not all physics packages are enabled with this basic compset, and other packages may present implementation problems. The feasibility, therefore, is still to be determined.

The current implementation introduces an EAM namelist variable phys_chunk_fdim that can be set in user_nl_cam. The "old" compile-time parameter has been renamed ppcols (but is still set with the CPP variable PCOLS or, equivalently, by adding the -pcols option to CAM_CONFIG_OPTS in env_build.xml), and is used only to define the default value of pcols. The runtime pcols can be larger or smaller than ppcols (but still must be >= 1). If we decide to move to a runtime setting of pcols, pcols will be added as an option in env_run.xml, with one or more options (pcols == -1, -2, ... ?) that will calculate a default value based on the best available information about the problem case and target architecture. This will address GitHub issue #3485 . (The reason for multiple "smart" settings is that we might want to specify a maximum for memory optimization reasons as well as trying to minimize the number of chunks. Perhaps -64 could mean to calculate a space-efficient and chunk-count-efficient pcols, subject to pcols being no larger than 64.)
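For concreteness, the runtime path amounts to a single namelist entry in the case directory (the value here is only illustrative):

    ! in user_nl_cam; takes effect at runtime, no rebuild required
    phys_chunk_fdim = 24

while the compile-time path remains -pcols in CAM_CONFIG_OPTS (env_build.xml) followed by a clean build.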

A compile-time option was also added (by defining a new CPP token PPCOLS) to ignore phys_chunk_fdim and still set pcols at compile-time. This does not entirely restore the original behavior, as new code was introduced to allow for the runtime setting of pcols, and this code is still executed even if pcols is defined at compile-time. The new code primarily redefines some "global" module variables to be allocatable and then allocates them using pcols during initialization.
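To illustrate the kind of change this involves, here is a schematic sketch (not the actual EAM code; the module and variable names are invented) of a module array whose first dimension used to be declared with the compile-time pcols:

    module example_chunked_data
      implicit none
      integer, parameter :: r8 = selected_real_kind(12)
      ! was:  real(r8) :: buf(pcols, pver)  with pcols a compile-time parameter
      real(r8), allocatable :: buf(:,:)
    contains
      subroutine example_chunked_data_init(pcols, pver)
        ! called during initialization, once the runtime value of pcols is known
        integer, intent(in) :: pcols, pver
        allocate(buf(pcols, pver))
      end subroutine example_chunked_data_init
    end module example_chunked_data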

Whether this new runtime pcols capability is useful or not comes down to:

 (a) whether there is a significant advantage to optimizing over pcols in production scenarios,
 (b) whether the improved ease of optimization when setting pcols at runtime is needed, i.e., how often we would need to do this, or whether the ability to determine the default at runtime based on the particulars of the case is a significant improvement over what can be determined at compile-time, and
 (c) whether there is a significant performance disadvantage to setting pcols at runtime, i.e., whether the performance degradation from using a runtime pcols offsets the advantages of optimizing pcols.

I ran performance experiments on Compy and on Cori-KNL to examine these questions, and wrote up the results on the Confluence page

https://acme-climate.atlassian.net/wiki/x/TwNzTg

The decision I would like help with is whether to

 a) leave well enough alone,
 b) add the runtime pcols option to master, but leave the default as -DPPCOLS, then use the runtime option only for optimization exercises, or
 c) add the runtime pcols option to master and make this the default, thus trading off the potential performance degradation for the increase in usability.

whannah1 commented 4 years ago

For the MMF case with the CRM running on the GPU, I think being able to optimize the chunk size [relatively] quickly would be valuable since we need to be able to pack a lot of CRM columns into each chunk in order to get the best speed-up. This functionality would likely be re-used many times because we are planning to start running on various new architectures in the coming years.

Also, some of us (including Mike Pritchard) are very interested in exploring dynamic load balancing that would allow different size CRMs to be used in different areas. Perhaps two new run-time namelist variables could make this development smoother, e.g. a "phys_chunk_algorithm" variable in addition to the proposed "phys_chunk_fdim", with phys_chunk_fdim used when phys_chunk_algorithm='FIXED'.
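As a sketch, that might look like the following in user_nl_cam (these entries are hypothetical; only phys_chunk_fdim is actually proposed above):

    ! hypothetical namelist entries illustrating the suggestion
    phys_chunk_algorithm = 'FIXED'
    phys_chunk_fdim = 32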

I'm not sure I understand why we would need to keep the CPP variable to set PCOLS at compile time. Could there be some untested physics routines that are structured to use the compile time PCOLS in such a way that would require a major refactor?

ambrad commented 4 years ago

I'm OK with whatever decision is made. Even (a) is OK b/c run scripts tend to be reused a lot, so it's not that hard to get folks who are interested in a bit more speedup to insert something like this in the script:

    # total physics columns assumed to be 4 * $nelem (as for a pg2 mesh);
    # divide among the tasks, rounding up
    @ pcol = (4 * $nelem + $ntask - 1) / $ntask
    # very large chunks: fall back to the old default of 16
    if ($pcol > 40) then
        @ pcol = 16
    endif
    # moderately large chunks: halve the chunk size (rounding up)
    if ($pcol > 20) then
        @ pcol = ($pcol + 1) / 2
    endif

and then get the benefit over many runs. So the speedup will be available and useful regardless of path forward.

worleyph commented 4 years ago

I'm not sure I understand why we would need to keep the CPP variable to set PCOLS at compile time.

There is a performance loss from setting pcols at runtime. For production runs, it might be desired to set pcols at compile-time to maximize throughput. While my measurements indicate only a 1-2% loss, this was on only two systems, both using the Intel compiler.

worleyph commented 4 years ago

@singhbalwinder , what compsets and/or modifications to CAM_CONFIG_OPTS should I look at, to check whether other code needs to be modified to support the runtime setting of pcols? Thanks.

singhbalwinder commented 4 years ago

@worleyph: Thanks for the nice summary. I think the idea here is to treat PCOLS as a run-time constant, with CAM_CONFIG_OPTS as one of the ways to set it. In that case, I think the compsets whose CAM_CONFIG_OPTS include -nlev 72 are the ones developed by E3SM and related projects. These compsets generally have an "av1c", or at least a "v1", string in their names.

I have seen some folks still using old compsets with the FV dycore. I am not sure whether those should still be supported, or whether we can handle them on a case-by-case basis.

worleyph commented 4 years ago

@singhbalwinder , thank you for the quick response. At this stage, I prefer not to just invoke the integration test suite - I would rather try one compset at a time to verify that the runtime pcols works or to determine that more code modifications are needed. I was hoping for a short list of important chemistry packages or other physics parameterizations that are not included in FC5AV1C-L (and pointers to how to enable them).

Don't know if this is what you meant - but a great idea to use an explicit inclusion of -pcols in CAM_CONFIG_OPTS to force the addition of -DPPCOLS. I've been modifying Macros.cmake manually so far. I'll add this right away. Thanks!

singhbalwinder commented 4 years ago

@worleyph: I can't think of any other chemistry package being used by E3SM except some being worked on under NGD. The NGD chemistry packages are not in master yet. You can test -chem none (prescribed aerosols package) if we need one more way to test your new implementation.

On the physics side as well, I am not aware of any new developments except the ones being worked on under NGDs. Apart from FC5AV1C-L, you can test aqua planet (e.g. F-EAMv1-AQP1) and the single-column configuration (e.g. FSCM5A97). The single-column configuration actually uses -DPCOLS=1 in CCSM_cppdefs (the Buildconf/camconf/CCSM_cppdefs file in the case directory); it is probably added by the configure script of eam.

worleyph commented 4 years ago

Thanks! I already tried COSP. Didn't know if there was anything else to try. I'll assume that I am good to go (assuming we decide to proceed with this).

whannah1 commented 4 years ago

@worleyph If you're testing aqua planet (F-EAMv1-AQP1) you might as well also test the RCE compset (F-EAMv1-RCEMIP) since that's somewhat different.

worleyph commented 4 years ago

@whannah1 , F-EAMv1-RCEMIP is broken (in the hash that I am using):

 ... This name does not have a type, and must have an explicit type.   [DEP_INPUTS]
 call wetdep_inputs_unset(dep_inputs)

in subroutine aero_model_surfarea, in aero_model.F90. The error message is correct - dep_inputs is not declared in this routine. It is declared in subroutine aero_model_wetdep, where

 call wetdep_inputs_set( state, pbuf, dep_inputs )
 ...
  call wetdepa_v1( state%t, state%pmid, state%q(:,:,1), state%pdel, &
       dep_inputs%cldt, ...

but only as a local variable. The compiler would probably complain about the call to wetdep_inputs_unset next, as the routine aero_model_surfarea also does not use the wetdep module, so wetdep_inputs_unset is not defined.

whannah1 commented 4 years ago

That's a shame, especially since aerosol effects are effectively disabled in that compset. I'll put this on my to-fix list.

worleyph commented 4 years ago

I've been looking at generalizing setting pcols, from specifying via the namelist argument (which is working) to calculating a good choice and possibly allowing it to vary between processes. This generalization has to be done in phys_grid_init. Unfortunately, this is "too late", as the routine phys_register is called first and needs to know pcols. Setting via a namelist parameter works because read_namelist is called before phys_register.

I'm looking now to see whether the call to phys_register can be moved to after the call to phys_grid_init. This is in cam_init, in control/cam_comp.F90, where phys_grid_init is called in either cam_initial or cam_read_restart, both also called in cam_init. If anyone has any insight, please speak up :-).

worleyph commented 4 years ago

I got an initial run to work by moving phys_register after phys_grid_init, and also by moving the pbuf_add_field calls for FRONTGF and FRONTGA out of dyn_init1 to right after phys_grid_init. I haven't tried a restart run yet.

Does anyone know the history of use_gw_front and why these pbuf_add_field calls are in a dynamics routine instead of in phys_register? (@mt5555 ?) Can I move them to phys_register? They seem to be dycore specific, and the indices frontgf_idx and frontga_idx are also defined in dyn_comp, but it should be possible to do this?

whannah1 commented 4 years ago

@worleyph The dynamics provides input to the frontal gw scheme, and these inputs have to be mapped by the physgrid algorithm. So the dynamics and the dyn/phys coupling code need to know if that scheme is going to be used.

worleyph commented 4 years ago

@whannah1 , thanks. But the pbuf_add_field calls do not appear to be needed so early? At least for an initial run, it works fine if they are moved to after the call to phys_grid_init.

whannah1 commented 4 years ago

@worleyph are you using the physgrid in these tests? I need to look more carefully, but what you're saying makes sense.

worleyph commented 4 years ago

are you using the physgrid in these tests?

Not sure what this means - I am using compset FC5AV1C-L with ne30_ne30 .

whannah1 commented 4 years ago

@worleyph that grid is not using the physgrid. Try running the same compset with "-res ne30pg2_ne30pg2"

worleyph commented 4 years ago

Thanks. Will do.

worleyph commented 4 years ago

@whannah1 , initial runs are working with physgrid as well (ne30pg2_ne30pg2).

whannah1 commented 4 years ago

ok, thanks for checking

worleyph commented 4 years ago

@whannah1 and @singhbalwinder and @mt5555 , I've worked out a proposal to reorder some steps in the initialization, to allow pcols to be set inside of phys_grid_init:

  1. moved the definition of the frontgf_idx and frontga_idx indices into physpkg.F90 as public module variables.
  2. moved the pbuf_add_field commands for frontgf_idx and frontga_idx into the end of phys_register.
  3. moved phys_register call to immediately after the call to phys_grid_init, and put both of these at the end of dyn_init1
  4. renamed dyn_init1 to be dyn_physgrid_init (but left dyn_init2 as is)

Note that this simplifies the logic surrounding phys_grid_init, which previously was called in cam_initial (in dynamics/se/inital.F90) for initial runs and in cam_read_restart (in control/cam_restart.F90) for restart runs. The three routines (dyn_init1/phys_grid_init/phys_register) are necessarily bundled together, and in this order, so there is no reason not to make that explicit.

I can do this in the runtime pcols PR, or propose this in a separate PR, though the rationale for doing so is not obvious separate from the runtime pcols PR.

worleyph commented 4 years ago

or rename dyn_init1 to be dyn_phys_init1, and then change phys_init to phys_init2, so now there are

 dyn_phys_init1
 dyn_init2
 phys_init2

worleyph commented 4 years ago

While I have no way (that I know of) to test the changes, I'd like to make the changes to the other dycores (fv, eul, sld) that are required by the reordering of the initialization logic. This is straightforward (for fv, at least - I can't really tell for the others) if I back off some of the changes described earlier. The list is now:

  1. moved the phys_register call to immediately after the call to phys_grid_init, and put both of these in dyn_init1 at the very end;
  2. moved the pbuf_add_field commands for frontgf_idx and frontga_idx from the beginning of dyn_init1 to the end, after the call to phys_register;
  3. renamed dyn_init1 to be dyn_phys_init1 (but left dyn_init2 as is).

so the number of changes is smaller. This is working for both initial and restart runs for se.

worleyph commented 4 years ago

So, I just submitted a PR for "part 1" of adding pcols as a runtime option. This sets the runtime pcols during read_namelist, so early on. I'll continue developing the version where pcols is set during phys_grid_init, and propose a follow-on PR (if the first one makes it in).

worleyph commented 4 years ago

You can test -chem none (prescribed aerosols package) if we need one more way to test your new implementation.

@singhbalwinder , building with -chem none failed in the same way as F-EAMv1-RCEMIP, with

 xxx/cam/src/chemistry/bulk_aero/aero_model.F90(974): error #6404: This name does not have a type, and must have an explicit type.   [DEP_INPUTS]
     call wetdep_inputs_unset(dep_inputs)

which is an error in the hash I checked out. It appears to have been fixed in master now. (call to wetdep_inputs_unset is now deleted - it was never needed??? The subroutine doesn't even exist any more.)

worleyph commented 4 years ago

My apologies @singhbalwinder and @whannah1 , this is my bug. The routine wetdep_inputs_unset is something that I added, and I put the call in the wrong location. I'll fix, test, and push a commit as soon as I can.

worleyph commented 4 years ago

@whannah1 , now working with F-EAMv1-RCEMIP. Sorry for the false alarm.

whannah1 commented 4 years ago

@worleyph, I'm relieved to hear RCE isn't broken, so thanks for that update.

As for your proposed changes, I'm curious why you opted to make frontgf_idx and frontga_idx public module variables in physpkg instead of just making the routines that need them use local variables and call pbuf_get_index()?

Also, I'm not sure I understand how you plan to change or move phys_grid_init(). I'm mostly curious how this affects the restart. I vaguely remember running into a minor issue with the physgrid initialization for restarts, so I just want to make sure this movement won't create a similar issue.

worleyph commented 4 years ago

@ambrad and @singhbalwinder , in my development sandbox I currently have the namelist variables

phys_chnk_fdim
phys_chnk_fdim_max
phys_chnk_fdim_mult

If phys_chnk_fdim is > 0, then pcols = phys_chnk_fdim. Otherwise, the code tries to do something clever, calculating a pcols value that minimizes wasted space and the number of chunks. (When phys_loadbalance /= 2, the calculated pcols can vary from process to process.)
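A minimal sketch of the kind of calculation this does (illustrative only; the function and variable names are invented, and the max/mult modifiers described just below enter as a cap and a rounding step):

    ! Illustrative sketch, not the actual implementation.  ncols_p is the number
    ! of physics columns owned by this process (assumed >= 1); fdim_max and
    ! fdim_mult play the roles of phys_chnk_fdim_max and phys_chnk_fdim_mult.
    ! Corner cases (e.g., fdim_max not a multiple of fdim_mult) are ignored.
    function choose_pcols(ncols_p, fdim_max, fdim_mult) result(pcols)
      implicit none
      integer, intent(in) :: ncols_p, fdim_max, fdim_mult
      integer :: pcols, cap, nchunks
      cap = ncols_p
      if (fdim_max > 0) cap = min(cap, fdim_max)   ! honor the upper bound
      nchunks = (ncols_p + cap - 1) / cap          ! fewest chunks allowed by the cap
      pcols = (ncols_p + nchunks - 1) / nchunks    ! least wasted space for that chunk count
      pcols = ((pcols + fdim_mult - 1) / fdim_mult) * fdim_mult   ! make it a multiple of fdim_mult
    end function choose_pcols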

The other namelist parameters modify the calculation, so that

 pcols <= phys_chnk_fdim_max

and

 pcols is an integer multiple of phys_chnk_fdim_mult 

(defaults are phys_chnk_fdim_max = -1 (i.e., ignore) and phys_chnk_fdim_mult = 1). To capture the results of the pcols evaluation studies I can add

<phys_chnk_fdim_max dyn="se" npg="2"> 16 </phys_chnk_fdim_max>

to namelist_defaults_cam.xml (so that pcols <= 16 for ne30pg2, for example). However, I'd also like a system-specific rule, e.g. something like

<phys_chnk_fdim_max mach="compy" dyn="se" npg="0"> 24 </phys_chnk_fdim_max>

since Compy prefers pcols < 24 while Cori-KNL prefers larger values (e.g. 74 for 675x1 and ne30). I don't see any support for system-specific namelist defaults. I see compiler and OS (and OS is sufficient for bgl, bgp, etc.), but nothing to distinguish between Compy and Cori-KNL, or even Cori-KNL and Cori-Haswell. Am I missing something? Would there be a way to add something like this? MACH is a variable known in other parts of E3SM.

worleyph commented 4 years ago

As for your proposed changes, I'm curious why you opted to make frontgf_idx and frontga_idx public module variables in physpkg instead of just making the routines that need them use local variables and call pbuf_get_index()?

Looking at the rest of the logic, it seemed inconsistent - use_gw_front is defined in phys_control, so this just seemed to be a "physics thing", and the pbuf_add_field calls seemed like they should live in phys_register, like everything else. (This may not really answer your question - the design/usage of pbuf_add_field and pbuf_get_index is not something that I am familiar with.)

However, I changed my mind and put these back into dyn_comp.F90, to make it easier to "not break" the other dycores with this change - not that we are trying to support them anyway. The pbuf_add_field calls just got moved to the end of dyn_init1 (now dyn_phys_init1), after the calls to phys_grid_init and phys_register.

worleyph commented 4 years ago

Also, I'm not sure I understand how you plan to change or move phys_grid_init(). I'm mostly curious how this affects the restart. I vaguely remember running into a minor issue with the physgrid initialization for restarts, so I just want to make sure this movement won't create a similar issue.

See the earlier "correction" to my plans

  1. moved the phys_register call to immediately after the call to phys_grid_init, and put both of these in dyn_init1 at the very end;
  2. moved the pbuf_add_field commands for frontgf_idx and frontga_idx from the beginning of dyn_init1 to the end, after the call to phys_register;
  3. renamed dyn_init1 to be dyn_phys_init1 (but left dyn_init2 as is).

I tested with restarts as well. This motivated the move of the phys_grid_init and phys_register calls into dyn_init1. Otherwise, for initial runs the changes went into inital.F90, while for restarts they went into cam_restart.F90. It simplified everything to put them into dyn_init1.

I've tested restarts for ne30. I'll do the same for ne30pg2 in a moment.

worleyph commented 4 years ago

The ERS test worked with ne30pg2_ne30pg2 with my current set of changes. Thanks for suggesting this.

whannah1 commented 4 years ago

@worleyph that's great, thanks for checking.

worleyph commented 4 years ago

@singhbalwinder , I figured out how to add system-specific rules for setting namelist defaults. It required modifying configure (to add a -mach parameter) and being creative with config_component.xml, adding -mach $MACH as a base element to CAM_CONFIG_OPTS for all compsets via

  <value compset=""               >-mach $MACH</value>

I'll be able to do something similar if compiler-specific rules are also required.

So, @ambrad, I can reproduce your script-level algorithm for pcols at runtime now with some rules like:

 <phys_chnk_fdim_max dyn="se" npg="2"> 16 </phys_chnk_fdim_max>

in namelist_defaults_cam.xml. For the non-npg=2 cases, I can now differentiate between Compy and Cori-KNL with, for example,

 <phys_chnk_fdim_max mach="compy" dyn="se" npg="0"> 24 </phys_chnk_fdim_max>

(and not have an upper bound for Cori-KNL). I don't know whether the parsing is case sensitive or not - I'll have to experiment with that.

However, with this, I'm probably ready for a part 2 runtime pcols PR. Have to wait for the first one to be accepted and merged first though. I'll keep testing this one in the meantime.

singhbalwinder commented 4 years ago

@worleyph : I am a little confused as to why we need different settings for different machines? Is it due to performance reasons? Thanks!

worleyph commented 4 years ago

@singhbalwinder , yes. If you look at

https://acme-climate.atlassian.net/wiki/x/TwNzTg

for the ne30_ne30 experiment (experiment (B)), Cori-KNL prefers large pcols, while Compy does not. I want to be able to capture this. Cori-KNL may also prefer that pcols be a multiple of 2, while Compy not so much. (This latter result may just be seeing patterns that aren't really there, however.)

singhbalwinder commented 4 years ago

Thanks! That makes sense. So the idea here is that we will have a default PCOLS setting and then machine specific settings will override that default, right?

worleyph commented 4 years ago

Correct. The runtime default will be to calculate the pcols that minimizes wasted space and the number of chunks; special cases (where we know better) can then be implemented by changing the defaults for phys_chnk_fdim_max and/or phys_chnk_fdim_mult. Note that if we know what pcols should be, we can set it directly with phys_chnk_fdim in user_nl_cam (runtime) or with -pcols in CAM_CONFIG_OPTS (compile-time). The only downside of setting pcols directly is that it applies across all processes (no process-specific values). This is what you want for phys_loadbalance = 2 (the default), but you can do better for the other load balancing options, if those are of interest.