Closed: DeniseWorthen closed this pull request 2 years ago.
@DeniseWorthen I need to update my exchange grid fork based on these changes and test it again. Are you planning to update CMEPS in UFS after merging this PR?
@uturuncoglu I ran the UFS HAFS regression tests and they all passed. Also, I will make a PR back to UWM with these changes. That will also bring in any other updates that have been made since I last updated the EMC fork (Feb 3).
@DeniseWorthen That is great. Do I need to create the PR against the authoritative repo or the NOAA-EMC fork? I was not sure. Of course, it would be a draft at this point.
@uturuncoglu I'm not 100% sure which parts of CMEPS your xgrid work touches. Do you need changes in FldsExchange_nems? I'm happy to work w/ you on getting those put in now (even if not functional) if that saves effort.
Yes, there are some mods in FldsExchange_nems, since I introduced two new coupling modes. I also have some changes in the flux computation part under the ufs/ directory.
@uturuncoglu @denise - this must be tested with cesm as well. We need to fire off the pre-alpha tests using cesm2_3_beta08 as a baseline. @uturuncoglu - are you willing to do this? If not, I can take this on.
@mvertens Sure, I can run it and let you know.
Thank you! You will need to merge the latest cmeps master into this PR to have this working - but that should be part of the testing. Does that make sense?
@mvertens I checked out CMEPS master and merged it with @DeniseWorthen's branch, so it should be fine at this point.
@uturuncoglu - that sounds great. Thank you.
@mvertens It will take longer than I thought. I have an issue with my disk quota, since I am keeping all of the 35-day-long runs for the exchange grid work. I'll try to resolve that first and then start the tests again.
@uturuncoglu - no problem. Thank you so much for doing this!!! Let me know if you want me to take this on if it gets too complicated on your end.
@uturuncoglu There is nothing time-critical in this PR on my side; it is just work I started as part of the wave coupling, and I thought I'd take the time to get it committed. If it is easier to proceed on the X-grid changes w/o the changes in this PR, that is fine too. We can always circle back to it.
@DeniseWorthen Thanks. No, that is fine with me. I think this should go first, and then I can make the required modifications for the exchange grid. I'll update you all once I have run the CESM pre-alpha tests.
@mvertens @DeniseWorthen I ran all the tests; here is the list of failed tests and their error logs (to be sure, I reran them separately with create_test after running the full test suite and cleaning my scratch space, since some of them were failing due to my disk quota). At this point, I don't think the errors are caused by the changes in this PR, so it seems safe to merge, but let me know what you think.
DAE_N2_D_Lh12_Vnuopc.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_intel.clm-DA_multidrv
2022-04-05 23:35:23: Test 'DAE_N2_D_Lh12_Vnuopc.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_intel.clm-DA_multidrv' failed in phase 'CREATE_NEWCASE' with exception 'ERROR: _N option not supported by nuopc driver, use _C instead'
File "/glade/scratch/turuncu/CESM_pr_279/cime/scripts/Tools/../../scripts/lib/CIME/test_scheduler.py", line 1080, in _run_catch_exceptions
return run(test)
File "/glade/scratch/turuncu/CESM_pr_279/cime/scripts/Tools/../../scripts/lib/CIME/test_scheduler.py", line 669, in _create_newcase_phase
expect(False, "_N option not supported by nuopc driver, use _C instead")
File "/glade/scratch/turuncu/CESM_pr_279/cime/scripts/Tools/../../scripts/lib/CIME/utils.py", line 163, in expect
raise exc_type(msg)
---------------------------------------------------
ERP_D_Ln9_Vnuopc.C48_C48_mg17.QPC6.cheyenne_intel.cam-outfrq9s
Building test for ERP in directory /glade/scratch/turuncu/ERP_D_Ln9_Vnuopc.C48_C48_mg17.QPC6.cheyenne_intel.cam-outfrq9s.20220405_233500_pe3gnl
/glade/scratch/turuncu/CESM_pr_279/components/cam/src/dynamics/fv3/atmos_cubed_sphere/tools/fv_mp_mod.F90(75): error #6580: Name in only-list does not exist or is not accessible. [MPP_NODE]
ERROR: BUILD FAIL: cam.buildlib failed, cat /glade/scratch/turuncu/ERP_D_Ln9_Vnuopc.C48_C48_mg17.QPC6.cheyenne_intel.cam-outfrq9s.20220405_233500_pe3gnl/bld/atm.bldlog.220406-004859
The details build log is in /glade/scratch/turuncu/ERP_D_Ln9_Vnuopc.C48_C48_mg17.QPC6.cheyenne_intel.cam-outfrq9s.20220406_153639_s2y85c/bld/atm.bldlog.220406-153959
ERP_D_Ln9_Vnuopc.f09_f09_mg17.FSD.cheyenne_intel.cam-outfrq9s_contrail
81:MPT ERROR: Rank 81(g:81) received signal SIGFPE(8).
81: Process ID: 58605, Host: r6i6n12, Program: /glade/scratch/turuncu/ERP_D_Ln9_Vnuopc.f09_f09_mg17.FSD.cheyenne_intel.cam-outfrq9s_contrail.20220406_154930_sfdkz3/bld/cesm.exe
81: MPT Version: HPE MPT 2.22 03/31/20 15:59:10
81:
81:MPT: --------stack traceback-------
81:OMP: Warning #190: Forking a process while a parallel region is active is potentially unsafe.
46:MPT ERROR: Rank 46(g:46) received signal SIGFPE(8).
46: Process ID: 21616, Host: r13i2n20, Program: /glade/scratch/turuncu/ERP_D_Ln9_Vnuopc.f09_f09_mg17.FSD.cheyenne_intel.cam-outfrq9s_contrail.20220406_154930_sfdkz3/bld/cesm.exe
46: MPT Version: HPE MPT 2.22 03/31/20 15:59:10
I also ran this test with the ESMF PET logs enabled, but there are no errors in them, so this requires further investigation.
SMS_D_Ln9_Vnuopc.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCnudged.cheyenne_intel.cam-outfrq9s_refined_camchem
1908:MPT: #1 0x00002b37033d5306 in mpi_sgi_system (
1908:MPT: #2 MPI_SGI_stacktraceback (
1908:MPT: header=header@entry=0x7fff74fe8c50 "MPT ERROR: Rank 1908(g:1908) received signal SIGFPE(8).\n\tProcess ID: 7399, Host: r7i4n30, Program: /glade/scratch/turuncu/SMS_D_Ln9_Vnuopc.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCnudged.cheyenne_intel.ca"...) at sig.c:340
1908:MPT: #3 0x00002b37033d54ff in first_arriver_handler (signo=signo@entry=8,
1908:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b3712d00080) at sig.c:489
1899:MPT: #4 0x00002ac3b632d793 in slave_sig_handler (signo=8, siginfo=<optimized out>,
1899:MPT: extra=<optimized out>) at sig.c:565
1899:MPT: #5 <signal handler called>
1899:MPT: #6 0x00000000011d1e78 in physconst::get_hydrostatic_energy (i0=1, i1=16,
1899:MPT: j0=1, j1=1, nlev=32, ntrac=200,
1899:MPT: tracer=<error reading variable: value requires 819200 bytes, which is more than max-value-size>, pdel=..., cp_or_cv=..., u=..., v=..., t=..., vcoord=0,
1899:MPT: ps=..., phis=..., z=...,
1899:MPT: dycore_idx=<error reading variable: Cannot access memory at address 0x0>,
1899:MPT: te=..., se=<error reading variable: Cannot access memory at address 0x0>,
1899:MPT: ke=<error reading variable: Cannot access memory at address 0x0>,
1899:MPT: wv=<error reading variable: Cannot access memory at address 0x0>, h2o=...,
1899:MPT: liq=<error reading variable: Cannot access memory at address 0x0>, ice=...)
1899:MPT: at /glade/scratch/turuncu/CESM_pr_279/components/cam/src/utils/physconst.F90:1244
1899:MPT: #7 0x0000000002c385d4 in check_energy::check_energy_timestep_init (state=...,
1899:MPT: tend=..., pbuf=0x2ae3b7a49f80,
1899:MPT: col_type=<error reading variable: Cannot access memory at address 0x0>)
1899:MPT: at /glade/scratch/turuncu/CESM_pr_279/components/cam/src/physics/cam/check_energy.F90:254
1899:MPT: #8 0x0000000003242f04 in dp_coupling::derived_phys_dry (phys_state=...,
1899:MPT: phys_tend=..., pbuf2d=0x2ae3b7a49f80)
1899:MPT: at /glade/scratch/turuncu/CESM_pr_279/components/cam/src/dynamics/se/dp_coupling.F90:700
1899:MPT: #9 0x00000000031f77b2 in dp_coupling::d_p_coupling (phys_state=...,
1899:MPT: phys_tend=..., pbuf2d=0x2ae3b7a49f80, dyn_out=...)
1899:MPT: at /glade/scratch/turuncu/CESM_pr_279/components/cam/src/dynamics/se/dp_coupling.F90:289
1899:MPT: #10 0x0000000002483a52 in stepon::stepon_run1 (dtime_out=225, phys_state=...,
1899:MPT: phys_tend=..., pbuf2d=0x2ae3b7a49f80, dyn_in=..., dyn_out=...)
1899:MPT: at /glade/scratch/turuncu/CESM_pr_279/components/cam/src/dynamics/se/stepon.F90:110
1899:MPT: #11 0x0000000000a209eb in cam_comp::cam_run1 (
1899:MPT: cam_in=<error reading variable: value requires 147400 bytes, which is more than max-value-size>,
1899:MPT: cam_out=<error reading variable: value requires 151800 bytes, which is more than max-value-size>)
1899:MPT: at /glade/scratch/turuncu/CESM_pr_279/components/cam/src/control/cam_comp.F90:243
1899:MPT: #12 0x00000000009d38fc in atm_comp_nuopc::datainitialize (gcomp=..., rc=0)
1899:MPT: at /glade/scratch/turuncu/CESM_pr_279/components/cam/src/cpl/nuopc/atm_comp_nuopc.F90:873
1899:MPT: #13 0x00002ac3b00a9432 in ESMCI::MethodElement::execute(void*, int*) const ()
1899:MPT: at /glade/p/cesmdata/cseg/PROGS/build/28560/esmf-8.2.0b23/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
1899:MPT: #14 0x00002ac3b00aa896 in ESMCI::MethodTable::execute (this=0x17541d20,
1899:MPT: labelArg=..., object=0x1753f020, userRc=0x7ffee57be498,
1899:MPT: existflag=0x7ffee57be222)
1899:MPT: at /glade/p/cesmdata/cseg/PROGS/build/28560/esmf-8.2.0b23/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
The full log can be seen in /glade/scratch/turuncu/SMS_D_Ln9_Vnuopc.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCnudged.cheyenne_intel.cam-outfrq9s_refined_camchem.20220406_162722_xutg84/run/cesm.log.3676318.chadmin1.ib0.cheyenne.ucar.edu.220406-192524
@mvertens Let me know if you want me to run more tests. How do you want to proceed with this PR?
@fischer-ncar @jedwards4b - are these expected fails for beta08? I think it's fine to proceed with accepting and merging these PRs - but I wanted to verify this first.
For cesm2_3_beta08, these two tests passed:
DAE_N2_D_Lh12_Vnuopc.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_intel.clm-DA_multidrv
SMS_D_Ln9_Vnuopc.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCnudged.cheyenne_intel.cam-outfrq9s_refined_camchem
These two tests failed:
ERP_D_Ln9_Vnuopc.C48_C48_mg17.QPC6.cheyenne_intel.cam-outfrq9s
ERP_D_Ln9_Vnuopc.f09_f09_mg17.FSD.cheyenne_intel.cam-outfrq9s_contrail
@mvertens Please don't merge. Adding comp_present conditionals appears to resolve the issue w/ the ATM-WAV configuration, so I may want to make further changes to this PR branch.
@mvertens This is ready for any final testing on your end. The additional checks for the presence of components allow me to run ATM-WAV only coupling for UWM. I ran all tests for UWM and all baselines passed.
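For illustration only, here is a minimal, standalone Fortran sketch of the kind of component-presence guard described above. The names (comp_present, advertise_fld, compatm, compwav) and the field strings are hypothetical placeholders, not the actual CMEPS code.

program presence_guard_sketch
  ! Illustrative sketch only; names and field strings are hypothetical.
  implicit none
  integer, parameter :: compatm = 1, compocn = 2, compwav = 3, ncomps = 3
  logical :: comp_present(ncomps)

  ! In an ATM-WAV only configuration the ocean is absent.
  comp_present = [.true., .false., .true.]

  ! Advertise a field only when both its source and destination components
  ! are present, so absent components never enter the exchange field list.
  if (comp_present(compatm) .and. comp_present(compwav)) then
     call advertise_fld('Sa_u10m', src=compatm, dst=compwav)    ! advertised
  end if
  if (comp_present(compatm) .and. comp_present(compocn)) then
     call advertise_fld('Faxa_swnet', src=compatm, dst=compocn) ! skipped here
  end if

contains

  subroutine advertise_fld(stdname, src, dst)
    character(len=*), intent(in) :: stdname
    integer,          intent(in) :: src, dst
    write(*,'(a,i0,a,i0)') 'advertising '//trim(stdname)//' from ', src, ' to ', dst
  end subroutine advertise_fld

end program presence_guard_sketch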
@uturuncoglu - are you comfortable with my merging this PR?
@mvertens It looks fine to me, since those errors were not related to this PR, but they are not expected and will need to be investigated in the near future.
@uturuncoglu - thank you. Actually, those failures are not errors but newly requested output from the mediator to the wav component. I ran these differences by @alperaltuntas today, and we are both comfortable with these new export answers.
Description of changes
Refactors esmFldsExchange_nems.F90 to use separate advertise and initialize phases and to check that a component is present before advertising a field to or from that component. Implements default src and dst mask values in place of the code currently in med_map_mod.F90.
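A minimal, standalone Fortran sketch of the two-phase structure described above (advertise first, attach mapping details later). The subroutine names, field name, map name, and mask arguments are simplified, hypothetical stand-ins rather than the actual esmFldsExchange_nems.F90 code.

program two_phase_sketch
  ! Illustrative sketch only; names are hypothetical.
  implicit none

  ! Drive both phases for a configuration in which the wave model is present.
  call flds_exchange('advertise',  wav_present=.true.)
  call flds_exchange('initialize', wav_present=.true.)

contains

  subroutine flds_exchange(phase, wav_present)
    character(len=*), intent(in) :: phase        ! 'advertise' or 'initialize'
    logical,          intent(in) :: wav_present

    if (trim(phase) == 'advertise') then
       ! Advertise phase: only declare which fields may be exchanged, and
       ! only for components that are actually present.
       if (wav_present) call advertise('Sw_z0')
    else
       ! Initialize phase: attach mapping information to fields that were
       ! advertised and connected, supplying default src/dst mask values.
       if (wav_present) call set_mapping('Sw_z0', mapname='mapbilnr', &
                                         srcmask=0, dstmask=0)
    end if
  end subroutine flds_exchange

  subroutine advertise(stdname)
    character(len=*), intent(in) :: stdname
    write(*,'(a)') 'advertise: '//trim(stdname)
  end subroutine advertise

  subroutine set_mapping(stdname, mapname, srcmask, dstmask)
    character(len=*), intent(in) :: stdname, mapname
    integer,          intent(in) :: srcmask, dstmask
    write(*,'(a,2(1x,i0))') 'map '//trim(stdname)//' via '//trim(mapname)//' masks:', srcmask, dstmask
  end subroutine set_mapping

end program two_phase_sketch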
Specific notes
Are changes expected to change answers? (specify if bfb, different at roundoff, more substantial)
No
Any User Interface Changes (namelist or namelist defaults changes)?
No
Testing performed
Testing performed if application target is CESM:
Testing performed if application target is UFS-coupled:
Testing performed if application target is UFS-HAFS:
Hashes used for testing: