ESCOMP / CMEPS

NUOPC Community Mediator for Earth Prediction Systems
https://escomp.github.io/CMEPS/

Possible issue with exchange grid?: Zero values sent to atm in grid cells with ifrac = 1 #510

Closed billsacks closed 1 month ago

billsacks commented 1 month ago

I should say up-front that I'm not sure if this is actually a bug with aoflux_grid=xgrid, but it is a difference between aoflux_grid=xgrid and aoflux_grid=ogrid, and it seems like it might be contributing to a model crash. So I'm opening this to start a discussion on whether this might be a bug.

I'll start with the conclusion before diving into details: It seems like, with aoflux_grid=ogrid, we get 0 values sent to atm from the atm-ocn flux calculations over land points, but NOT over 100% ice points; but with aoflux_grid=xgrid, we get 0 values sent to atm from the atm-ocn flux calculations over any points with (lnd+ice)=1 (i.e., any points with ofrac=0). This MAY be contributing to a crash in some configurations with xgrid, though I'm not sure yet if that's the cause. @jedwards4b @mvertens @uturuncoglu @DeniseWorthen - do any of you have a sense of what's right here?

This started with an investigation of a failure in the test SMS_D_Ln9.f09_f09_mg17.FCnudged_GC.derecho_intel.cam-outfrq9s. I'm running this from cesm3_0_alpha03c but with CMEPS updated to cmeps1.0.18 (which is needed to fix some other issues with the exchange grid). Out-of-the-box, this test passes. However, when adding the following to user_nl_cpl – aoflux_grid = "xgrid" – this test fails with a divide by zero in drydep_mod:

dec2455.hsn.de.hpc.ucar.edu 484: forrtl: error (73): floating divide by zero
dec2455.hsn.de.hpc.ucar.edu 484: Image              PC                Routine            Line        Source
dec2455.hsn.de.hpc.ucar.edu 484: libpthread-2.31.s  0000149FBD5108C0  Unknown               Unknown  Unknown
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           000000000315AECC  drydep_mod_mp_adu        4108  drydep_mod.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           000000000313FCB0  drydep_mod_mp_dep        1774  drydep_mod.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           000000000311CE7A  drydep_mod_mp_do_         316  drydep_mod.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           0000000002C89ECB  chemistry_mp_chem        3514  chemistry.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           00000000012ACFAC  physpkg_mp_tphysa        1604  physpkg.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           00000000012A7F77  physpkg_mp_phys_r        1284  physpkg.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           00000000009A9FA1  cam_comp_mp_cam_r         290  cam_comp.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           0000000000959FFF  atm_comp_nuopc_mp        1136  atm_comp_nuopc.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533D86  execute                   377  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533932  execute                   563  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC653352A  c_esmc_methodtabl         317  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC68D388B  esmf_attachmethod        1287  ESMF_AttachMethods.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC80493FD  Unknown               Unknown  Unknown
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F279  callVFuncPtr             2167  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618E2B8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC662BAB2  enter                    2501  ESMCI_VMKernel.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6614346  enter                    1216  ESMCI_VM.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F65F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6C134FC  esmf_compmod_mp_e        1252  ESMF_Comp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC74E3D6A  esmf_gridcompmod_        1903  ESMF_GridComp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F85B75  nuopc_driver_mp_r        3694  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F8BDFA  nuopc_driver_mp_e        3940  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533D86  execute                   377  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533932  execute                   563  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC653352A  c_esmc_methodtabl         317  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC68D388B  esmf_attachmethod        1287  ESMF_AttachMethods.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F83B76  nuopc_driver_mp_r        3615  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F279  callVFuncPtr             2167  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618E2B8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC662BAB2  enter                    2501  ESMCI_VMKernel.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6614346  enter                    1216  ESMCI_VM.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F65F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6C134FC  esmf_compmod_mp_e        1252  ESMF_Comp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC74E3D6A  esmf_gridcompmod_        1903  ESMF_GridComp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F85B75  nuopc_driver_mp_r        3694  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F8BDFA  nuopc_driver_mp_e        3940  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533D86  execute                   377  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533932  execute                   563  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC653352A  c_esmc_methodtabl         317  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC68D388B  esmf_attachmethod        1287  ESMF_AttachMethods.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F83B76  nuopc_driver_mp_r        3615  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F279  callVFuncPtr             2167  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618E2B8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC662BAB2  enter                    2501  ESMCI_VMKernel.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6614346  enter                    1216  ESMCI_VM.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F65F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6C134FC  esmf_compmod_mp_e        1252  ESMF_Comp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC74E3D6A  esmf_gridcompmod_        1903  ESMF_GridComp.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           000000000044E467  MAIN__                    141  esmApp.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           0000000000425D7D  Unknown               Unknown  Unknown
dec2455.hsn.de.hpc.ucar.edu 484: libc-2.31.so       0000149FB8E0229D  __libc_start_main     Unknown  Unknown
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           0000000000425CAA  Unknown               Unknown  Unknown

which is here:

    ! surface resistance for particle
    RS   = 1.e0_f8 / (E0 * USTAR * (EB + EIM + EIN) * R1 )
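
To illustrate why a zero ustar alone would be enough to trigger this error (it has not been confirmed as the actual cause), here is a minimal Python sketch that mirrors the expression above. The helper `surface_resistance` and all numeric values are hypothetical placeholders, not code or data from drydep_mod.F90.

```python
def surface_resistance(e0, ustar, eb, eim, ein, r1):
    """Mirror of RS = 1 / (E0 * USTAR * (EB + EIM + EIN) * R1).

    Hypothetical helper for illustration only; it is not taken from
    drydep_mod.F90, and the argument values used below are placeholders.
    """
    denom = e0 * ustar * (eb + eim + ein) * r1
    if denom == 0.0:
        # Stand-in for the floating-point trap reported by the debug build.
        raise ZeroDivisionError("denominator is zero (e.g. ustar == 0)")
    return 1.0 / denom


# Placeholder inputs: a non-zero ustar gives a finite resistance ...
rs_ok = surface_resistance(e0=3.0, ustar=0.25, eb=0.5, eim=0.25, ein=0.25, r1=1.0)

# ... while ustar == 0 zeroes the whole product, whatever the other terms are.
try:
    surface_resistance(e0=3.0, ustar=0.0, eb=0.5, eim=0.25, ein=0.25, r1=1.0)
    zero_ustar_fails = False
except ZeroDivisionError:
    zero_ustar_fails = True
```

The same failure would occur if E0, R1, or the sum (EB + EIM + EIN) were zero, so this sketch only shows that ustar = 0 is sufficient, not that it is the term responsible here.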

As a side-note: This test also fails in non-debug mode, though more cryptically (so I'm not positive what's going on here):

dec0891.hsn.de.hpc.ucar.edu 12: MPICH ERROR [Rank 12] [job id 8a79c3d6-fcc1-4b88-bf53-101d7de9bc46] [Mon Sep 30 15:57:48 2024] [dec0891] - Abort(3218063) (rank 12 in comm 0): Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
dec0891.hsn.de.hpc.ucar.edu 12: PMPI_Alltoallv(386)............: MPI_Alltoallv(sbuf=0x14ac2a7e3800, scnts=0x14ac2dcab580, sdispls=0x14ac2dcaab00, dtype=0x4c000829, rbuf=0x14ac2afca840, rcnts=0x14ac2dcaa080, rdispls=0x14ac2dca9600, datatype=dtype=0x4c000829, comm=comm=0xc400000f) failed
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_CRAY_Alltoallv(1187)......:
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_Waitall(167)..............:
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_Waitall_impl(51)..........:
dec0891.hsn.de.hpc.ucar.edu 12: MPID_Progress_wait(201)........:
dec0891.hsn.de.hpc.ucar.edu 12: MPIDI_Progress_test(97)........:
dec0891.hsn.de.hpc.ucar.edu 12: MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)
dec0891.hsn.de.hpc.ucar.edu 12:
dec0891.hsn.de.hpc.ucar.edu 12: aborting job:
dec0891.hsn.de.hpc.ucar.edu 12: Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
dec0891.hsn.de.hpc.ucar.edu 12: PMPI_Alltoallv(386)............: MPI_Alltoallv(sbuf=0x14ac2a7e3800, scnts=0x14ac2dcab580, sdispls=0x14ac2dcaab00, dtype=0x4c000829, rbuf=0x14ac2afca840, rcnts=0x14ac2dcaa080, rdispls=0x14ac2dca9600, datatype=dtype=0x4c000829, comm=comm=0xc400000f) failed
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_CRAY_Alltoallv(1187)......:
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_Waitall(167)..............:
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_Waitall_impl(51)..........:
dec0891.hsn.de.hpc.ucar.edu 12: MPID_Progress_wait(201)........:
dec0891.hsn.de.hpc.ucar.edu 12: MPIDI_Progress_test(97)........:
dec0891.hsn.de.hpc.ucar.edu 12: MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)

This test passes with aoflux_grid = "ogrid", and a similar test without chemistry – SMS_D_Ln9.f09_f09_mg17.FHIST.derecho_intel.cam-outfrq9s – passes even with xgrid.

I haven't dug deeply into the relevant CAM code, but I decided to look for differences in variables sent to the atmosphere in the test SMS_D_Ln9.f09_f09_mg17.FHIST.derecho_intel.cam-outfrq9s with xgrid vs. ogrid. I started by looking at ustar, since that's one of the terms in the line with divide-by-zero (though note that I have not confirmed that this is the term causing the problem).

Here is ustar sent to atm using xgrid:

[Image: Pasted image 20240930175002]

And here using ogrid:

[Image: Pasted image 20240930175021]

Both have 0 values over land, but note that the xgrid run has additional 0 values in the Arctic Ocean and near Antarctica. I see these same extra 0 values in Med_aoflux_atm_So_ustar, but not in Med_aoflux_ocn_So_ustar. I spot-checked some other fields from the atm-ocn flux calculation and see the same pattern there.

Here is a map showing where ofrac is essentially 0: [Image: Pasted image 20241001163531]

And here are points where ifrac is essentially 1: [Image: Pasted image 20241001163604]

By eye, these seem to match up very well with the grid cells that have 0 values in the run with aoflux_grid=xgrid. This leads me to the conclusion at the top of the issue.
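
The by-eye comparison could be made quantitative with something like the sketch below. It is only an illustration: the function name `compare_zero_mask`, the tolerance used for "essentially 0" and "essentially 1", and the five sample cells are all assumptions, and real fields would first have to be read from the mediator output and flattened.

```python
def compare_zero_mask(ustar, ofrac, ifrac, tol=1.0e-6):
    """Count how zero-valued ustar cells line up with the fraction masks.

    All three inputs are flat sequences over the same grid cells.
    "Essentially 0" and "essentially 1" are taken to mean within `tol`,
    which is an assumed threshold, not one stated in the issue.
    """
    counts = {
        "zero_where_ofrac_0": 0,     # zero ustar, no ocean fraction
        "zero_where_ifrac_1": 0,     # zero ustar, fully ice covered
        "zero_where_ofrac_gt_0": 0,  # zero ustar despite some open ocean
        "nonzero_where_ofrac_0": 0,  # ofrac is 0 but ustar is not
    }
    for u, o, i in zip(ustar, ofrac, ifrac):
        zero = abs(u) <= tol
        no_ocean = abs(o) <= tol
        full_ice = abs(i - 1.0) <= tol
        if zero and no_ocean:
            counts["zero_where_ofrac_0"] += 1
        if zero and full_ice:
            counts["zero_where_ifrac_1"] += 1
        if zero and not no_ocean:
            counts["zero_where_ofrac_gt_0"] += 1
        if no_ocean and not zero:
            counts["nonzero_where_ofrac_0"] += 1
    return counts


# Made-up cells: land, full ice, open ocean, full ice, partial ice.
ustar = [0.0, 0.0, 0.31, 0.0, 0.27]
ofrac = [0.0, 0.0, 1.0, 0.0, 0.4]
ifrac = [0.0, 1.0, 0.0, 1.0, 0.6]

result = compare_zero_mask(ustar, ofrac, ifrac)
```

If the zero cells really coincide with ofrac = 0, the last two counters stay at 0; any non-zero value there would point to cells that do not fit the pattern.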

billsacks commented 1 month ago

After some consultation with @mvertens and additional testing, we feel that xgrid is working as intended here. The 0 values over sea ice grid cells also appear in runs with aoflux_grid = "ogrid" when the atmosphere and ocean are running on different grids. CAM crashes in SMS_D_Ln9.f09_g17.FCnudged_GC.derecho_intel.cam-outfrq9s with aoflux_grid = "ogrid" in the same way as I noted above for SMS_D_Ln9.f09_f09_mg17.FCnudged_GC.derecho_intel.cam-outfrq9s with "xgrid". I believe this is a CAM issue, unless CAM wants to push for changes in the long-standing behavior of the mediator in this respect (which could be a reasonable solution). So I have moved this to a CAM issue:

https://github.com/ESCOMP/CAM/issues/1172