ESCOMP / CAM

Community Atmosphere Model

ESMF regrid error in WACCM-X at 1 degree resolution on Derecho #902

Open npedatella opened 9 months ago

npedatella commented 9 months ago

What happened?

When running WACCM-X at 1 degree resolution on Derecho with CESM2.2, the model crashes due to an error in ESMF. The specific error in the CESM log file is:

edyn_esmf_update: error return from ESMF_FieldRegridStore for 3d mag2geo: rc= 6
ERROR: edyn_esmf_update: ESMF_FieldRegridStore for 3d mag2geo phi3d

The ESMF log file gives the following error:

20231010 151835.112 ERROR PET267 ESMF_FieldRegrid.F90:4329 ESMF_FieldRegridGetIwts Invalid argument - - can't currently regrid a grid that contains a DE of width less than 2
20231010 151835.113 ERROR PET267 ESMF_FieldRegrid.F90:3180 ESMF_FieldRegridStoreNX Invalid argument - Internal subroutine call returned Error
20231010 151835.113 ERROR PET267 ESMF_FieldRegrid.F90:1349 ESMF_FieldRegridStoreNX Invalid argument - Internal subroutine call returned Error
20231010 151835.113 ERROR PET267 ESMF_FieldRegrid.F90:974 ESMF_FieldRegridStoreNX Invalid argument - Internal subroutine call returned Error

What are the steps to reproduce the bug?

CESM2.2 case on Derecho with resolution f09_f09_mg17 and compset FXHIST

Example case is in /glade/derecho/scratch/nickp/tmp/test_wx_1deg/

What CAM tag were you using?

CESM2.2

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/nickp/tmp/test_wx_1deg/

Will you be addressing this bug yourself?

No

Extra info

No response

jedwards4b commented 9 months ago

@npedatella Can you try updating esmf to esmf/8.6.0b04 in cime/config/cesm/machines/config_machines.xml, do a clean build, and let me know if you still get the error?

npedatella commented 9 months ago

I updated esmf to esmf/8.6.0b04. I get a similar error, though the ESMF log file is slightly different (case /glade/derecho/scratch/nickp/tmp/test_wx_1deg.002):

20231011 112737.111 ERROR PET277 ESMF_FieldRegrid.F90:4404 checkGrid Invalid argument - some types of regridding (e.g. bilinear) are not supported on Grids that contain a DE of width 1.
20231011 112737.112 ERROR PET277 ESMF_FieldRegrid.F90:3191 b_or_p_GridToMesh Invalid argument - Internal subroutine call returned Error
20231011 112737.112 ERROR PET277 ESMF_FieldRegrid.F90:1350 getMeshWithNodesOnFieldLoc Invalid argument - Internal subroutine call returned Error
20231011 112737.112 ERROR PET277 ESMF_FieldRegrid.F90:976 ESMF_FieldRegridStoreNX Invalid argument - Internal subroutine call returned Error

oehmke commented 9 months ago

@npedatella The issue is that you're dividing an ESMF Grid so finely across processors that some DEs (processors) end up with less than 1 complete cell along some dimension. For some types of regridding, ESMF has the constraint that a DE can't hold only part of a Grid cell. This often occurs when you create a Grid and distribute it along just one dimension (e.g. only dividing it along the longitude). Can you check whether you're doing that by looking at the ESMF_GridCreate() call? If so, dividing it along both dimensions will help. (As a quick fix, running on fewer processors will help as well, but I'm not sure if you'd want to do that.)

(BTW, getting rid of this constraint is on my todo list, but I haven't had a chance to get to it yet.)
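To make the width constraint described above concrete, here is a minimal Python sketch that counts how many DEs end up too narrow when one grid dimension is split across a given number of DEs. The 80-point dimension and the even block split are assumptions chosen for illustration; they are not the actual WACCM-X mag grid size or ESMF's own distribution algorithm. The minimum width of 2 follows the "DE of width less than 2" message in the ESMF log.

```python
def block_widths(n_cells, n_des):
    """Split n_cells into n_des contiguous blocks, as evenly as possible.

    Assumption: a plain even block split, used only for illustration.
    """
    base, extra = divmod(n_cells, n_des)
    return [base + 1 if i < extra else base for i in range(n_des)]


def narrow_des(n_cells, n_des, min_width=2):
    """Return the indices of DEs whose block is narrower than min_width."""
    widths = block_widths(n_cells, n_des)
    return [i for i, w in enumerate(widths) if w < min_width]


if __name__ == "__main__":
    n_lon = 80  # hypothetical dimension size, not taken from the model

    # All 64 DEs along one dimension: many blocks are only 1 point wide.
    print(len(narrow_des(n_lon, 64)), "narrow DEs with a 64 x 1 layout")

    # Same 64 DEs arranged 8 x 8: only 8 DEs share this dimension.
    print(len(narrow_des(n_lon, 8)), "narrow DEs with an 8 x 8 layout")
```

Spreading the same number of DEs over both dimensions reduces the number of blocks along each one, which is why a two-dimensional decomposition avoids the narrow blocks.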

jedwards4b commented 9 months ago

@npedatella Is there a strong reason to use the mct coupler? Can you try with the nuopc driver?

npedatella commented 9 months ago

@oehmke I tried running with fewer processors (128, i.e., one node) and still had the same problem.

npedatella commented 9 months ago

@jedwards4b I think the mct coupler is the default setting, which is why it is being used. I tried changing to nuopc (xmlchange COMP_INTERFACE=nuopc), but I am unable to run the setup script or build the model.

jedwards4b commented 9 months ago

Yes, I understand. At this point I am suggesting that you move to cesm2.3.x, where this case works, unless you want to backport the changes in cam to 2.2.

npedatella commented 9 months ago

@jedwards4b OK. Can you recommend a version that should be used going forward?

jedwards4b commented 9 months ago

cesm2_3_beta15

cacraigucar commented 9 months ago

@npedatella - Can this issue be closed as "do not fix"?

jedwards4b commented 9 months ago

I have a meeting later this morning to discuss with @fvitt

fvitt commented 9 months ago

@npedatella For your f09 case in CESM2.2 with 256 mpi tasks, try this namelist setting:

 npr_yz = 32,8,8,32

This divides the mag grid across fewer MPI tasks in the latitude direction.

In CESM2.2, which predates the regrid refactoring in waccmx, the decomposition of the mag and oplus grids used the FV dycore grid decomposition settings. In CESM, after regrid refactoring, the mag and oplus grids are no longer tied to the FV dycore grid decomposition.

npedatella commented 9 months ago

@fvitt The f09 case works with these namelist settings if I use 256 MPI tasks. However, when I set up a new case it defaults to 512 tasks and the settings do not work. Should the default settings be changed?

fvitt commented 9 months ago

> @fvitt The f09 case works with these namelist settings if I use 256 mpi tasks. However, when I setup a new case it defaults to 512 tasks and the settings do not work. Should the default settings be changed?

For 512 tasks try: npr_yz = 32,16,16,32
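Both suggested settings can be sanity-checked against the task count before building. The sketch below assumes the usual FV dycore reading of npr_yz as (npr_y, npr_z, nprxy_x, nprxy_y), where each pair must multiply to the number of MPI tasks. The helper name check_npr_yz is hypothetical, and the check does not cover any other constraints the dycore or ESMF may impose.

```python
def check_npr_yz(npr_yz, ntasks):
    """Return a list of problems with an npr_yz setting for ntasks MPI tasks.

    Assumption: npr_yz = (npr_y, npr_z, nprxy_x, nprxy_y), and both the
    YZ and XY decompositions must use exactly ntasks subdomains.
    """
    npr_y, npr_z, nprxy_x, nprxy_y = npr_yz
    problems = []
    if npr_y * npr_z != ntasks:
        problems.append(
            f"YZ decomposition {npr_y} x {npr_z} = {npr_y * npr_z}, expected {ntasks}"
        )
    if nprxy_x * nprxy_y != ntasks:
        problems.append(
            f"XY decomposition {nprxy_x} x {nprxy_y} = {nprxy_x * nprxy_y}, expected {ntasks}"
        )
    return problems


if __name__ == "__main__":
    # Settings suggested in this thread.
    print(check_npr_yz((32, 8, 8, 32), 256))    # consistent with 256 tasks
    print(check_npr_yz((32, 16, 16, 32), 512))  # consistent with 512 tasks

    # The 256-task setting reused on a 512-task case is inconsistent.
    for problem in check_npr_yz((32, 8, 8, 32), 512):
        print(problem)
```

This matches the behavior reported above: the 256-task values do not carry over to a case that defaults to 512 tasks, so npr_yz has to change whenever the task count does.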