ESCOMP / CAM

Community Atmosphere Model

Runs with MPAS-A dycore and CAM7 physics fail - missing variables in inic files #995

Open gdicker1 opened 5 months ago

gdicker1 commented 5 months ago

What happened?

Runs of the F2000dev compset on MPAS-A grids fail. This seems to be due to the combination of the MPAS-A dycore and CAM7 (a.k.a. cam_dev) physics.

The last output from a case's atm.log:

  ----- done assigning dimensions from Registry.xml -----

 Allocating fields ...
  34 MB allocated for fields on this task
  4346 MB total allocated for fields across all tasks
  ----- done allocating fields -----

Last output from cesm.log (reorganized for 1 thread):

dec0360.hsn.de.hpc.ucar.edu 124: forrtl: severe (174): SIGSEGV, segmentation fault occurred
dec0360.hsn.de.hpc.ucar.edu 124: Image              PC                Routine            Line        Source
dec0360.hsn.de.hpc.ucar.edu 124: libpthread-2.31.s  000014BDC4E318C0  Unknown               Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: cesm.exe           0000000002CAE620  mpas_io_streams_m        1037  mpas_io_streams.F
dec0360.hsn.de.hpc.ucar.edu 124: cesm.exe           0000000002B40B6D  cam_mpas_subdrive        1154  cam_mpas_subdriver.F90
dec0360.hsn.de.hpc.ucar.edu 124: cesm.exe           0000000000643D5E  dyn_grid_mp_dyn_g         464  dyn_grid.F90
dec0360.hsn.de.hpc.ucar.edu 124: cesm.exe           0000000000592015  cam_comp_mp_cam_i         165  cam_comp.F90
dec0360.hsn.de.hpc.ucar.edu 124: cesm.exe           000000000057ACDD  atm_comp_nuopc_mp         635  atm_comp_nuopc.F90
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC973B40  _ZN5ESMCI6FTable1     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC973607  ESMCI_FTableCallE     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCCC5DF85  _ZN5ESMCI2VM5ente     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC974351  c_esmc_ftablecall     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCCEEE6E0  esmf_compmod_mp_e     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCD22F851  esmf_gridcompmod_     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCD60C9E0  nuopc_driver_mp_l     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCD629055  nuopc_driver_mp_i     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC973B40  _ZN5ESMCI6FTable1     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC973607  ESMCI_FTableCallE     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCCC5DF85  _ZN5ESMCI2VM5ente     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC974351  c_esmc_ftablecall     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCCEEE6E0  esmf_compmod_mp_e     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCD22F851  esmf_gridcompmod_     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCD60C9E0  nuopc_driver_mp_l     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCD628F3F  nuopc_driver_mp_i     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCD63DD80  nuopc_driver_mp_i     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC973B40  _ZN5ESMCI6FTable1     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC973607  ESMCI_FTableCallE     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCCC5DF85  _ZN5ESMCI2VM5ente     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCC974351  c_esmc_ftablecall     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124: libesmf.so         000014BDCCEEE6E0  esmf_compmod_mp_e     Unknown  Unknown
dec0360.hsn.de.hpc.ucar.edu 124:
dec0360.hsn.de.hpc.ucar.edu 124: Stack trace terminated abnormally.
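
For anyone reading their own cesm.log, the "reorganized for 1 thread" view above can be approximated by keeping only the lines written by one rank. This is a minimal sketch, assuming every line starts with `<hostname> <rank>: ` as in the excerpt; `rank_lines` is a hypothetical helper name, not part of CIME or CAM.

```shell
# Hypothetical helper: print only the lines written by one MPI rank.
# Assumes the "<hostname> <rank>: <message>" prefix seen in the excerpt above.
rank_lines() {
  rank="$1"
  logfile="$2"
  # The second whitespace-separated field is "<rank>:"; match it exactly
  # so that rank 124 does not also select rank 1240.
  awk -v r="${rank}:" '$2 == r' "$logfile"
}
```

Usage would look like `rank_lines 124 cesm.log`, where `cesm.log` stands in for the actual log file in the run directory.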

What are the steps to reproduce the bug?

The easiest is to create a case with --compset F2000dev to get cam_dev physics and --res mpasa120_mpasa120 to get the MPAS-A dycore. After setting up, building, and submitting the case the run will fail.

E.g. on Derecho:

./cime/scripts/create_newcase --case "${CASENAME}" --project "${PROJ}" --run-unsupported --compiler intel --res mpasa120_mpasa120 --compset F2000dev
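
The remaining "setting up, building, and submitting" steps can be sketched as below. This assumes the case directory was created at the path passed to `--case`; `build_and_submit` and the `RUN` variable are hypothetical names used only for illustration, while `case.setup`, `case.build`, and `case.submit` are the usual CIME case scripts.

```shell
# Sketch: run the standard CIME steps inside an existing case directory.
# build_and_submit is a hypothetical helper, not part of CIME.
# Set RUN=echo to print the commands instead of executing them.
build_and_submit() (
  # The parentheses run the body in a subshell, so the caller's
  # working directory is left unchanged.
  cd "$1" || exit 1
  ${RUN:-} ./case.setup &&
  ${RUN:-} ./case.build &&
  ${RUN:-} ./case.submit
)
```

For example, `build_and_submit "${CASENAME}"` after the `create_newcase` command above. With the configuration described in this issue, the submitted run is the step expected to fail.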

What CAM tag were you using?

cam6_3_148

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/gdicker/F2000dev_mpasa120_intel_1710435350

Will you be addressing this bug yourself?

Any CAM SE can do this

Extra info

No response

adamrher commented 5 months ago

Can you confirm whether this occurs with F2000climo a.k.a. CAM6 physics?

Are these runs with ./xmlchange DEBUG=TRUE?

Thanks.

gdicker1 commented 5 months ago

Hi @adamrher, I can confirm that F2000climo works. I was testing the RRTMGP changes in CAM with MPAS-A, and I was able to run with F2000climo.

I have not tried with DEBUG=TRUE yet. I will update when I do.

gdicker1 commented 5 months ago

Here's one thread's content in cesm.log from a run with DEBUG=true:

dec0314.hsn.de.hpc.ucar.edu 2:  ERROR:
dec0314.hsn.de.hpc.ucar.edu 2:  cam_mpas_subdriver::cam_mpas_read_static: FATAL: Failed to add 2 fields to stat
dec0314.hsn.de.hpc.ucar.edu 2:  ic input stream.
dec0314.hsn.de.hpc.ucar.edu 2: Image              PC                Routine            Line        Source
dec0314.hsn.de.hpc.ucar.edu 2: cesm.exe           000000000A913110  shr_abort_mod_mp_         114  shr_abort_mod.F90
dec0314.hsn.de.hpc.ucar.edu 2: cesm.exe           000000000A912F7A  shr_abort_mod_mp_          61  shr_abort_mod.F90
dec0314.hsn.de.hpc.ucar.edu 2: cesm.exe           0000000009DF56A2  cam_mpas_subdrive        1161  cam_mpas_subdriver.F90
dec0314.hsn.de.hpc.ucar.edu 2: cesm.exe           0000000000CE1FFF  dyn_grid_mp_setup         464  dyn_grid.F90
dec0314.hsn.de.hpc.ucar.edu 2: cesm.exe           0000000000CDC9B0  dyn_grid_mp_dyn_g         138  dyn_grid.F90
dec0314.hsn.de.hpc.ucar.edu 2: cesm.exe           0000000000957350  cam_comp_mp_cam_i         165  cam_comp.F90
dec0314.hsn.de.hpc.ucar.edu 2: cesm.exe           00000000008FEED9  atm_comp_nuopc_mp         635  atm_comp_nuopc.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90DDA9  callVFuncPtr             2167  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90CDE8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BFD9DB72  enter                    2318  ESMCI_VMKernel.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BFD87010  enter                    1216  ESMCI_VM.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90E18F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C03ED650  esmf_compmod_mp_e        1223  ESMF_Comp.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C0D7B8E5  esmf_gridcompmod_        1412  ESMF_GridComp.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C1821DFC  nuopc_driver_mp_l        2889  NUOPC_Driver.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C180A69F  nuopc_driver_mp_i        1992  NUOPC_Driver.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90DDA9  callVFuncPtr             2167  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90CDE8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BFD9DB72  enter                    2318  ESMCI_VMKernel.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BFD87010  enter                    1216  ESMCI_VM.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90E18F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C03ED650  esmf_compmod_mp_e        1223  ESMF_Comp.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C0D7B8E5  esmf_gridcompmod_        1412  ESMF_GridComp.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C1821DFC  nuopc_driver_mp_l        2889  NUOPC_Driver.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C180A44C  nuopc_driver_mp_i        1987  NUOPC_Driver.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4C17CF051  nuopc_driver_mp_i         487  NUOPC_Driver.F90
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90DDA9  callVFuncPtr             2167  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90CDE8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BFD9DB72  enter                    2318  ESMCI_VMKernel.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BFD87010  enter                    1216  ESMCI_VM.C
dec0314.hsn.de.hpc.ucar.edu 2: libesmf.so         000014C4BF90E18F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec0314.hsn.de.hpc.ucar.edu 2:
dec0314.hsn.de.hpc.ucar.edu 2: Stack trace terminated abnormally.
dec0314.hsn.de.hpc.ucar.edu 2: MPICH ERROR [Rank 2] [job id 5d63df0c-2c01-4c32-88d0-b8a50fe5fa22] [Thu Mar 14 11:26:20 2024] [dec0314] - Abort(1001) (rank 2  in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 2
dec0314.hsn.de.hpc.ucar.edu 2:
dec0314.hsn.de.hpc.ucar.edu 2: aborting job:
dec0314.hsn.de.hpc.ucar.edu 2: application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 2

From a run on Derecho within "/glade/derecho/scratch/gdicker/F2000dev_mpasa120_intel_dbg_1710436541"

briandobbins commented 5 months ago

Is this just a problem with the IC file? I've run this with my own analytic IC files and cam_dev physics before. I think it just needs those two missing fields (cell_gradient_coef_x and cell_gradient_coef_y).

mgduda commented 5 months ago

As a temporary workaround, if testing without the frontogenesis gravity wave drag (?) scheme is acceptable, setting use_gw_front = false in CAM's namelist might suffice. It looks like the cell_gradient_coef_x and cell_gradient_coef_y fields are only read if use_gw_front or use_gw_front_igw are true: https://github.com/ESCOMP/CAM/blob/cam6_3_148/src/dynamics/mpas/driver/cam_mpas_subdriver.F90#L1152-L1162 .

gdicker1 commented 5 months ago

Thanks @briandobbins and @mgduda for the tips.

> Is this just a problem with the IC file?

It might be. I think only "atm/cam/inic/mpas/mpasa60_L32_notopo_coords_c230707.nc" has cell_gradient_coef_{xy} variables from what I checked.
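
One way to repeat this check on other inic files is to scan the output of `ncdump -h` (the netCDF header dump utility) for the two variable names. The helper below is a rough sketch: `missing_gradient_coefs` is a hypothetical name, and it only assumes that variables are declared in CDL as `type name(dims) ;`.

```shell
# Hypothetical helper: read `ncdump -h` output on stdin and print each
# required variable that is NOT declared in the header.
missing_gradient_coefs() {
  header=$(cat)
  for var in cell_gradient_coef_x cell_gradient_coef_y; do
    # A CDL variable declaration looks like "double name(dim, ...) ;",
    # so search for the name preceded by whitespace and followed by "(".
    printf '%s\n' "$header" | grep -q "[[:space:]]${var}(" || echo "$var"
  done
}
```

Usage would look like `ncdump -h some_inic_file.nc | missing_gradient_coefs`, where the file name is a placeholder; empty output means both variables are declared.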

> ... setting use_gw_front = false in CAM's namelist might suffice....

I just tried a couple of these F2000dev MPAS-A runs with use_gw_front = .false. added to user_nl_cam, and they succeeded!
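
For reference, the workaround amounts to one extra line in the case's user_nl_cam. A minimal sketch, assuming it is run before the case is submitted; `disable_gw_front` is a hypothetical helper name. Per the earlier comment, the fields are also read when use_gw_front_igw is true, so that option may need the same treatment if it is enabled.

```shell
# Hypothetical helper: append the workaround to a case's user_nl_cam once.
# $1 is the case directory.
disable_gw_front() {
  nl="$1/user_nl_cam"
  # Skip the append if use_gw_front is already set in the file, to avoid
  # duplicate entries. Note this does not change an existing value.
  grep -q '^[[:space:]]*use_gw_front[[:space:]]*=' "$nl" 2>/dev/null ||
    echo 'use_gw_front = .false.' >> "$nl"
}
```

Usage would look like `disable_gw_front "${CASENAME}"`, with `${CASENAME}` pointing at the case directory.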

adamrher commented 5 months ago

> As a temporary workaround, if testing without the frontogenesis gravity wave drag (?) scheme is acceptable

This was off in CAM6, so it's not terrible to omit this process in the near term. But this should get fixed for production runs: our midlatitude jets and polar vortex are too strong, so the additional drag from turning the frontal scheme on moves the solution in the right direction.

This is less important at higher resolutions where these waves start to become resolved.

adamrher commented 5 months ago

@gdicker1 if this issue is just due to missing variables in the inic file when running the frontal scheme, should we close (or rename) this issue?

gdicker1 commented 5 months ago

If the issue isn't fixed, I'm not sure why it should be closed. Unless someone has regenerated the files already?

@adamrher I think the issue title was fine, but I changed it to "Runs with MPAS-A dycore and CAM7 physics fail - missing variables in inic files". If that still isn't what you imagined, I don't mind if the title changes again.

adamrher commented 5 months ago

@gdicker1 understood. You're right, the original name still conveyed this issue. I was just confused since folks have been running cam_dev with MPAS for a while now, but the issue is that our namelist_defaults point to a large number of inic files without the variables required to run cam_dev.

adamrher commented 3 weeks ago

Hi @gdicker1. I was looking through the issues and we don't have a general issue for bringing in L58/L93 support for MPAS. This issue is related, but it does not cover the entire effort, which now includes this issue: https://github.com/ESCOMP/CAM/issues/1102. I was going to open the general issue but wanted to check with you first.

Only mpasa120 and mpasa480 are supported in cam_development. So I was thinking the issue could just provide support for those two grids -- hi-res and var-res can be a separate issue that we can address after supporting the coarser grids. Thoughts?

gdicker1 commented 3 weeks ago

Hi @adamrher, thanks for checking. I think this sounds reasonable, especially to add other resolutions later.

Just to add some other thoughts: Other times this has come up, there wasn't agreement on what the level heights should be for L58 and L93 (but I think this has been resolved). There have also been concerns about the amount of space the (high-resolution) files could take up on CESM data servers, especially since we could have 3 versions of a similar grid (notopo, topo, and real-data).

briandobbins commented 3 weeks ago

Short term, let's get all the 120km cases done - space isn't much of a concern there, and since it's the workhorse resolution, and the one likely to be 'tested' the most, the value of having things work out of the box is big.

Longer term, for high-resolution cases, I've got some discussions going on with CISL about moving our input storage (and merging the EarthWorks & CESM datasets) on to new infrastructure that's got more, and scalable, space.

Cheers,
