Community Terrestrial Systems Model (includes the Community Land Model of CESM)

Failures in waccmx_offline test #550

Closed billsacks closed 4 years ago

billsacks commented 5 years ago

Brief summary of bug

On release-clm5.0, ERS_D_Ln9_P480x3.f19_g16.I2000Clm50SpGs.cheyenne_intel.clm-waccmx_offline is failing the restart test because some ROF fields (both rof2lnd and lnd2rof) differ between the base and restart runs.

General bug information

CTSM version you are using: release-clm5.0.09-43-gde4a134 (but it looks like the same error occurred in @ekluzek's testing of release-clm5.0.09)

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: Tests with the waccmx_offline testmod

Details of bug

I think the problem is that this test restarts in the middle of a ROF coupling interval, since ROF is set to couple every 3 hours.

I tried increasing the length of this test to 5 hours, to get a restart at 3 hours. However, this failed with a water balance error at time step 16.
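To illustrate the alignment argument, here is a minimal Python sketch. It assumes the ERS test writes its restart after `stop_n // 2 + 1` units (consistent with a 5-hour test restarting at 3 hours, as described above) and a 5-minute model time step; the step length and all function names are hypothetical and are not taken from the test configuration.

```python
# Sketch: does an ERS restart land on a ROF coupling boundary?
# Assumptions (not confirmed by this issue): the restart is written after
# stop_n // 2 + 1 units, and the model time step is 300 s.

ROF_COUPLING_SECONDS = 3 * 3600   # ROF couples every 3 hours (from the issue)
ASSUMED_STEP_SECONDS = 300        # hypothetical model time step


def ers_restart_point(stop_n: int) -> int:
    """Number of units (steps, hours, ...) run before the restart is written."""
    return stop_n // 2 + 1


def restart_is_aligned(restart_seconds: int, coupling_seconds: int) -> bool:
    """True if the restart falls exactly on a coupling boundary."""
    return restart_seconds % coupling_seconds == 0


# Ln9: nine time steps in total, restart counted in time steps.
ln9_restart_seconds = ers_restart_point(9) * ASSUMED_STEP_SECONDS
# Lh5: five hours in total, restart counted in hours.
lh5_restart_seconds = ers_restart_point(5) * 3600

print("Ln9 aligned:", restart_is_aligned(ln9_restart_seconds, ROF_COUPLING_SECONDS))
print("Lh5 aligned:", restart_is_aligned(lh5_restart_seconds, ROF_COUPLING_SECONDS))
```

Under these assumptions the Ln9 restart falls inside the first ROF coupling interval, while the Lh5 restart falls exactly on the 3-hour boundary. Any time step shorter than 36 minutes leads to the same conclusion for Ln9, since five steps would then cover less than 3 hours.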

Important output or errors that show the problem

Failure in ERS_D_Ln9_P480x3.f19_g16.I2000Clm50SpGs.cheyenne_intel.clm-waccmx_offline

These are the fields that differ between the base and restart runs, from cpl.hi:

 RMS x2l_Flrr_volr                    5.1452E-03            NORMALIZED  1.7545E+01
 RMS x2l_Flrr_volrmch                 5.0652E-03            NORMALIZED  3.9215E+01
 RMS r2x_Forr_rofl                    7.6880E-05            NORMALIZED  7.3214E+01
 RMS r2x_Forr_rofi                    1.4318E-03            NORMALIZED  2.5776E+02
 RMS r2x_Flrr_volr                    3.7171E-03            NORMALIZED  3.0539E+01
 RMS r2x_Flrr_volrmch                 3.6666E-03            NORMALIZED  6.8436E+01
 RMS x2r_Flrl_rofsur                  1.4240E-05            NORMALIZED  2.2272E+01
 RMS x2r_Flrl_rofgwl                  3.7447E-04            NORMALIZED  6.6007E+01
 RMS x2r_Flrl_rofsub                  5.3266E-05            NORMALIZED  8.4397E+00
 RMS x2r_Flrl_rofi                    3.7446E-04            NORMALIZED  6.6517E+01

Failure in ERS_D_Lh5_P480x3.f19_g16.I2000Clm50SpGs.cheyenne_intel.clm-waccmx_offline

381: WARNING:  water balance error  nstep=           16  local indexc=        26329
381:  errh2o=   1.333529328132217E-003
381: clm model is stopping - error is greater than 1e-5 (mm)
381: nstep                 =           16
381: errh2o                =   1.333529328132217E-003
381: forc_rain             =   0.000000000000000E+000
381: forc_snow             =   2.621327578972475E-003
381: total_plant_stored_h2o_col =   0.000000000000000E+000
381: endwb                 =    2939.21392352162
381: begwb                 =    2939.21403792467
381: qflx_evap_tot         =  -1.890526273460123E-005
381: qflx_irrig            =   0.000000000000000E+000
381: qflx_surf             =   0.000000000000000E+000
381: qflx_h2osfc_surf      =   0.000000000000000E+000
381: qflx_qrgwl            =   0.000000000000000E+000
381: qflx_drain            =   4.088165222360376E-003
381: qflx_drain_perched    =   0.000000000000000E+000
381: qflx_flood            =   0.000000000000000E+000
381: qflx_ice_runoff_snwcp =   0.000000000000000E+000
381: qflx_ice_runoff_xs    =   0.000000000000000E+000
381: qflx_glcice_dyn_water_flux =   0.000000000000000E+000
381: qflx_snwcp_discarded_ice =   0.000000000000000E+000
381: qflx_snwcp_discarded_liq =   0.000000000000000E+000
381: qflx_rootsoi_col(1:nlevsoil)  =   0.000000000000000E+000
381:  1.421355802127227E-009  3.793750179443353E-009  9.152129867992451E-010
381:  4.943154857210362E-009 -1.481497918064600E-006  8.185368947298590E-007
381:  5.760884802171176E-007  7.720118732749495E-008  2.137550711239529E-009
381: -2.382258233445312E-009 -5.645953750444636E-010 -7.672929666410538E-011
381: -4.088472451680095E-011 -5.456406648993166E-011 -7.223628298214421E-011
381: -9.247471395432471E-011 -1.152682409312011E-010 -1.406578126772176E-010
381:  0.000000000000000E+000
381: clm model is stopping
381: calling getglobalwrite with decomp_index=        26329  and clmlevel= column
381: local  column   index =        26329
381: ERROR: get_proc_bounds ERROR: Calling from inside  a threaded region

Note that, before this, there were water balance warnings up through time step 3, but then nothing between time steps 3 and 16.
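As a sanity check on the logged numbers, the reported errh2o can be reproduced from the other terms in the message. The sketch below assumes a simplified column water balance (change in storage minus inputs plus outputs) and assumes the logged flux terms are already accumulated over the time step, in mm; the grouping of terms into inputs and outputs is an assumption, and only the nonzero terms affect the result.

```python
# Sketch: recompute errh2o from the terms printed in the log above.
# Assumed balance: errh2o = (endwb - begwb) - (inputs - outputs),
# with every flux already expressed in mm per time step.

logged = {
    "begwb": 2939.21403792467,
    "endwb": 2939.21392352162,
    "forc_rain": 0.0,
    "forc_snow": 2.621327578972475e-03,
    "qflx_irrig": 0.0,
    "qflx_flood": 0.0,
    "qflx_evap_tot": -1.890526273460123e-05,
    "qflx_surf": 0.0,
    "qflx_h2osfc_surf": 0.0,
    "qflx_qrgwl": 0.0,
    "qflx_drain": 4.088165222360376e-03,
    "qflx_drain_perched": 0.0,
    "qflx_ice_runoff_snwcp": 0.0,
    "qflx_ice_runoff_xs": 0.0,
}

# Assumed grouping; the remaining logged terms are zero and omitted here.
INPUTS = ("forc_rain", "forc_snow", "qflx_irrig", "qflx_flood")
OUTPUTS = (
    "qflx_evap_tot", "qflx_surf", "qflx_h2osfc_surf", "qflx_qrgwl",
    "qflx_drain", "qflx_drain_perched",
    "qflx_ice_runoff_snwcp", "qflx_ice_runoff_xs",
)

LOGGED_ERRH2O = 1.333529328132217e-03
ERROR_THRESHOLD_MM = 1e-5  # "error is greater than 1e-5 (mm)" in the log

storage_change = logged["endwb"] - logged["begwb"]
net_flux = sum(logged[k] for k in INPUTS) - sum(logged[k] for k in OUTPUTS)
residual = storage_change - net_flux

print(f"recomputed errh2o = {residual:.6e}")
print(f"logged errh2o     = {LOGGED_ERRH2O:.6e}")
print("exceeds threshold:", abs(residual) > ERROR_THRESHOLD_MM)
```

The recomputed residual agrees with the logged errh2o to within the rounding of the printed begwb and endwb values, which suggests the imbalance comes from snowfall, evaporation, and drainage not matching the change in column water storage.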

billsacks commented 5 years ago

There are really two different issues here (the restart mismatch in the ROF fields and the water balance error), which perhaps should be split and dealt with separately.

@ekluzek I'm not sure what you intended in terms of the restart time for this test: did you deliberately want a restart very shortly into the test? If so, a solution could be to set the ROF coupling frequency for this test to match ATM_NCPL. But we should still determine the cause of the water balance error.

ekluzek commented 5 years ago

Fixed in release-clm5.0.12

billsacks commented 4 years ago

Reopening: It looks like this was fixed on the release branch (in release-clm5.0.12) but the fix hasn't made it to master.

billsacks commented 4 years ago

This fix has come to master: ERS_D_Ln9_P480x3.f19_g16.I2000Clm50SpGs.cheyenne_intel.clm-waccmx_offline has been replaced with ERS_D_Ld5_P480x3.f19_g16.I2000Clm50SpGs.cheyenne_intel.clm-waccmx_offline.