Closed: andrewgettelman closed this issue 2 years ago.
edsclrm is CLUBB's array of scalars diffused by CLUBB's eddy diffusivity. windm is CLUBB's representation of the horizontal wind components. ("m" in CLUBB-speak denotes "grid-mean." So "windm" refers to the grid-mean values of u and v.) I am guessing that the wind is the problem, not the eddy scalars, which are chemical species, etc.
However, if the model crashes within 10 time steps, even in aquaplanet mode, when it runs with 120-km resolution, then perhaps a variable is not initialized properly. Maybe running with floating point trapping turned on could catch the first NaN, which might lead us to an initialization error.
@vlarson The model works for CAM-MPAS at 120km resolution with ZM2 (with zm_conv_intr.F90 and zm_conv.F90 replaced, as shared by Adam). However, that solution does not work for the 60km aquaplanet run (Bill tested here) or the 60-3km full-topography one (I did here); both fail with the same error message.
Here are the first NaN values printed out in my cesm log file:
edsclrm # 9 = NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.
I wonder if setting the edsclrm values to zeros would just cause other issues, with the model crashing at the same time step.
The code blocks for the functions windm_edsclrm_rhs, windm_edsclrm_lhs, and windm_edsclrm_solve, and for the clubb_at_least_debug_level check, were highlighted by Adam here: https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L464-L524
Sorry if I missed it, but is the crash in a 3km column and is that column at or near orography?
edsclrm # 9 = NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.
Is edsclrm # 9, 10, 11, or 12 initialized to a reasonable value?
Turning on floating point trapping might catch an uninitialized variable if there is one.
Is the dynamics running at all? Or does this failure happen before the dynamics is run?
@swrneale : we can get the crash in an aquaplanet model, so no land or topography. We have not located the point and level where it is happening. @xhuang-ncar will need some guidance on that (I don't really know how to pull out a column). Thanks!
Could you send the location of your log files?
@JulioTBacmeister Sure. The log file is here: https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.660137.chadmin1.ib0.cheyenne.ucar.edu.210925-034021
Is the dynamics running at all? Or does this failure happen before the dynamics is run?
Yes, it was running. It crashed after 16 time steps. Also, I am using 120s as the dtime given the refined 3km resolution.
edsclrm # 9 = NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.
Is edsclrm # 9, 10, 11, or 12 initialized to a reasonable value?
Turning on floating point trapping might catch an uninitialized variable if there is one.
Not sure about that. How should that be initialized normally? Let me turn on the debug mode and also print out the input variables of the windm_edsclrm_solve function to the log file.
Thanks Xingying. This is a question for everyone. The log file shows error messages that clearly originate in Adam's blocked-out portions of 'advance_windm_edsclrm', yet the messages in the log file don't advance through the entire if ( err_code == clubb_fatal_error ) then ... endif block. They are always interrupted by what appears to be a new entry to the error block, e.g.:
776: up2 = 0.351420540093032 0.297728678007049 0.266079981897943
776: 0.247924301489710 0.234454106712590 0.222438014461757
776: 0.210447703011471 0.199152298849457 0.192144528317254
776: 0.195549715651495 0.214431704931684 0.244532312820753
800: Fatal error solving for eddsclrm
800: Error in advance_windm_edsclrm
800: Intent(in)
800: dt = 40.0000000000000
800: wm_zt = 0.000000000000000E+000 9.109764803756008E-005
800: 3.344162123594033E-004 6.675431334319516E-004 1.065847183270905E-003
So here, as up2 is being written out, a new error stream suddenly starts. Does this mean the code is failing on multiple processes at the same time? If so, can the error messages be forced to go through the entire block so that we get a clear idea of the profiles going into advance_windm_edsclrm?
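One way to force each rank's messages through the entire block is to assemble the whole diagnostic into a buffer and emit it with a single write, so output from another rank cannot splice in mid-block. A minimal sketch of the idea in Python (the rank number and field names are illustrative, not CLUBB's actual code, which would do this with a Fortran character buffer):

```python
import io

def emit_error_block(rank, fields):
    """Build the whole multi-line diagnostic in memory, then return it
    as one string so it can be written atomically.  Many small writes,
    by contrast, can interleave with other ranks' output."""
    buf = io.StringIO()
    buf.write(f"{rank}: Fatal error solving for eddsclrm\n")
    buf.write(f"{rank}: Error in advance_windm_edsclrm\n")
    for name, values in fields.items():
        vals = " ".join(f"{v:.6E}" for v in values)
        buf.write(f"{rank}: {name} = {vals}\n")
    return buf.getvalue()

# One atomic payload per rank; a real code would hand this to a single write().
block = emit_error_block(776, {"dt": [40.0], "wm_zt": [0.0, 9.109764803756008e-05]})
```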
@JulioTBacmeister Looking through the log, I am only seeing NaN's for processor 776. However, I am finding that the entire write statement code block is being written out for 776, because its last entries in the cesm.log are:
776: wpedsclrp = 0.000000000000000E+000 1.038482017559906E-005
776: 9.159381691910688E-006 7.466857448689597E-006 7.067816049197199E-006
776: 8.285625416776367E-006 1.226171921624444E-005 1.963394231215161E-005
776: 3.335552805102376E-005 4.535006155606812E-005 4.986023016470998E-005
776: 5.817265387986108E-005 6.110989684521623E-005 2.089284626521818E-005
776: 2.964944981635684E-006 2.922172823855216E-007 2.468995295154835E-008
776: 1.433156461244229E-008 3.468759159148850E-008 2.422783580467906E-008
...
where wpedsclrp is the last entry in that code block:
https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L555-L593
776 isn't the only processor with "Error in advance_windm_edsclrm," but it is the only one with NaNs.
If it were me, I'd probably write out all the input/output variables to the subroutine calls windm_edsclrm_rhs, windm_edsclrm_lhs, and windm_edsclrm_solve. I'd first try conditioning these write statements on if ( err_code == clubb_fatal_error ) then right here:
https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L520-L524
Lastly, if you set ./xmlchange DEBUG=TRUE, does that turn on floating point trapping as suggested by Vince (@cacraigucar)? I presume this needs to be run on Cheyenne (because it's such a large grid), and so I think only the Intel compiler is used.
@adamrher Thanks for the suggestion. I will give it a try. Also, after setting DEBUG to TRUE as Vince suggested here, the run crashed with an MPT ERROR, as in this log file (https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.864693.chadmin1.ib0.cheyenne.ucar.edu.211004-224404). How should I interpret this? (It is on Cheyenne with the Intel compiler.)
Sorry if I missed it, but is the crash in a 3km column and is that column at or near orography?
Both the 60km (aquaplanet) and 60-3km (full topography) runs crashed with this CLUBB error when using L58. Can we locate those NaN values?
I am also trying to figure out how it works for the CAM-SE at 25km (setting up a test for that).
@adamrher Thanks for the suggestion. I will give it a try. Also, after setting DEBUG to TRUE as Vince suggested here, the run crashed with an MPT ERROR, as in this log file (https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.864693.chadmin1.ib0.cheyenne.ucar.edu.211004-224404). How should I interpret this?
The line of code that contained the first FPE is here:
825:MPT: #1 0x00002b55ec5dc306 in mpi_sgi_system (
825:MPT: #2 MPI_SGI_stacktraceback (
825:MPT: header=header@entry=0x7ffeb377a050 "MPT ERROR: Rank 825(g:825) received signal SIGFPE(8).\n\tProcess ID: 64229, Host: r5i4n31, Program: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/bld/cesm.exe\n\tMPT Version: HPE MPT 2.22 03"...) at sig.c:340
825:MPT: #3 0x00002b55ec5dc4ff in first_arriver_handler (signo=signo@entry=8,
825:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b55f6c80080) at sig.c:489
825:MPT: #4 0x00002b55ec5dc793 in slave_sig_handler (signo=8, siginfo=<optimized out>,
825:MPT: extra=<optimized out>) at sig.c:565
825:MPT: #5 <signal handler called>
825:MPT: #6 0x000000000a554001 in __libm_log_l9 ()
825:MPT: #7 0x0000000002ce9481 in mo_drydep::drydep_xactive (sfc_temp=...,
825:MPT: pressure_sfc=..., wind_speed=..., spec_hum=..., air_temp=...,
825:MPT: pressure_10m=..., rain=..., snow=..., solar_flux=..., dvel=..., dflx=...,
825:MPT: mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12, ocnfrc=..., icefrc=...,
825:MPT: beglandtype=7, endlandtype=8)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110
825:MPT: #8 0x0000000002cb913d in mo_drydep::drydep_fromlnd (ocnfrac=..., icefrac=...,
825:MPT: sfc_temp=..., pressure_sfc=..., wind_speed=..., spec_hum=...,
825:MPT: air_temp=..., pressure_10m=..., rain=..., snow=..., solar_flux=...,
825:MPT: dvelocity=..., dflx=...,
825:MPT: mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:210
825:MPT: #9 0x0000000002da7d99 in mo_gas_phase_chemdr::gas_phase_chemdr (lchnk=12,
825:MPT: ncol=16, imozart=10,
825:MPT: q=<error reading variable: value requires 244992 bytes, which is more than max-value-size>, phis=..., zm=..., zi=..., calday=1.0208333333333333, tfld=...,
825:MPT: pmid=..., pdel=..., pint=..., cldw=..., troplev=..., troplevchem=...,
825:MPT: ncldwtr=..., ufld=..., vfld=..., delt=120, ps=..., xactive_prates=.FALSE.,
825:MPT: fsds=..., ts=..., asdir=..., ocnfrac=..., icefrac=..., precc=...,
825:MPT: precl=..., snowhland=..., ghg_chem=.FALSE., latmapback=..., drydepflx=...,
825:MPT: wetdepflx=..., cflx=..., fire_sflx=<not associated>,
825:MPT: fire_ztop=<not associated>, nhx_nitrogen_flx=..., noy_nitrogen_flx=...,
825:MPT: qtend=<error reading variable: value requires 244992 bytes, which is more than max-value-size>, pbuf=0x2b75ed00a248)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_gas_phase_chemdr.F90:1063
825:MPT: #10 0x00000000025278a4 in chemistry::chem_timestep_tend (state=..., ptend=...,
825:MPT: cam_in=..., cam_out=..., dt=120, pbuf=0x2b75ed00a248, fh2o=...)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/chemistry.F90:1290
825:MPT: #11 0x0000000000ee1b7c in physpkg::tphysac (ztodt=120, cam_in=...,
825:MPT: cam_out=..., state=..., tend=..., pbuf=0x2b75ed00a248)
825:MPT: at /glade/work/xyhuang/CAM-1/src/physics/cam/physpkg.F90:1562
One question is whether this line of code has an FPE even when using standard CAM code that runs fine. If so, then it's a red herring. If not, then it would be interesting to know whether the problem is an uninitialized variable, or whether the FPE appears only after a few time steps.
I see. I can set up a quick test using the 32 levels with the DEBUG on to check that out.
@vlarson As tested, I did not notice an FPE when using the 32 levels, which run without any issue. Also, for the 58 levels, the FPE appears after 15 steps.
@adamrher I have the log file with all the input/output variables to the subroutine calls (windm_edsclrm_rhs, windm_edsclrm_lhs, windm_edsclrm_solve) (or on Cheyenne: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/run/cesm.log.874406.chadmin1.ib0.cheyenne.ucar.edu.211005-160912)
Any further ideas about something (any particular values) being abnormal here?
To summarize, we have two leads:
(1) when clubb diffuses scalars # 9-12, it gives them NaNs on task 776. However, the same clubb error messages are being triggered for other tasks as well, but the log ends before those other tasks print out their updated values of the scalars (which I suspect would show NaNs just like task 776).
(2) we are getting floating point exceptions with DEBUG=TRUE, that are not present in the 120km 58 level MPAS runs. Not sure what to make of this.
I'm working with Xingying to reverse engineer these NaNs in (1). And will update the git issue if/when we learn anything.
There was also an issue with giving MPAS the right topography file. This seems to fix some of the issues (or all) with the 60km (uniform) MPAS 58L according to @PeterHjortLauritzen
To summarize, we have two leads:
(1) when clubb diffuses scalars # 9-12, it gives them NaNs on task 776. However, the same clubb error messages are being triggered for other tasks as well, but the log ends before those other tasks print out their updated values of the scalars (which I suspect would show NaNs just like task 776).
I wonder what you'd find if you print out values of scalars 9-12 whenever they are negative, or NaN, or too large. Maybe that would lead to an initialization problem.
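That check could look something like the following sketch: scan each scalar profile and report the first level whose value is NaN, negative, or implausibly large (the threshold and the names here are illustrative, not CLUBB's):

```python
import math

def flag_suspect(name, column, big=1.0e10):
    """Return a message for the first level where a profile value is
    NaN, negative, or beyond a plausibility bound; None if clean."""
    for k, v in enumerate(column):
        if math.isnan(v) or v < 0.0 or abs(v) > big:
            return f"{name}: suspect value {v} at level {k}"
    return None

# A NaN at level 2 is reported; a clean profile returns None.
msg = flag_suspect("edsclrm#9", [0.1, 0.2, float("nan"), 0.3])
```

Dropping a call like this just before the solver would catch scalars 9-12 going bad one step earlier than the fatal-error block does.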
(2) we are getting floating point exceptions with DEBUG=TRUE, that are not present in the 120km 58 level MPAS runs. Not sure what to make of this.
If 120 km can run stably for a year without strange output in scalars 9-12, but 60 km crashes after 32 time steps, then I speculate that something is mis-configured or left uninitialized in the 60 km run.
There was also an issue with giving MPAS the right topography file. This seems to fix some of the issues (or all) with the 60km (uniform) MPAS 58L according to @PeterHjortLauritzen
That fix is only for the 60km with Topography. @skamaroc has been seeing this issue with the 60km aquaplanet as well, I just recreated the case.
@MiCurry are there NaN's in the cesm.log? If so, for which variable?
@adamrher It's the same as @xhuang-ncar's: edsclrm. @skamaroc's case is here: /glade/scratch/skamaroc/qpc6-60km-58L/run.
Thanks @MiCurry. Looks like there are also NaNs in wpedsclrp, which isn't surprising because it is derived from edsclrm. Seems like this might be a cheaper grid to debug, instead of using the 60-3km grid.
I've been paying attention to this code block:
! Decompose and back substitute for all eddy-scalar variables
call windm_edsclrm_solve( edsclr_dim, 0, & ! in
                          lhs, rhs,      & ! in/out
                          solution )       ! out

if ( clubb_at_least_debug_level( 0 ) ) then
  if ( err_code == clubb_fatal_error ) then
    write(fstderr,*) "Fatal error solving for eddsclrm"
  end if
end if

!----------------------------------------------------------------
! Update Eddy-diff. Passive Scalars
!----------------------------------------------------------------
edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)
I had Xingying print out lhs, rhs, and solution. I expected solution to have NaNs for edsclrm # 9, but to my surprise, solution did not have any NaNs, whereas edsclrm did. See:
I'd like to proceed with Vince's suggestion: write out the array if there are any NaNs. I would like to focus on lhs, rhs, solution and edsclrm first. And I'd like to comment out all the write statements under the if ( err_code == clubb_fatal_error ) then conditional, so that the logs are easier to read. Anyone have any other suggestions? @xhuang-ncar or @MiCurry, can you do this?
I had Xingying print out lhs, rhs, and solution. I expected solution to have NaNs for edsclrm # 9, but to my surprise, solution did not have any NaNs, whereas edsclrm did. See:
One notable thing is that in this example, edsclrm 9 has NaNs at all grid levels. Usually if CLUBB suffers a numerical instability, NaNs appear first at just a few grid levels. Maybe there is a memory error.
It might be worth trying to free up memory by halving the tasks per node:
./xmlchange MAX_TASKS_PER_NODE=16,MAX_MPITASKS_PER_NODE=16
@adamrher Sure, I can write out the array when there are any NaNs for lhs, rhs, solution and edsclrm first. Not sure how to do that for every variable though (if needed). Oh, also, I tried to free up memory and it did not work.
could you also write out gr%nz when there are NaN's for any of the arrays? This is just a double-check that there's no funny business with this expression:
edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)
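For what it's worth, one concrete form that "funny business" could take: if gr%nz were ever smaller than the declared vertical extent of edsclrm, the assignment would update only levels 1..gr%nz and leave everything above untouched, including stale NaNs from a previous step. A toy illustration of the slice semantics in Python (0-based indexing; the sizes are made up):

```python
nan = float("nan")

edsclrm = [nan] * 5                   # pretend the old contents are NaN
solution = [1.0, 2.0, 3.0, 4.0, 5.0]  # a clean solve, no NaNs anywhere

nz = 3                                # stand-in for a too-small gr%nz
edsclrm[:nz] = solution[:nz]          # analogue of edsclrm(1:gr%nz,:) = solution(1:gr%nz,:)

# Levels below nz are clean; levels at/above nz still hold the old NaNs,
# even though `solution` itself is NaN-free.  (NaN != NaN, so v == v tests non-NaN.)
clean_below = all(v == v for v in edsclrm[:nz])
stale_above = any(v != v for v in edsclrm[nz:])
```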
From Xingying:
I have tested for a month, and CAM-SE ne120pg3 works well with the 58 levels even without the updated ZM2. Does that mean this CLUBB error issue is unique to MPAS?
It's starting to seem that way. I would like to know a little more about the 60km MPAS aqua-planet fails that Miles is reporting. They seem to be the same as the 60-3km fails, but we haven't yet confirmed whether edsclrm is giving NaNs while solution is not, like it is for the 60-3km runs. I looked at his log and his errors come within a day or so, about 100 steps in. Since we've shown that ne120 (w/ zm1) and ne60 (w/ zm2) using 58 levels run fine for at least a month, that does suggest these errors are unique to MPAS.
could you also write out gr%nz when there are NaN's for any of the arrays? This is just a double-check that there's no funny business with this expression: edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)
Sure. I will add that together.
From Xingying:
I have tested for a month, and CAM-SE ne120pg3 works well with the 58 levels even without the updated ZM2. Does that mean this CLUBB error issue is unique to MPAS?
It's starting to seem that way. I would like to know a little more about the 60km MPAS aqua-planet fails that Miles is reporting. They seem to be the same as the 60-3km fails, but we haven't yet confirmed whether edsclrm is giving NaNs while solution is not, like it is for the 60-3km runs. I looked at his log and his errors come within a day or so, about 100 steps in. Since we've shown that ne120 (w/ zm1) and ne60 (w/ zm2) using 58 levels run fine for at least a month, that does suggest these errors are unique to MPAS.
@MiCurry Could you share the path of the ncdata for the 60km run with topography? Since that one works well with the 58 levels, I'd like to compare it against the ncdata for the 60-3km case I am using. I am also trying to set up another aqua-planet simulation at 60km, as Bill did previously, to double-check the NaN values. However, I am encountering an error: "ERROR: (shr_ncread_varDimNum) ERROR inq varid: xc". Any idea what causes this kind of error?
I had Xingying print out lhs, rhs, and solution. I expected solution to have NaNs for edsclrm # 9, but to my surprise, solution did not have any NaNs, whereas edsclrm did. See:
My apologies, the output is not correct here. I found an error in my code when trying to print out those values (for lhs, rhs, and solution).
@adamrher (and everyone). For updates: I am now getting the NaNs printed out for edsclrm, lhs, rhs and solution whenever they occur.
Here is what it looks like as a snapshot of the output in the log file:
The NaN error originates in the rhs values for # 9, 10, 11, and 12. It then drives solution and edsclrm to NaNs; the lhs is fine here. Here is the call that computes the rhs (the explicit portion of the eddy-scalar equation) in advance_windm_edsclrm_module.F90:
call windm_edsclrm_rhs( windm_edsclrm_scalar, dt, dummy_nu, Kmh_zm, &
                        edsclrm(:,i), edsclrm_forcing, &
                        rho_ds_zm, invrs_rho_ds_zt, &
                        l_imp_sfc_momentum_flux, wpedsclrp(1,i), &
                        rhs(:,i) )
phew ... I thought we were in bizarro world. But this makes sense. The next step is to backtrack to see if any of the intent(in) vars of windm_edsclrm_rhs are NaNs. This is my debug approach: keep moving back until you find the source of the NaNs.
I'm sorry this is taking so long. Spending weeks debugging the same issue is probably the least fun aspect of model development. But we'll get past this.
@adamrher Right, that makes sense. Oh, it does take too long on this thing, but thanks for helping through. I will see what I can do ...
Tracking back into the windm_edsclrm_rhs routine, I see that rho_ds_zm and invrs_rho_ds_zt have some NaN values as input too. How are those two variables calculated in CAM? I am thinking of a quick fix for those two variables to see if that's the cause.
rho_ds_zm, & ! Dry, static density on momentum levels [kg/m^3]
invrs_rho_ds_zt ! Inv. dry, static density at thermo. levels [m^3/kg]
Those density variables are based on CAM's pressure fields. If the density fields have NaNs, then the solution has probably gone bad long before reaching this time step.
@xhuang-ncar I would print out rho_ds_zt at the clubb_intr level, here:
https://github.com/ESCOMP/CAM/blob/3676c6ec1c8dfd19e20ea764c0226792574481f0/src/physics/cam/clubb_intr.F90#L2263
And condition the write statements on NaNs. I'd also print out the components state1%pdel(i,pver-k+1) and dz_g(pver-k+1). This will confirm that the NaNs are being carried into CLUBB via state%, and so must be generated further upstream in the CAM time-loop, as Vince suggests.
I see. Before doing that, I am curious whether we can reset those NaN values first, to check if that's the cause of the error.
This might not be that relevant based on what we have been discussing, but I'm going to forward a comment from @skamaroc:
I'm looking into the CLUBB error we see in the 58 level configuration. The error message we are seeing is generated after the call to the tridiagonal solve for the vertical mixing of scalars. LAPACK is used in the tridiagonal solve, and here is the code that sets the error condition after it is called:
select case( info )
case( :-1 )
  write(fstderr,*) trim( solve_type )// &
    " illegal value in argument", -info
  err_code = clubb_fatal_error
  solution = -999._core_rknd
case( 0 )
  ! Success!
  if ( lapack_isnan( ndim, nrhs, rhs ) ) then
    err_code = clubb_fatal_error
  end if
  solution = rhs
case( 1: )
  write(fstderr,*) trim( solve_type )//" singular matrix."
  err_code = clubb_fatal_error
  solution = -999._core_rknd
end select

return
end subroutine tridag_solve
Based on this code, it would appear that the LAPACK tridiagonal solve is returning NaNs in at least one column, because it would have printed out an error message here for the other two error conditions ("singular matrix" or "illegal value"). A step we could take is to instrument the NaN check to tell us where this is occurring, and perhaps print out the inputs to the tridiagonal solve for that column. This might lead to bounding those inputs in some (perhaps physical) manner upstream of the tridiagonal solve. We're down in the weeds here, and we would likely need help from Vince and his group to do this.
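For reference, the control flow Bill quotes — solve, then raise the fatal error only when the returned solution contains NaNs — can be sketched with a plain Thomas-algorithm solve standing in for the LAPACK call (an illustration of the logic, not CLUBB's actual tridag_solve):

```python
import math

def tridag_solve(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system (lower[0] and
    upper[-1] are unused).  Returns (solution, err); err mirrors the
    clubb_fatal_error path: it is set when the result contains NaNs."""
    n = len(diag)
    c, d = upper[:], rhs[:]
    c[0] = c[0] / diag[0]
    d[0] = d[0] / diag[0]
    for i in range(1, n):                    # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        if i < n - 1:
            c[i] = c[i] / denom
        d[i] = (d[i] - lower[i] * d[i - 1]) / denom
    for i in range(n - 2, -1, -1):           # back substitution
        d[i] -= c[i] * d[i + 1]
    err = any(math.isnan(x) for x in d)      # analogue of lapack_isnan
    return d, err
```

A NaN anywhere in the inputs rides silently through the elimination and only shows up in the final check, which is consistent with Bill's reading that the solver reported success while handing back NaNs.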
I also noticed CLUBB is using Crank-Nicolson in the eddy-mixing integration here. This is centered in time, and while second-order accurate in time, it can lead to some odd oscillatory behavior for large eddy viscosities (large relative to the timestep and vertical mesh spacing). I didn't see an option for weighting the time levels differently (i.e. biasing the implicit contribution, which, while only first-order in time, is better behaved for larger eddy viscosities). I do not know if large eddy viscosities are causing the problem. Any insights here?
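The oscillatory behavior described above can be seen on the scalar analogue of the diffusion step, dq/dt = -K q. With a theta-weighted implicit step, the Crank-Nicolson choice (theta = 1/2) has an amplification factor that turns negative once K*dt exceeds 2 — still bounded, but flipping sign each step — while the fully implicit choice (theta = 1) stays positive for any K*dt. A minimal sketch, not CLUBB's integrator:

```python
def theta_step(q, K, dt, theta):
    """One theta-weighted implicit step of dq/dt = -K*q.
    theta = 0.5 is Crank-Nicolson; theta = 1.0 is backward Euler."""
    return q * (1.0 - (1.0 - theta) * K * dt) / (1.0 + theta * K * dt)

K, dt = 10.0, 1.0                 # eddy viscosity "large" relative to dt
cn = theta_step(1.0, K, dt, 0.5)  # negative: damped but oscillating
be = theta_step(1.0, K, dt, 1.0)  # positive and monotonically decaying
```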
@andrewgettelman it could be that Bill is ahead of us here, but I think we should press forward to see if we can't detect NaNs at the clubb_intr level for rho_ds_zt like I suggested. If we can't detect them at clubb_intr, then we should definitely try to replicate Bill's debugging in the vertical mixing of scalars in the clubb dycore.
Thanks @adamrher, I would agree with that assessment. We could print the values out as you suggest, and maybe reset them as well as @xhuang-ncar notes.
I'm guessing it will just blow up somewhere else if the cause is 'upstream' of CLUBB.
I wonder whether (A) Xingying has tried this in debug mode to trap the NaNs, and (B) we should get some help stopping the model at the timestep before the crash and restarting with the debugger to see where the NaNs first appear.
But I guess first we should do as you both suggest and verify that CLUBB is getting bad input data.
@andrewgettelman it could be that Bill is ahead of us here, but I think we should press forward to see if we can't detect NaNs at the clubb_intr level for rho_ds_zt like I suggested. If we can't detect them at clubb_intr, then we should definitely try to replicate Bill's debugging in the vertical mixing of scalars in the clubb dycore.
The NaNs are not detected at clubb_intr, nor before the windm_edsclrm_rhs routine.
Comments from @zarzycki :
We have seen similar issues in both our CLUBB CPT (over land, issues with PDC errors even with L32) and issues in E3SM (L72) with instabilities that are eliminated by changing the coupling and thickening the near surface layer...
Julio can probably fill you in on what we've seen in the CPT, but essentially we've mitigated some issues with improving the coupling. I have a branch of code that updates surface stress from CLM inside of the CLUBB loop -- I think Adam H. is working on a parallel fix that rearranges the BC/AC physics, which might be a preferable solution if it works and doesn't totally mess up the climate?
However, we still have some instabilities when the lowest model layer gets too thin. This seems to be due to interactions w/ the surface layer/fluxes and CLUBB diffusivities and can occur over ocean, too, which implies it may not be the exact same mechanism as the PDC errors from before. Just nuking the bottom layer (e.g., going from L72 to L71 in E3SM) seems to greatly alleviate issues, although not an optimal solution...
I'm thinking that it might help to determine whether there is an initialization problem. Do scalars 9-12 look bad already at the first time step? Perhaps some plots comparing the run that crashes and a run that works would be illuminating.
It might also be useful to know whether the problem occurs near the lower surface or aloft. If it's near the lower surface, the problem might have to do with either fine grid spacing, coupling with the surface, or surface emissions. If aloft, it might help to reduce the time step. Again, some plots might help.
I thought we had some action items from the meeting last Tuesday. We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see below). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong. And then we determined that the RHS variable is all NaNs in clubb's scalar mixing subroutine, and so some further debugging there would be helpful too (perhaps as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates?
```
825:MPT: #1 0x00002b55ec5dc306 in mpi_sgi_system (
825:MPT: #2 MPI_SGI_stacktraceback (
825:MPT: header=header@entry=0x7ffeb377a050 "MPT ERROR: Rank 825(g:825) received signal SIGFPE(8).\n\tProcess ID: 64229, Host: r5i4n31, Program: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/bld/cesm.exe\n\tMPT Version: HPE MPT 2.22 03"...) at sig.c:340
825:MPT: #3 0x00002b55ec5dc4ff in first_arriver_handler (signo=signo@entry=8,
825:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b55f6c80080) at sig.c:489
825:MPT: #4 0x00002b55ec5dc793 in slave_sig_handler (signo=8, siginfo=
```
I spoke very briefly with @vlarson about this yesterday and wanted to share some detail that may be helpful.
A.) I am certainly a bit gun-shy, but we did have a bit of a red herring issue a few years ago where I was convinced there was an error in CLM5 + updated CAM-SE, but it really was just a better error trap on CLM's side. Chasing NaNs and stack traces led me down the wrong path -- it was only once I started dumping all state fields every timestep that I was able to pinpoint the regime causing the blowups.
B.) At the head of the E3SM repo, there have been modifications to CLUBB tuning compared to v1. These changes seem to have resulted in situations with "noise," even in vertically integrated fields. For example, @jjbenedict is running some VR-E3SM simulations with me for a DoE project and noted noise in the precip field of Hurricane Irene hindcasts (below) with dev code at E3SM head (2 subtly different tuning configs). A quick look at the published v1 data showed either no (or at least much less) noise, which seems to imply a response to changes in CLUBB tuning.
NOTE: while we are focused on the TC, you can also see some noise in the top right, which means this may not be 100% a TC response. NOTEx2: this noise is also apparent in at least some CLUBB moments -- I don't have any of these runs anymore, but I distinctly remember seeing a ~2dx mode in upwp.
After removing the lowest model level (i.e., merging lev(71) and lev(70) from a 0-based perspective), the lowest model level rises from ~15m to ~40m (still lower than CAM6's L32...). Just "thickening" the lowest model level eliminates the instability, even with everything else (init, config, etc.) identical to before (again, the only change is L72 -> L71). See below precip snapshot.
Anecdotally, we have also had crashes (generally over cold mountainous areas) with L72 that go away with L71. Haven't followed up too closely, however.
My (completely unsubstantiated) hypothesis is that both CLUBB and the dycore are essentially providing vertical diffusion to the model -- whether that be considered implicit or explicit. Generally, this isn't an issue with thicker layers, but thinner layers have less mass/inertia and are more susceptible to CFL errors, so it's possible they are also more susceptible to the combined diffusion from the dynamics + physics. This may also help explain some of the horizontal resolution sensitivity, as the dycore's explicit diffusion will scale with resolution, so some of the power in the smaller scales + sharper gradients will still modulate the vertical diffusion, even in the absence of explicit changes to the vertical coordinate.
Anyway, my (super stupid easy test) would be to run a similar case with MPAS to the one that crashes but with a thicker lowest level -- everything else (horiz res, init, forcing, config) remains identical. Merging the bottom two levs was the easiest test for me, but I suppose just shifting everything up a bit probably would work. If the model follows the same stability trajectory, we are right where we started, but if it doesn't, it provides a more focused target for debugging.
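The thin-layer hypothesis above can be put in rough numbers via the nondimensional vertical diffusion number K dt/dz^2: halving dz quadruples it, so the L72 -> L71 merge (dz from ~15 m to ~40 m) reduces it by roughly a factor of 7. A sketch with illustrative values (the K and dt below are made up for scale, not taken from any run):

```python
def diffusion_number(K, dt, dz):
    """Nondimensional vertical diffusion number K*dt/dz^2."""
    return K * dt / dz**2

K, dt = 10.0, 300.0                       # m^2/s and s; illustrative only
d_thin  = diffusion_number(K, dt, 15.0)   # ~15 m lowest layer (L72)
d_thick = diffusion_number(K, dt, 40.0)   # ~40 m after merging (L71)
print(d_thin, d_thick, d_thin / d_thick)  # ratio is (40/15)^2, ~7.1
```

The ratio depends only on the layer thicknesses, which is why thickening the bottom layer can stabilize the run even with everything else unchanged.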
Question for @vlarson: I noticed that the vertical dissipation in CLUBB is implemented using semi-implicit Crank-Nicolson time integration. This can produce oscillatory behavior for large eddy viscosities (i.e., large K dt/dz^2). Is there a way we can change the weights on the time levels in the integration? It looks like the weights (1/2, 1/2) are hardwired in the code. It may be that the vertical discretization in the nonhydrostatic solver in MPAS, which uses a Lorenz vertical staggering, is not happy with oscillations that may be produced by the mixing in these thin layers. I'm also wondering about potential feedback from the mixing and the nonlinear computation of the eddy viscosities.
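For reference, here is what a theta-weighted version of the mixing step looks like in a toy 1-D setting (a NumPy stand-in, not CLUBB's solver; the grid size, Dirichlet boundaries, and parameter values are illustrative assumptions). theta = 0.5 reproduces the hardwired (1/2, 1/2) Crank-Nicolson weights, and theta = 1 gives the better-behaved backward Euler limit:

```python
import numpy as np

def theta_diffusion_step(u, K, dt, dz, theta=0.5):
    """One step of u_t = K u_zz with theta-weighted time levels
    (theta=0.5 -> Crank-Nicolson, theta=1.0 -> backward Euler).
    Dirichlet u=0 at both ends; dense solve for clarity."""
    n = u.size
    r = K * dt / dz**2
    L = np.zeros((n, n))                  # discrete Laplacian, interior rows only
    for i in range(1, n - 1):
        L[i, i-1], L[i, i], L[i, i+1] = 1.0, -2.0, 1.0
    A = np.eye(n) - theta * r * L         # implicit side
    B = np.eye(n) + (1.0 - theta) * r * L # explicit side
    return np.linalg.solve(A, B @ u)

# A spiky initial profile with a stiff diffusion number (r ~ 13):
u0 = np.zeros(9); u0[4] = 1.0
cn = theta_diffusion_step(u0, K=10.0, dt=300.0, dz=15.0, theta=0.5)
be = theta_diffusion_step(u0, K=10.0, dt=300.0, dz=15.0, theta=1.0)
print(cn.min(), be.min())  # CN undershoots below zero; BE stays >= 0
```

The undershoot in the Crank-Nicolson step is exactly the kind of non-physical oscillation that could feed back through the nonlinear eddy-viscosity computation.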
I recently heard from Adam H that when the model is compiled with debugging ON, the crash is seen to originate in the dry deposition code, not in CLUBB. There is a particular line flagged that involves calculations with surface pressure. Am I misunderstanding?
@adamrher, can you clarify @JulioTBacmeister 's comment... might it be coming from surface pressure calculations in the dry dep code? Thanks!
@andrewgettelman my earlier comment seems to have gotten lost in the thread:
I thought we had some action items from the meeting last Tuesday. We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see below). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong.
And here is the line
```fortran
cvarb = vonkar/log( z(i)/z0b(i) )
```
where z is an elevation derived from hydrostatic balance, and so uses the surface pressure.
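That line can trip an FPE in several ways: if z(i) <= 0 (e.g., from a bad surface pressure feeding the hydrostatic height), the log argument is invalid; if z(i) equals z0b(i), the log is exactly zero and the division blows up. A Python sketch of the failure modes and a possible guard (vonkar is the von Karman constant, ~0.4; the guard threshold is an illustrative choice, not a proposed fix to mo_drydep.F90):

```python
import math

VONKAR = 0.4  # von Karman constant

def cvarb_unsafe(z, z0b):
    # Direct transcription of: cvarb = vonkar/log( z(i)/z0b(i) )
    return VONKAR / math.log(z / z0b)

def cvarb_guarded(z, z0b, min_ratio=1.0 + 1e-6):
    """Return None instead of an FPE when log(z/z0b) is undefined
    or vanishingly small. Threshold is illustrative only."""
    if z <= 0.0 or z0b <= 0.0 or z / z0b < min_ratio:
        return None
    return VONKAR / math.log(z / z0b)

print(cvarb_unsafe(10.0, 0.1))   # fine: 0.4/log(100)
print(cvarb_guarded(0.1, 0.1))   # z == z0b -> None, not a divide-by-zero
print(cvarb_guarded(-5.0, 0.1))  # negative height (bad p_sfc?) -> None
```

Either a guard of this kind, or a check on the upstream z values, would distinguish "dry dep is the source" from "dry dep is the first place a bad surface pressure gets trapped."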
In parallel, I think we should also consider @skamaroc's point that the time integration of CLUBB's eddy diffusivity may be producing oscillations that do not play well with the MPAS solver. On our end, I suggest we continue with the other action item from last week's meeting:
And then we determined that the RHS variable is all NaNs in clubb's scalar mixing subroutine, and so some further debugging there would be helpful too (perhaps as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates?
Opening an issue to describe crashes with high vertical resolution.
So far this has only been seen with higher resolution simulations, and with CAM-MPAS.
The basic test case is a 58L CAM-MPAS aquaplanet, which crashes almost immediately with an error from CLUBB:
The errors are coming out of CLUBB (we are not necessarily convinced it's CLUBB's fault yet) in advance_windm_edsclrm_module.F90.
The error is:
405: Fatal error solving for eddsclrm
405: Error in advance_windm_edsclrm
The error has been seen by @skamaroc and Xingying Huang (not sure their github names yet).
Vince Larson notes that:
The cause could be initial conditions (not initializing CLUBB variables). Or it could be upstream of CLUBB (and be the input winds).
Still trying to debug....