ESCOMP / CAM

Community Atmosphere Model

CAM Crashes with 58 levels and higher horiz resolution #442

Closed andrewgettelman closed 2 years ago

andrewgettelman commented 2 years ago

Opening an issue to describe crashes with high vertical resolution.

So far this has only been seen with higher resolution simulations, and with CAM-MPAS.

The basic test case is a 58-level (58L) CAM-MPAS aquaplanet run, which crashes almost immediately with an error from CLUBB.

The errors are coming out of CLUBB (we are not yet convinced it is CLUBB's fault) in advance_windm_edsclrm_module.F90.

The error is:

405: Fatal error solving for eddsclrm
405: Error in advance_windm_edsclrm

The error has been seen by @skamaroc and Xingying Huang (not sure of their GitHub names yet).

Vince Larson notes that:

edsclrm is CLUBB's array of scalars diffused by CLUBB's eddy diffusivity. windm is CLUBB's representation of the horizontal wind components. ("m" in CLUBB-speak denotes "grid-mean." So "windm" refers to the grid-mean values of u and v.)

I am guessing that the wind is the problem, not the eddy scalars, which are chemical species, etc.

The cause could be initial conditions (not initializing CLUBB variables). Or it could be upstream of CLUBB (and be the input winds).

Still trying to debug....

vlarson commented 2 years ago

edsclrm is CLUBB's array of scalars diffused by CLUBB's eddy diffusivity. windm is CLUBB's representation of the horizontal wind components. ("m" in CLUBB-speak denotes "grid-mean." So "windm" refers to the grid-mean values of u and v.) I am guessing that the wind is the problem, not the eddy scalars, which are chemical species, etc.

However, if the model crashes within 10 time steps, even in aquaplanet mode, when it runs with 120-km resolution, then perhaps a variable is not initialized properly. Maybe running with floating point trapping turned on could catch the first NaN, which might lead us to an initialization error.
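(As an aside: if anyone wants to trap in code rather than relying on the DEBUG build flags, the standard ieee_exceptions module can do it. The routine below is only a sketch, not something currently in CAM, and assumes a compiler that implements the IEEE intrinsic modules.)

  ! Hypothetical sketch: abort at the first invalid operation (0/0, log of a
  ! negative number, etc.) so the traceback points at the first NaN-producing
  ! line. CESM's DEBUG=TRUE compiler flags achieve the same effect.
  subroutine enable_fpe_halting()
    use, intrinsic :: ieee_exceptions, only: ieee_set_halting_mode, &
                                             ieee_invalid, ieee_divide_by_zero, &
                                             ieee_overflow
    call ieee_set_halting_mode( ieee_invalid,        .true. )
    call ieee_set_halting_mode( ieee_divide_by_zero, .true. )
    call ieee_set_halting_mode( ieee_overflow,       .true. )
  end subroutine enable_fpe_halting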

xhuang-ncar commented 2 years ago

@vlarson The model works for CAM-MPAS at 120km resolution with ZM2 (replacing zm_conv_intr.F90 and zm_conv.F90 with the versions shared by Adam). However, that solution does not work for the 60km aquaplanet run (which Bill tested) or the 60-3km full-topography run (which I tested); both fail with the same error message.

Here are the first NaN values printed out in my cesm log file:

edsclrm # 9 = NaN Na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.

Setting the edsclrm values to zero causes other issues, with the model crashing at the same time step.

Adam highlighted the relevant code blocks for the functions windm_edsclrm_rhs, windm_edsclrm_lhs, and windm_edsclrm_solve, and for the clubb_at_least_debug_level check:

https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L464-L524

https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L555-L593

swrneale commented 2 years ago

Sorry if I missed it, but is the crash in a 3km column and is that column at or near orography?

vlarson commented 2 years ago

edsclrm # 9 = NaN Na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.

Is edsclrm # 9, 10, 11, or 12 initialized to a reasonable value?

Turning on floating point trapping might catch an uninitialized variable if there is one.

MiCurry commented 2 years ago

Is the dynamics running at all? Or does this failure happen before the dynamics is run?

andrewgettelman commented 2 years ago

@swrneale : we can get the crash in an aquaplanet model, so no land or topography. We have not located the point and level where it is happening. @xhuang-ncar will need some guidance on that (I don't really know how to pull out a column). Thanks!

JulioTBacmeister commented 2 years ago

Could you send the location of your log files?

xhuang-ncar commented 2 years ago

@JulioTBacmeister Sure. The log file is here: https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.660137.chadmin1.ib0.cheyenne.ucar.edu.210925-034021

xhuang-ncar commented 2 years ago

Is the dynamics running at all? Or does this failure happen before the dynamics is run?

Yes, it was running. It crashed after 16 time steps. Also, I am using 120s as the dtime given the refined 3km resolution.

xhuang-ncar commented 2 years ago

edsclrm # 9 = NaN Na NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 ... 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN edsclrm # 10, edsclrm # 11 and edsclrm # 12 are all NaN values.

Is edsclrm # 9, 10, 11, or 12 initialized to a reasonable value?

Turning on floating point trapping might catch an uninitialized variable if there is one.

Not sure about that. How should it be initialized normally? Let me turn on debug mode and also print out the input variables of the windm_edsclrm_solve function to the log file.

JulioTBacmeister commented 2 years ago

Thanks Xingying. This is a question for everyone. The log file shows error messages that clearly originate in Adam's blocked-out portions of 'advance_windm_edsclrm', yet the messages in the log file don't advance through the entire if ( err_code == clubb_fatal_error ) then ... endif block. They are always interrupted by what appears to be a new entry into the error block, e.g.:

776: up2 = 0.351420540093032 0.297728678007049 0.266079981897943
776: 0.247924301489710 0.234454106712590 0.222438014461757
776: 0.210447703011471 0.199152298849457 0.192144528317254
776: 0.195549715651495 0.214431704931684 0.244532312820753
800: Fatal error solving for eddsclrm
800: Error in advance_windm_edsclrm
800: Intent(in)
800: dt = 40.0000000000000
800: wm_zt = 0.000000000000000E+000 9.109764803756008E-005
800: 3.344162123594033E-004 6.675431334319516E-004 1.065847183270905E-003

So here, as up2 is being written out, a new error stream suddenly starts. Does this mean the code is failing on multiple processes at the same time? If so, can the error messages be forced to go through the entire block so that we get a clear idea of the profiles going into advance_windm_edsclrm?

adamrher commented 2 years ago

@JulioTBacmeister Looking through the log, I am only seeing NaNs for processor 776. However, I am finding that the entire write-statement code block is being written out for 776, because its last entries in the cesm.log are:

776: wpedsclrp = 0.000000000000000E+000 1.038482017559906E-005
776: 9.159381691910688E-006 7.466857448689597E-006 7.067816049197199E-006
776: 8.285625416776367E-006 1.226171921624444E-005 1.963394231215161E-005
776: 3.335552805102376E-005 4.535006155606812E-005 4.986023016470998E-005
776: 5.817265387986108E-005 6.110989684521623E-005 2.089284626521818E-005
776: 2.964944981635684E-006 2.922172823855216E-007 2.468995295154835E-008
776: 1.433156461244229E-008 3.468759159148850E-008 2.422783580467906E-008
...

Where wpedsclrp is the last entry in that code block: https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L555-L593

776 isn't the only processor with "Error in advance_windm_edsclrm," but it is the only one with NaNs.

If it were me, I'd probably write out all the input/output variables to the subroutine calls windm_edsclrm_rhs, windm_edsclrm_lhs, windm_edsclrm_solve. I'd first try conditioning these write statements on if ( err_code == clubb_fatal_error ) then right here: https://github.com/ESCOMP/CLUBB_CESM/blob/f6fd53041aac4aa40238df86d30a6aff0e74a8fb/advance_windm_edsclrm_module.F90#L520-L524
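A minimal sketch of such a conditional dump (variable names are the ones already used in this module; the exact argument lists would need to match the real calls):

  ! Hypothetical sketch: dump the solver inputs/outputs on failing tasks only,
  ! right after the existing "Fatal error solving for eddsclrm" message.
  if ( clubb_at_least_debug_level( 0 ) ) then
    if ( err_code == clubb_fatal_error ) then
      write(fstderr,*) "windm_edsclrm solver diagnostics:"
      write(fstderr,*) "lhs      = ", lhs
      write(fstderr,*) "rhs      = ", rhs
      write(fstderr,*) "solution = ", solution
      write(fstderr,*) "edsclrm  = ", edsclrm
    end if
  end if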

Lastly, if you set ./xmlchange DEBUG=TRUE, does that turn on floating point trapping as suggested by Vince (@cacraigucar)? I presume this needs to be run on Cheyenne (because it's such a large grid), and so I think only the Intel compiler is used.

xhuang-ncar commented 2 years ago

@adamrher Thanks for the suggestion. I will give it a try. Also, after setting DEBUG to TRUE as Vince suggested here, the run crashed with an MPT ERROR as in this log file (https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.864693.chadmin1.ib0.cheyenne.ucar.edu.211004-224404). How should I interpret this? (It is on Cheyenne with the Intel compiler.)

xhuang-ncar commented 2 years ago

Sorry if I missed it, but is the crash in a 3km column and is that column at or near orography?

Both the 60km (aquaplanet) and 60-3km (full topography) runs crashed with this CLUBB error when using L58. Can we locate those NaN values?

I am also trying to figure out whether it works for CAM-SE at 25km (I am setting up a test for that).

vlarson commented 2 years ago

@adamrher Thanks for the suggestion. I will give it a try. Also, after setting DEBUG to TRUE as Vince suggested here, the run crashed with an MPT ERROR as in this log file (https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.864693.chadmin1.ib0.cheyenne.ucar.edu.211004-224404). How should I interpret this?

The line of code that contained the first FPE is here:

825:MPT: #1  0x00002b55ec5dc306 in mpi_sgi_system (
825:MPT: #2  MPI_SGI_stacktraceback (
825:MPT:     header=header@entry=0x7ffeb377a050 "MPT ERROR: Rank 825(g:825) received signal SIGFPE(8).\n\tProcess ID: 64229, Host: r5i4n31, Program: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/bld/cesm.exe\n\tMPT Version: HPE MPT 2.22  03"...) at sig.c:340
825:MPT: #3  0x00002b55ec5dc4ff in first_arriver_handler (signo=signo@entry=8, 
825:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2b55f6c80080) at sig.c:489
825:MPT: #4  0x00002b55ec5dc793 in slave_sig_handler (signo=8, siginfo=<optimized out>, 
825:MPT:     extra=<optimized out>) at sig.c:565
825:MPT: #5  <signal handler called>
825:MPT: #6  0x000000000a554001 in __libm_log_l9 ()
825:MPT: #7  0x0000000002ce9481 in mo_drydep::drydep_xactive (sfc_temp=..., 
825:MPT:     pressure_sfc=..., wind_speed=..., spec_hum=..., air_temp=..., 
825:MPT:     pressure_10m=..., rain=..., snow=..., solar_flux=..., dvel=..., dflx=..., 
825:MPT:     mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12, ocnfrc=..., icefrc=..., 
825:MPT:     beglandtype=7, endlandtype=8)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110
825:MPT: #8  0x0000000002cb913d in mo_drydep::drydep_fromlnd (ocnfrac=..., icefrac=..., 
825:MPT:     sfc_temp=..., pressure_sfc=..., wind_speed=..., spec_hum=..., 
825:MPT:     air_temp=..., pressure_10m=..., rain=..., snow=..., solar_flux=..., 
825:MPT:     dvelocity=..., dflx=..., 
825:MPT:     mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:210
825:MPT: #9  0x0000000002da7d99 in mo_gas_phase_chemdr::gas_phase_chemdr (lchnk=12, 
825:MPT:     ncol=16, imozart=10, 
825:MPT:     q=<error reading variable: value requires 244992 bytes, which is more than max-value-size>, phis=..., zm=..., zi=..., calday=1.0208333333333333, tfld=..., 
825:MPT:     pmid=..., pdel=..., pint=..., cldw=..., troplev=..., troplevchem=..., 
825:MPT:     ncldwtr=..., ufld=..., vfld=..., delt=120, ps=..., xactive_prates=.FALSE., 
825:MPT:     fsds=..., ts=..., asdir=..., ocnfrac=..., icefrac=..., precc=..., 
825:MPT:     precl=..., snowhland=..., ghg_chem=.FALSE., latmapback=..., drydepflx=..., 
825:MPT:     wetdepflx=..., cflx=..., fire_sflx=<not associated>, 
825:MPT:     fire_ztop=<not associated>, nhx_nitrogen_flx=..., noy_nitrogen_flx=..., 
825:MPT:     qtend=<error reading variable: value requires 244992 bytes, which is more than max-value-size>, pbuf=0x2b75ed00a248)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_gas_phase_chemdr.F90:1063
825:MPT: #10 0x00000000025278a4 in chemistry::chem_timestep_tend (state=..., ptend=..., 
825:MPT:     cam_in=..., cam_out=..., dt=120, pbuf=0x2b75ed00a248, fh2o=...)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/chemistry.F90:1290
825:MPT: #11 0x0000000000ee1b7c in physpkg::tphysac (ztodt=120, cam_in=..., 
825:MPT:     cam_out=..., state=..., tend=..., pbuf=0x2b75ed00a248)
825:MPT:     at /glade/work/xyhuang/CAM-1/src/physics/cam/physpkg.F90:1562

One question is whether this line of code has an FPE even when using standard CAM code that runs fine. If so, then it's a red herring. If not, then it would be interesting to know whether the problem is an uninitialized variable, or whether the FPE appears only after a few time steps.

xhuang-ncar commented 2 years ago

I see. I can set up a quick test using the 32 levels with the DEBUG on to check that out.

xhuang-ncar commented 2 years ago

I see. I can set up a quick test using the 32 levels with the DEBUG on to check that out.

@vlarson As tested, I did not notice an FPE when using the 32 levels, which run without any issue. Also, for the 58 levels, the FPE appears after 15 steps.

@adamrher I have the log file with all the input/output variables to the subroutine calls (windm_edsclrm_rhs, windm_edsclrm_lhs, windm_edsclrm_solve)

https://github.com/xhuang-ncar/CAM-MPAS-L58-issue/blob/main/cesm.log.874406.chadmin1.ib0.cheyenne.ucar.edu.211005-160912

(or on Cheyenne: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/run/cesm.log.874406.chadmin1.ib0.cheyenne.ucar.edu.211005-160912)

Any further ideas about something (any particular values) being abnormal here?

adamrher commented 2 years ago

To summarize, we have two leads:

(1) When CLUBB diffuses scalars #9-12, it gives them NaNs on task 776. However, the same CLUBB error messages are being triggered for other tasks as well, but the log ends before those other tasks print out their updated values of the scalars (which I suspect would show NaNs just like task 776).

(2) We are getting floating point exceptions with DEBUG=TRUE that are not present in the 120km 58-level MPAS runs. Not sure what to make of this.

I'm working with Xingying to reverse engineer these NaNs in (1). And will update the git issue if/when we learn anything.

andrewgettelman commented 2 years ago

There was also an issue with giving MPAS the right topography file. Giving it the correct file seems to fix some (or all) of the issues with the 60km (uniform) MPAS 58L run, according to @PeterHjortLauritzen.

vlarson commented 2 years ago

To summarize, we have two leads:

(1) When CLUBB diffuses scalars #9-12, it gives them NaNs on task 776. However, the same CLUBB error messages are being triggered for other tasks as well, but the log ends before those other tasks print out their updated values of the scalars (which I suspect would show NaNs just like task 776).

I wonder what you'd find if you print out values of scalars 9-12 whenever they are negative, NaN, or too large. Maybe that would point to an initialization problem.

(2) We are getting floating point exceptions with DEBUG=TRUE that are not present in the 120km 58-level MPAS runs. Not sure what to make of this.

If 120 km can run stably for a year without strange output in scalars 9-12, but 60 km crashes after 32 time steps, then I speculate that something is mis-configured or left uninitialized in the 60 km run.

MiCurry commented 2 years ago

There was also an issue with giving MPAS the right topography file. Giving it the correct file seems to fix some (or all) of the issues with the 60km (uniform) MPAS 58L run, according to @PeterHjortLauritzen.

That fix is only for the 60km run with topography. @skamaroc has been seeing this issue with the 60km aquaplanet as well; I just recreated the case.

adamrher commented 2 years ago

@MiCurry are there NaNs in the cesm.log? If so, for which variable?

MiCurry commented 2 years ago

@MiCurry are there NaNs in the cesm.log? If so, for which variable?

@adamrher It's the same as @xhuang-ncar's: edsclrm. @skamaroc's case is here: /glade/scratch/skamaroc/qpc6-60km-58L/run.

adamrher commented 2 years ago

Thanks @MiCurry. Looks like there are also NaNs in wpedsclrp, which isn't surprising because it is derived from edsclrm. Seems like this might be a cheaper grid to debug with, instead of the 60-3km grid.

I've been paying attention to this code block:

  ! Decompose and back substitute for all eddy-scalar variables
  call windm_edsclrm_solve( edsclr_dim, 0, &     ! in
                            lhs, rhs, &          ! in/out
                            solution )           ! out

  if ( clubb_at_least_debug_level( 0 ) ) then
    if ( err_code == clubb_fatal_error ) then
      write(fstderr,*) "Fatal error solving for eddsclrm"
    end if
  end if

  !----------------------------------------------------------------
  ! Update Eddy-diff. Passive Scalars
  !----------------------------------------------------------------
  edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)

I had Xingying print out lhs, rhs, and solution. I expected solution to have NaNs for edsclrm #9, but to my surprise, solution did not have any NaNs, whereas edsclrm did. See:

(Two screenshots of the log output, taken 2021-10-06, showing that solution contains no NaNs while edsclrm does.)

I'd like to proceed with Vince's suggestion: write out the array if there are any NaNs. I would like to focus on lhs, rhs, solution, and edsclrm first. And I'd like to comment out all the write statements under the if ( err_code == clubb_fatal_error ) then conditional, so that the logs are easier to read. Anyone have any other suggestions? @xhuang-ncar or @MiCurry, can you do this?
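A rough sketch of what that NaN-conditioned write could look like (ieee_is_nan is the standard intrinsic; the array bounds follow the assignment above, so treat the exact shapes as assumptions):

  ! Hypothetical sketch: only write the arrays when NaNs are actually present.
  ! (needs: use, intrinsic :: ieee_arithmetic, only: ieee_is_nan)
  if ( any( ieee_is_nan( edsclrm(1:gr%nz,1:edsclr_dim) ) ) .or. &
       any( ieee_is_nan( solution(1:gr%nz,1:edsclr_dim) ) ) ) then
    write(fstderr,*) "NaN detected after windm_edsclrm_solve"
    write(fstderr,*) "lhs      = ", lhs
    write(fstderr,*) "rhs      = ", rhs
    write(fstderr,*) "solution = ", solution
    write(fstderr,*) "edsclrm  = ", edsclrm
  end if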

vlarson commented 2 years ago

I had Xingying print out lhs, rhs, and solution. I expected solution to have NaNs for edsclrm #9, but to my surprise, solution did not have any NaNs, whereas edsclrm did. See:

One notable thing is that in this example, edsclrm 9 has NaNs at all grid levels. Usually if CLUBB suffers a numerical instability, NaNs appear first at just a few grid levels. Maybe there is a memory error.

adamrher commented 2 years ago

It might be worth trying to free up memory by halving the tasks per node: ./xmlchange MAX_TASKS_PER_NODE=16,MAX_MPITASKS_PER_NODE=16

xhuang-ncar commented 2 years ago

@adamrher Sure, I can write out the array when there are any NaNs for lhs, rhs, solution and edsclrm first. Not sure how to do that for every variable though (if needed). Oh, also, I tried to free up memory and it did not work.

adamrher commented 2 years ago

Could you also write out gr%nz when there are NaNs for any of the arrays? This is just a double-check that there's no funny business with this expression:

  edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)

adamrher commented 2 years ago

From Xingying:

I have tested for a month, and CAM-SE ne120pg3 works well with the 58 levels even without the updated ZM2. Does that mean this CLUBB error issue is unique to MPAS?

It's starting to seem that way. I would like to know a little more about the 60km MPAS aquaplanet failures that Miles is reporting. They seem to be the same as the 60-3km failures, but we haven't yet confirmed whether edsclrm is giving NaNs while solution is not, as in the 60-3km runs. I looked at his log and his errors come within a day or so, about 100 steps in. Since we've shown that ne120 (w/ zm1) and ne60 (w/ zm2) using 58 levels run fine for at least a month, that does suggest these errors are unique to MPAS.

xhuang-ncar commented 2 years ago

Could you also write out gr%nz when there are NaNs for any of the arrays? This is just a double-check that there's no funny business with this expression:

  edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)

Sure. I will add that together.

xhuang-ncar commented 2 years ago

From Xingying:

I have tested for a month, and CAM-SE ne120pg3 works well with the 58 levels even without the updated ZM2. Does that mean this CLUBB error issue is unique to MPAS?

It's starting to seem that way. I would like to know a little more about the 60km MPAS aquaplanet failures that Miles is reporting. They seem to be the same as the 60-3km failures, but we haven't yet confirmed whether edsclrm is giving NaNs while solution is not, as in the 60-3km runs. I looked at his log and his errors come within a day or so, about 100 steps in. Since we've shown that ne120 (w/ zm1) and ne60 (w/ zm2) using 58 levels run fine for at least a month, that does suggest these errors are unique to MPAS.

@MiCurry Could you share the path of the ncdata for the 60km run with topography? Since that one works well with the 58 levels, I'd like to double-check it against the ncdata for the 60-3km case I am using. I am also trying to set up another aquaplanet simulation at 60km, as Bill did previously, to double-check the NaN values. However, I am encountering an error: "ERROR: (shr_ncread_varDimNum) ERROR inq varid: xc". Any idea what causes this kind of error?

xhuang-ncar commented 2 years ago

I had Xingying print out lhs, rhs, and solution. I expected solution to have NaNs for edsclrm #9, but to my surprise, solution did not have any NaNs, whereas edsclrm did. See:

My apologies, the output is not correct here. I found an error in my code when trying to print out those values (for lhs, rhs, and solution).

xhuang-ncar commented 2 years ago

could you also write out gr%nz when there are NaN's for any of the arrays? This is just a double-check that there's no funny business with this expression:

  edsclrm(1:gr%nz,1:edsclr_dim) = solution(1:gr%nz,1:edsclr_dim)

Sure. I will add that together.

@adamrher (and everyone), an update: I am now getting the NaNs printed out for edsclrm, lhs, rhs, and solution whenever they occur.

Here is what it looks like as a snapshot of the output in the log file:

(Screenshot of the log output, taken 2021-10-08, showing the NaN diagnostics for edsclrm, lhs, rhs, and solution.)

The NaN error originates in the rhs values for scalars #9, 10, 11, and 12. It then drives solution and edsclrm to NaNs; the lhs is fine here. Here is the call that computes the rhs (the explicit portion of the eddy-scalar equation) in advance_windm_edsclrm_module.F90:

  call windm_edsclrm_rhs( windm_edsclrm_scalar, dt, dummy_nu, Kmh_zm, &
                          edsclrm(:,i), edsclrm_forcing, &
                          rho_ds_zm, invrs_rho_ds_zt, &
                          l_imp_sfc_momentum_flux, wpedsclrp(1,i), &
                          rhs(:,i) )

adamrher commented 2 years ago

Phew ... I thought we were in bizarro world, but this makes sense. The next step is to backtrack and see whether any of the intent(in) vars of windm_edsclrm_rhs are NaNs. This is my debugging approach: keep moving back until you find the source of the NaNs.
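Concretely, a check just before the windm_edsclrm_rhs call might look like this (argument names copied from the call quoted above; a sketch only):

  ! Hypothetical sketch: flag NaNs in the intent(in) arguments before the call.
  ! (needs: use, intrinsic :: ieee_arithmetic, only: ieee_is_nan)
  if ( any( ieee_is_nan( Kmh_zm ) )          .or. &
       any( ieee_is_nan( edsclrm(:,i) ) )    .or. &
       any( ieee_is_nan( edsclrm_forcing ) ) .or. &
       any( ieee_is_nan( rho_ds_zm ) )       .or. &
       any( ieee_is_nan( invrs_rho_ds_zt ) ) ) then
    write(fstderr,*) "NaN in windm_edsclrm_rhs inputs for scalar ", i
  end if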

I'm sorry this is taking so long. Spending weeks debugging the same issue is probably the least fun aspect of model development. But we'll get past this.

xhuang-ncar commented 2 years ago

@adamrher Right, that makes sense. It is taking too long, but thanks for helping us through it. I will see what I can do ...

Tracking back through the windm_edsclrm_rhs routine, I see that rho_ds_zm and invrs_rho_ds_zt have some NaN values as input too. How are those two variables calculated in CAM? I am thinking of a quick fix for those two variables to see if that's the cause.

rho_ds_zm,       & ! Dry, static density on momentum levels      [kg/m^3]
invrs_rho_ds_zt    ! Inv. dry, static density at thermo. levels  [m^3/kg]

vlarson commented 2 years ago

@adamrher Right, that makes sense. It is taking too long, but thanks for helping us through it. I will see what I can do ...

Tracking back through the windm_edsclrm_rhs routine, I see that rho_ds_zm and invrs_rho_ds_zt have some NaN values as input too. How are those two variables calculated in CAM? I am thinking of a quick fix for those two variables to see if that's the cause.

rho_ds_zm,       & ! Dry, static density on momentum levels      [kg/m^3]
invrs_rho_ds_zt    ! Inv. dry, static density at thermo. levels  [m^3/kg]

Those density variables are based on CAM's pressure fields. If the density fields have NaNs, then the solution has probably gone bad long before reaching this time step.

adamrher commented 2 years ago

@xhuang-ncar I would print out rho_ds_zt at the clubb_intr level, here: https://github.com/ESCOMP/CAM/blob/3676c6ec1c8dfd19e20ea764c0226792574481f0/src/physics/cam/clubb_intr.F90#L2263

And condition the write statements on NaNs. I'd also print out the components state1%pdel(i,pver-k+1) and dz_g(pver-k+1). This will confirm that the NaNs are being carried into CLUBB via state%, and so must be generated further upstream in the CAM time-loop, as Vince suggests.
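Something like this, inside the existing column/level loops (a sketch; iulog as CAM's log unit, and the indexing should be adjusted to match the actual declarations in clubb_intr.F90):

  ! Hypothetical sketch: report NaNs in rho_ds_zt together with the CAM state
  ! fields it is built from.
  ! (needs: use, intrinsic :: ieee_arithmetic, only: ieee_is_nan)
  if ( ieee_is_nan( rho_ds_zt(i,k) ) ) then
    write(iulog,*) 'NaN in rho_ds_zt at i, k = ', i, k, &
                   ' pdel = ', state1%pdel(i,pver-k+1), &
                   ' dz_g = ', dz_g(pver-k+1)
  end if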

xhuang-ncar commented 2 years ago

@xhuang-ncar I would print out rho_ds_zt at the clubb_intr level, here:

https://github.com/ESCOMP/CAM/blob/3676c6ec1c8dfd19e20ea764c0226792574481f0/src/physics/cam/clubb_intr.F90#L2263

And condition the write statements on NaNs. I'd also print out the components state1%pdel(i,pver-k+1) and dz_g(pver-k+1). This will confirm that the NaNs are being carried into CLUBB via state%, and so must be generated further upstream in the CAM time-loop, as Vince suggests.

I see. Before doing that, I am curious: can we reset those NaN values first to check whether that's the cause of the error?

andrewgettelman commented 2 years ago

This might not be that relevant based on what we have been discussing, but I'm going to forward a comment from @skamaroc:

I'm looking into the CLUBB error we see in the 58 level configuration. The error message we are seeing is generated after the call to the tridiagonal solve for the vertical mixing of scalars. LAPACK is used in the tridiagonal solve, and here is the code that sets the error condition after it is called:

  select case( info )
  case( :-1 )
    write(fstderr,*) trim( solve_type )// &
      " illegal value in argument", -info
    err_code = clubb_fatal_error

    solution = -999._core_rknd

  case( 0 )
    ! Success!
    if ( lapack_isnan( ndim, nrhs, rhs ) ) then
      err_code = clubb_fatal_error
    end if

    solution = rhs

  case( 1: )
    write(fstderr,*) trim( solve_type )//" singular matrix."
    err_code = clubb_fatal_error

    solution = -999._core_rknd

  end select

  return

  end subroutine tridag_solve

Based on this code, it would appear that the LAPACK tridiagonal solve is returning NaNs in at least one column, because it would have printed an error message for either of the other two error conditions ("singular matrix" or "illegal value"). A step we could take is to instrument the NaN check to tell us where this is occurring, and perhaps print out the inputs to the tridiagonal solve for that column. That might lead to bounding those inputs in some (perhaps physical) manner upstream of the tridiagonal solve. We're down in the weeds here, and we would likely need help from Vince and his group to do this.
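For instance, the case( 0 ) branch could be instrumented along these lines (a sketch, assuming rhs is dimensioned (ndim, nrhs) as the lapack_isnan call implies, with j and k as local loop indices):

  case( 0 )
    ! Hypothetical sketch: report which right-hand-side column and level of the
    ! LAPACK solution first went bad, instead of only setting the error code.
    ! (needs: use, intrinsic :: ieee_arithmetic, only: ieee_is_nan)
    do j = 1, nrhs
      do k = 1, ndim
        if ( ieee_is_nan( rhs(k,j) ) ) then
          write(fstderr,*) trim( solve_type )//" NaN in solution at level ", &
                           k, " column ", j
          err_code = clubb_fatal_error
        end if
      end do
    end do

    solution = rhs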

I also noticed CLUBB is using Crank-Nicolson time integration for the eddy mixing here. This is centered in time, and while second-order accurate it can lead to some odd oscillatory behavior for large eddy viscosities (large relative to the timestep and vertical mesh spacing). I didn't see an option for weighting the time levels differently (i.e., biasing toward the implicit contribution, which, while only first-order in time, is better behaved for large eddy viscosities). I do not know if large eddy viscosities are causing the problem. Any insights here?

adamrher commented 2 years ago

@andrewgettelman it could be that Bill is ahead of us here, but I think we should press forward to see if we can't detect NaNs at the clubb_intr level for rho_ds_zt like I suggested. If we can't detect them at clubb_intr then we should definitely try to replicate Bill's debugging in the vertical mixing of scalars in the clubb dycore.

andrewgettelman commented 2 years ago

Thanks @adamrher, I would agree with that assessment. We could print the values out as you suggest, and maybe reset them as well as @xhuang-ncar notes.

I'm guessing it will just blow up somewhere else if the cause is 'upstream' of CLUBB.

I wonder whether (A) Xingying has tried this in debug mode to trap the NaNs, and whether that works, and (B) we should get some help stopping the model at the time step before the crash and restarting with the debugger to see where the NaNs first appear.

But I guess first we should do as you both suggest and verify that CLUBB is getting bad input data.

xhuang-ncar commented 2 years ago

@andrewgettelman it could be that Bill is ahead of us here, but I think we should press forward to see if we can't detect NaNs at the clubb_intr level for rho_ds_zt like I suggested. If we can't detect them at clubb_intr then we should definitely try to replicate Bill's debugging in the vertical mixing of scalars in the clubb dycore.

The NaNs are not detected at clubb_intr, nor before the windm_edsclrm_rhs routine.

andrewgettelman commented 2 years ago

Comments from @zarzycki :

We have seen similar issues both in our CLUBB CPT (over land, PDC errors even with L32) and in E3SM (L72), with instabilities that are eliminated by changing the coupling and thickening the near-surface layer...

Julio can probably fill you in on what we've seen in the CPT, but essentially we've mitigated some issues with improving the coupling. I have a branch of code that updates surface stress from CLM inside of the CLUBB loop -- I think Adam H. is working on a parallel fix that rearranges the BC/AC physics, which might be a preferable solution if it works and doesn't totally mess up the climate?

However, we still have some instabilities when the lowest model layer gets too thin. This seems to be due to interactions w/ the surface layer/fluxes and CLUBB diffusivities and can occur over ocean, too, which implies it may not be the exact same mechanism as the PDC errors from before. Just nuking the bottom layer (e.g., going from L72 to L71 in E3SM) seems to greatly alleviate issues, although not an optimal solution...

vlarson commented 2 years ago

I'm thinking that it might help to determine whether there is an initialization problem.  Do scalars 9-12 look bad already at the first time step?  Perhaps some plots comparing the run that crashes and a run that works would be illuminating.

It might also be useful to know whether the problem occurs near the lower surface or aloft.  If it's near the lower surface, the problem might have to do with either fine grid spacing, coupling with the surface, or surface emissions.  If aloft, it might help to reduce the time step.  Again, some plots might help.

adamrher commented 2 years ago

I thought we had some action items from the meeting last Tuesday? We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see below). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong. And then we determined that the RHS variable is all NaNs in CLUBB's scalar mixing subroutine, and so some further debugging there would be helpful too (perhaps, as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates?

825:MPT: #1 0x00002b55ec5dc306 in mpi_sgi_system (
825:MPT: #2 MPI_SGI_stacktraceback (
825:MPT: header=header@entry=0x7ffeb377a050 "MPT ERROR: Rank 825(g:825) received signal SIGFPE(8).\n\tProcess ID: 64229, Host: r5i4n31, Program: /glade/scratch/xyhuang/mass-scaling-f2000-climo-ca-v7-2304/bld/cesm.exe\n\tMPT Version: HPE MPT 2.22 03"...) at sig.c:340
825:MPT: #3 0x00002b55ec5dc4ff in first_arriver_handler (signo=signo@entry=8,
825:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b55f6c80080) at sig.c:489
825:MPT: #4 0x00002b55ec5dc793 in slave_sig_handler (signo=8, siginfo=<optimized out>,
825:MPT: extra=<optimized out>) at sig.c:565
825:MPT: #5 <signal handler called>
825:MPT: #6 0x000000000a554001 in __libm_log_l9 ()
825:MPT: #7 0x0000000002ce9481 in mo_drydep::drydep_xactive (sfc_temp=...,
825:MPT: pressure_sfc=..., wind_speed=..., spec_hum=..., air_temp=...,
825:MPT: pressure_10m=..., rain=..., snow=..., solar_flux=..., dvel=..., dflx=...,
825:MPT: mmr=<error reading variable: value requires 185600 bytes, which is more than max-value-size>, tv=..., ncol=16, lchnk=12, ocnfrc=..., icefrc=...,
825:MPT: beglandtype=7, endlandtype=8)
825:MPT: at /glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110

zarzycki commented 2 years ago

I spoke very briefly with @vlarson about this yesterday and wanted to share some detail that may be helpful.

A.) I am certainly a bit gunshy, but we did have a bit of a red herring issue a few years ago where I was convinced there was an error in CLM5 + updated CAM-SE, but it really was just a better error trap on CLM's side. Chasing NaNs and stack traces led me down the wrong path -- it was only once I started dumping all state fields every timestep that I was able to pinpoint the regime causing the blowups.

B.) At the head of the E3SM repo, there have been modifications to CLUBB tuning compared to v1. These changes seem to have resulted in situations with "noise," even in vertically integrated fields. For example, @jjbenedict is running some VR-E3SM simulations with me for a DoE project and noted noise in the precip field of Hurricane Irene hindcasts (below) with dev code at E3SM head (2 subtly different tuning configs). A quick look at the published v1 data showed either no (or at least much less) noise, which seems to imply a response to changes in CLUBB tuning.

NOTE: while we are focused on the TC, you can also see some noise in the top right, which means this may not be 100% a TC response. NOTEx2: this noise is also apparent in at least some CLUBB moments -- I don't have any of these runs anymore, but I distinctly remember seeing a ~2dx mode in upwp.

(Figure: precipitation snapshot from the Hurricane Irene hindcast with L72, showing the noise.)

Removing the lowest model level (i.e., merging lev(71) and lev(70), 0-based) raises the lowest model level from ~15m to ~40m (still lower than CAM6's L32...). Just "thickening" the lowest model level eliminates such instability, even with everything else (init, config, etc.) identical to before (again, the only change is L72 -> L71). See the precip snapshot below.

(Figure: the same Hurricane Irene precipitation snapshot with L71; the noise is gone.)

Anecdotally, we have also had crashes (generally over cold mountainous areas) with L72 that go away with L71. Haven't followed up too closely, however.

My (completely unsubstantiated) hypothesis is that both CLUBB and the dycore are essentially providing vertical diffusion to the model -- whether that be considered implicit or explicit. Generally this isn't an issue with thicker layers, but thinner layers have less mass/inertia and are more susceptible to CFL errors, so it's possible they are also more susceptible to the combined diffusion from the dynamics + physics. This may also help explain some of the horizontal resolution sensitivity, as the dycore's explicit diffusion scales with resolution, so some of the power in the smaller scales + sharper gradients will still modulate the vertical diffusion, even in the absence of explicit changes to the vertical coordinate.

Anyway, my (super stupid easy) test would be to run a case with MPAS similar to the one that crashes, but with a thicker lowest level -- everything else (horiz res, init, forcing, config) remains identical. Merging the bottom two levs was the easiest test for me, but I suppose just shifting everything up a bit would probably work. If the model follows the same stability trajectory, we are right where we started, but if it doesn't, it provides a more focused target for debugging.

skamaroc commented 2 years ago

Question for @vlarson: I noticed that the vertical dissipation in CLUBB is implemented using semi-implicit Crank-Nicolson time integration. This can produce oscillatory behavior for large eddy viscosities (i.e. large K dt/dz^2). Is there a way we can change the weights on the time levels in the integration? It looks like the weights (1/2, 1/2) are hardwired in the code. It may be that the vertical discretization in the nonhydrostatic solver in MPAS, that uses a Lorenz vertical staggering, is not happy with oscillations that may be produced by the mixing in these thin layers. I'm also wondering about potential feedback from the mixing and the nonlinear computation of the eddy viscosities.
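For reference, the theta-weighted form of the diffusion update being discussed is

$$
\frac{\phi_k^{n+1}-\phi_k^{n}}{\Delta t}
  = \theta \left[ \frac{\partial}{\partial z}\!\left( K \frac{\partial \phi}{\partial z} \right) \right]_k^{n+1}
  + (1-\theta) \left[ \frac{\partial}{\partial z}\!\left( K \frac{\partial \phi}{\partial z} \right) \right]_k^{n},
$$

where theta = 1/2 recovers Crank-Nicolson (second-order in time, but only neutrally damping the 2-dz mode as K dt/dz^2 becomes large) and theta = 1 gives backward Euler (first-order, but strongly damping). This is just the standard theta-method written out for orientation; whether CLUBB exposes such a weight is exactly the question above.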

JulioTBacmeister commented 2 years ago

I recently heard from Adam H that when the model is compiled with debugging ON, the crash is seen to originate in the dry deposition code, not in CLUBB. There is a particular line flagged that involves calculations with surface pressure. Am I misunderstanding?

andrewgettelman commented 2 years ago

@adamrher, can you clarify @JulioTBacmeister's comment... might it be coming from surface pressure calculations in the dry dep code? Thanks!

adamrher commented 2 years ago

@andrewgettelman my earlier comment seems to have gotten lost in the thread:

I thought we had some action items from the meeting last Tuesday? We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see below). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong.

And here is the line

   cvarb  = vonkar/log( z(i)/z0b(i) )

where z is an elevation derived from hydrostatic balance, and so uses the surface pressure.
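If the surface pressure (and hence z) goes bad, or z0b is zero, the log argument becomes non-positive or NaN, which could trigger exactly this kind of SIGFPE under DEBUG=TRUE. A quick diagnostic sketch (r8 and iulog as the usual CAM kind and log unit; treat the details as assumptions):

   ! Hypothetical diagnostic: flag a bad argument before the log is taken.
   ! (needs: use, intrinsic :: ieee_arithmetic, only: ieee_is_nan)
   if ( ieee_is_nan( z(i) ) .or. z(i) <= 0._r8 .or. z0b(i) <= 0._r8 ) then
      write(iulog,*) 'drydep_xactive: bad log argument at i = ', i, &
                     ' z = ', z(i), ' z0b = ', z0b(i)
   end if
   cvarb  = vonkar/log( z(i)/z0b(i) )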

In parallel, I think we should also consider @skamaroc's point that the time integration of CLUBB's eddy diffusivity may be producing oscillations that do not play well with the MPAS solver. On our end, I suggest we continue with the other action item from last week's meeting:

And then we determined that the RHS variable is all NaNs in clubb's scalar mixing subroutine, and so some further debugging there would be helpful too (perhaps as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates?