E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
343 stars 352 forks source link

Divide by zero on KNL in lapack/blas with many F compsets (OK without -fpe0) #1183

Closed ndkeen closed 1 year ago

ndkeen commented 7 years ago

In debug, we throw the -fpe0 flag for intel builds which will catch certain floating-point exceptions, including divide by zero. I've seen this for a while, but my solution has been to simply remove this flag and carry on. In past work, I've seen this flag cause problems within a solver and so we would disable it from within the code before calling solvers. However, maybe someone already knows about this, wants to know, or has a fix.

It looks like it happens with all tests like this, here is one example: SMS_D_Ld1.ne16_ne16.FC5ATMMOD.cori-knl_intel

60: forrtl: error (73): floating divide by zero
60: Image              PC                Routine            Line        Source
...
60: acme.exe           00000000007D0182  atm_comp_mct_mp_a         341  atm_comp_mct.F90
60: acme.exe           000000000044C5DC  component_mod_mp_         227  component_mod.F90
60: acme.exe           0000000000428DE3  Unknown               Unknown  Unknown
60: acme.exe           0000000000443508  MAIN__                     62  cesm_driver.F90
60: acme.exe (deleted  000000000040A8DE  Unknown               Unknown  Unknown
mt5555 commented 7 years ago

in the latest ACME master, line 341, atm_comp_mct.F90 is: call cam_run1 ( cam_in, cam_out )

so the floating point divide by zero must be lower down in the call stack

ndkeen commented 7 years ago

Yea, Erich pointed out this as well -- I omitted many lines that just said Unknown where the ... is. At one pointed, I recall tracking this down further and found that it was in a lapack/scalapack call. I no longer remember how I did that. I will try to update with more information. I might have used a debugger. Just to be clear, with -fpe0, all of the F cases I've run debug will stop for me, while removing this gets around this. And this is happening on cori-knl, but I'm pretty sure it also happens on cori-haswell. That is, I think the issue is happening with Intel v17.

ndkeen commented 7 years ago

OK, I ran another one and it looks like some tasks are pointing directly to lapack call.

56: Image              PC                Routine            Line        Source             
56: acme.exe           000000000A461F21  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000A46005B  Unknown               Unknown  Unknown
56: acme.exe           000000000A40E9D4  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000A40E7E6  Unknown               Unknown  Unknown
56: acme.exe           000000000A38CF06  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000A3990D9  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000A045DA0  Unknown               Unknown  Unknown
56: acme.exe           00000000088F91B8  Unknown               Unknown  Unknown
56: acme.exe           00000000085B7DF3  Unknown               Unknown  Unknown
56: acme.exe           0000000008477C02  Unknown               Unknown  Unknown
56: acme.exe (deleted  00000000083C19DE  Unknown               Unknown  Unknown
56: acme.exe           0000000003A0AAC5  lapack_wrap_mp_tr  
56:        269  lapack_wrap.F90
56: acme.exe (deleted  00000000036825CD  Unknown               Unknown  Unknown
56: acme.exe           000000000367DE98  Unknown               Unknown  Unknown
56: acme.exe           000000000363DB01  Unknown               Unknown  Unknown
56: acme.exe (deleted  00000000020E5541  Unknown               Unknown  Unknown
56: acme.exe (deleted  0000000000CF459C  Unknown               Unknown  Unknown
56: acme.exe           0000000000CCEB96  Unknown               Unknown  Unknown
56: acme.exe (deleted  00000000008008B4  Unknown               Unknown  Unknown
56: acme.exe (deleted  00000000007D0182  Unknown               Unknown  Unknown
56: acme.exe           000000000044C5DC  Unknown               Unknown  Unknown
56: acme.exe           0000000000428DE3  Unknown               Unknown  Unknown
56: acme.exe           0000000000443508  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000040A8DE  Unknown               Unknown  Unknown
56: acme.exe           000000000A47EBF0  Unknown               Unknown  Unknown
56: acme.exe
56:            000000000040A7C7  Unknown               Unknown  Unknown

which is:

      call dgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  & 
                  rhs, ndim, info )
ndkeen commented 7 years ago

I verified that cori-haswell with Intel v17 is OK with this. It's only hitting a floating-point issue on KNL's. As stated, I suspect the issue is within the low-level solver.

ndkeen commented 7 years ago

I also verified that simply backing intel compiler version to 16 does not make this go away.

ndkeen commented 7 years ago

Here is another div-by-zero that was caught with Intels -fpe0 on Cori. This one has more information.

13: forrtl: error (73): floating divide by zero
13: Image              PC                Routine            Line        Source             
13: acme.exe           000000000A113D21  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000A111E5B  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000A0C07D4  Unknown               Unknown  Unknown
13: acme.exe           000000000A0C05E6  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000A03ED06  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000A04AED9  Unknown               Unknown  Unknown
13: acme.exe (deleted  0000000009CF7BA0  Unknown               Unknown  Unknown
13: acme.exe (deleted  00000000085AAFB8  Unknown               Unknown  Unknown
13: acme.exe           0000000008269BF3  Unknown               Unknown  Unknown
13: acme.exe           0000000008129952  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000807372E  Unknown               Unknown  Unknown
13: acme.exe           000000000384214D  lapack_wrap_mp_tr  
13:        269  lapack_wrap.F90
13: acme.exe           000000000354400D  advance_windm_eds        1141  advance_windm_edsclrm_module.F90
13: acme.exe           000000000353F8D8  advance_windm_eds         486  advance_windm_edsclrm_module.F90
13: acme.exe           00000000034FF541  advance_clubb_cor        1940  advance_clubb_core_module.F90
13: acme.exe           00000000020C8E99  Unknown               Unknown  Unknown
13: acme.exe           0000000000CDCAA0  physpkg_mp_tphysb        2362  physpkg.F90
13: acme.exe           0000000000CB709A  physpkg_mp_phys_r        1008  physpkg.F90
13: acme.exe (deleted  00000000008008B4  Unknown               Unknown  Unknown
13: acme.exe (deleted  00000000007D0182  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000044C5DC  Unknown               Unknown  Unknown
13: acme.exe           0000000000428DE3  Unknown               Unknown  Unknown
13: acme.exe           0000000000443508  MAIN__                     62  cesm_driver.F90
13: acme.exe           000000000040A8DE  Unknown               Unknown  Unkn
13: own
13: acme.exe           000000000A1309F0  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000040A7C7  Unknown               Unknown  Unknown

/global/cscratch1/sd/ndk/acme_scratch/mndk-knl/SMS_D_Ln5.ne16_ne16.FC5AV1C.cori-knl_intel.20170116_132036

ndkeen commented 7 years ago

I don't seem to hit this issue on the haswell nodes, only the KNL nodes. So I will look closer at the external libs I am linking in to see if that could be a culprit.

For the SMS_D_Ln5.ne16_ne16.F1850C5AV1C-04.cori-knl_intel problem, I rebuilt using craype-mic-knl module and then linked with a simple -mkl instead of the current way we link in MKL. This was just a test to make sure I'm linking with correct/best MKL, although I've been reassured the way we are doing it is also fine. I still get same divide-by-zero.

Worked with Stephen Leak at NERSC on this yesterday. We tried a few things, but no solutions yet. The test will work with GNU, but a) this forces the use of libsci, not MKL and b) apparently our GNU builds to not throw any floating-point trap flags. I tried adding one (-ffpe-trap=zero) and didn't get anything, however, I would want to make sure it's doing what I want.

ndkeen commented 7 years ago

It was suggested that I try the executable that we build for haswell nodes on the KNL nodes. Easy test to learn more about the issue.

I just did this (simply copying an executable that was built/run successfully on haswell nodes over to my bld/ directory) and I see the same divide-by-zero error.

I've tried several other minor experiments, without a good solution. For example, I verified I can build optimized, add the -fpe0 flag, and still hit the same error. I've also tried other MKL linking methods, all with same result. I've used this tool to ensure we have the right flags and tinker with them. https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

ndkeen commented 7 years ago

I exchanged emails with @vlarson and he suggested I try to change the level of debug in the code. Change two clubb_at_least_debug_level( 2 ) to be clubb_at_least_debug_level( 0 ) in advance_clubb_core_module.F90 and then make a similar change (going from 1 to 0) in advance_windm_edsclrm_module.F90. I still saw the same behavior when compiled with -fpe0 (divide by zero error without addl info). I then recompiled the ATM without the -fpe0 flag, the code completed, and I don't see any addl warning messages in the log files.

Just a snip of the code where it is stopping (it's calling tridag_solve):

   ! Solve tridiagonal system for xm.                                                                                                             
   if ( l_stats_samp .and. ixm_matrix_condt_num > 0 ) then
     call tridag_solvex &
          ( "windm_edsclrm", gr%nz, nrhs, &                            ! Intent(in)                                                               
            lhs(kp1_tdiag,:), lhs(k_tdiag,:), lhs(km1_tdiag,:), rhs, & ! Intent(inout)                                                            
            solution, rcond, err_code )                                ! Intent(out)                                                              

     ! Est. of the condition number of the variance LHS matrix                                                                                    
     call stat_update_var_pt( ixm_matrix_condt_num, 1, 1.0_core_rknd/rcond, &  ! Intent(in)                                                       
                              stats_sfc )                                      ! Intent(inout)                                                    
   else

     call tridag_solve( "windm_edsclrm", gr%nz, nrhs, &                             ! In                                                          
                        lhs(kp1_tdiag,:),  lhs(k_tdiag,:), lhs(km1_tdiag,:), rhs, & ! Inout                                                       
                        solution, err_code )                                        ! Out                                                         
   end if

And in subroutine tridag_solve of lapack_wrap.F90, This is where it leaves ACME, presumably goes into lapack and hits divide-by-zero

   if ( kind( diag(1) ) == dp ) then
     call dgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  &   ! ndk line where we leave ACME and get divide-by-zero with intel on KNL     
                 rhs, ndim, info )

   else if ( kind( diag(1) ) == sp ) then
     call sgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  &
                 rhs, ndim, info )

   else
…
ndkeen commented 7 years ago

I was able to ask this question directly to Martyn Corden (Intel Compiler guy) who had some suggestions and gave us a possible theory as to why this might happen only on KNL's. In my mind, his solution is the same as what I had originally suggested, which is to find a way to turn off floating-point exception trapping for math library calls such as lapack. I did try one of his suggestions to give all fortran builds "-fpe-all=0" (replacing "-fpe0") and then for the one fortran file that makes the offending call, compile without this flag (did this using Depends.*). And this "worked" -- ie code is running well beyond where it was stopping. However, it seems to be running painfully slow and therefore may not be the ultimate solution. Next I would like to try unmasking floating-point exceptions JUST for certain calls using the API. I have examples of this for C/C++, but not fortran.

This is basically suggesting that there may not be a problem with our code or the data going into the LAPACK call. We just have to find a work-around if we want to continue using -fpe0 for DEBUG builds.

I'm pasting in the email from Martyn.

Noel,
              It sounds like a divide-by-zero is generated inside LAPACK, probably by the vectorizer, e.g. from code something like this:
   DO J=1,N
      IF(A(J).GT.0.)  B(J) = 1.0/A(J)
   ENDDO

The vectorizer assumes the default floating-point environment, in which floating-point exceptions are masked.
It may choose to calculate 1./A(J) for all values of J, but store into B(J) only for those values of J for which A(J)>0. So any values A(J)=0. might cause a divide-by-zero exception if exceptions are unmasked.
-fpe0 causes exceptions to be unmasked once at the start of the main program. They remain unmasked for the entire application unless you make a call to mask them again.

You say this happens only on KNL. A perhaps more likely variation is 
   DO J=1,N
       B(J) = 1.0/A(J)        ! double precision
   ENDDO

Suppose N=1607, for example. Iterations 1-1600 will be performed in a vectorized kernel.
Iterations 1601-1607 will be performed in a separate remainder loop. On KNL, the compiler may decide this is worth vectorizing. It may perform the division for values of J from 1601 to 1608, (even if it is above the upper array bound), but mask off the last element in the store to B. If the memory location corresponding to A(1608) contains zero, this may result in a divide by zero exception.

One solution would be to compile LAPACK with -fp-speculation safe. This would work for user code, but LAPACK is in MKL and you don't have the source.

The other solution is for exceptions to be masked when you call LAPACK, but unmasked for the rest of the application you are trying to debug. The quick and dirty way to do this is instead of using -fpe0, (which only affects the Fortran main program, nothing else), to use -fpe-all=0  for all source files except those that contain calls to LAPACK. That unmasks exceptions at the star of each routine in those source files, and masks them again at the exit from each routine. This might have some impact on performance, but that probably wouldn't matter if you were debugging.

The more precise way to do this is with explicit calls to the IEEE_EXCEPTIONS module of Fortran2003.
You can still build with -fpe0, so that the application starts with exceptions unmasked.
Before each call to LAPACK, you mask them, and after the return, you unmask them again.

Here's an example, (with an explicit divide by zero, not a vectorizer speculation):

program test
 integer, parameter :: N=16
 real, dimension(N) :: x=(/(mod(i,8),i=1,N)/), y

 call sub1(x,y,N)
 print *, 'Y = ', y(4:N:4)
 call sub2(x,y,N)
 print *, 'Y = ', y(4:N:4)
end program test

subroutine sub1(x,y,N)
 use, intrinsic :: ieee_arithmetic
 real, dimension(N) :: x,y
 call ieee_set_halting_mode(ieee_usual, .false.)  
 call sub2(x,y,N)
 call ieee_set_halting_mode(ieee_usual, .true.)
end

subroutine sub2(x,y,N)
 real, dimension(N) :: x,y
 y = 1. / x  
end

ifort -fpe0 -xmic-avx512 -fno-inline -traceback test_div.f90; ./a.out
Y =   0.2500000           Infinity  0.2500000           Infinity
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
a.out              000000000047D301  Unknown               Unknown  Unknown
a.out              000000000047B43B  Unknown               Unknown  Unknown
a.out              0000000000448284  Unknown               Unknown  Unknown
a.out              0000000000448096  Unknown               Unknown  Unknown
a.out              0000000000427839  Unknown               Unknown  Unknown
a.out              0000000000403B39  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F3011615100  Unknown               Unknown  Unknown
a.out              0000000000402F52  sub2_..0                   21  test_div.f90
a.out              0000000000402B7D  MAIN__                      7  test_div.f90
a.out              0000000000402A5E  Unknown               Unknown  Unknown
libc-2.17.so       00007F3011062B15  __libc_start_main     Unknown  Unknown
a.out              0000000000402969  Unknown               Unknown  Unknown
Aborted (core dumped)

The divide-by-zero in the first call to sub2 is protected, but the second is not.

Actually, I just came up with a slightly better example, which I am attaching.
In this case, the exception is due to speculation and it can be suppressed by -fp-speculation safe.
Note that -fpe0 itself sets -speculation safe, to avoid this issue, so that if your whole application is compiled with -fp-speculation safe, there is no problem. Your problem arises because your user code is compiled with -fpe0 but LAPACK inside MKL is not.
  Build and run the attached example:

$  ifort -c -xmic-avx512 -traceback sub2.f90
$  ifort -fpe0 -xmic-avx512 -traceback test_div0.f90 sub2.o; ./a.out
Y =   0.2500000      0.0000000E+00  0.2500000      0.0000000E+00
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
a.out              000000000047D371  Unknown               Unknown  Unknown
a.out              000000000047B4AB  Unknown               Unknown  Unknown
a.out              00000000004482F4  Unknown               Unknown  Unknown
a.out              0000000000448106  Unknown               Unknown  Unknown
a.out              00000000004278A9  Unknown               Unknown  Unknown
a.out              0000000000403BA9  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F7EB1989100  Unknown               Unknown  Unknown
a.out              0000000000402F98  sub2_                       4  sub2.f90
a.out              0000000000402C03  MAIN__                      7  test_div0.f90
a.out              0000000000402A5E  Unknown               Unknown  Unknown
libc-2.17.so       00007F7EB13D6B15  __libc_start_main     Unknown  Unknown
a.out              0000000000402969  Unknown               Unknown  Unknown
Aborted (core dumped)

You can experiment with commenting or uncommenting the calls to ieee_set_halting_mode or the compiler switches -fpe0 and -fp-speculation safe to see their effect.
ndkeen commented 7 years ago

Example I found online of how to use the ieee extensions. Testing this now. Not sure if it needs to be behind guards for certain compilers that cannot support it yet.

!http://fortran71.rssing.com/chan-21772141/all_p35.html 
! ftn -fpp -fpe0 ieee_div_by_zero.f90
program test

   use, intrinsic :: ieee_exceptions
   implicit none

   integer, parameter :: RKIND = 16

   real(RKIND) :: a, b, c
   logical     :: ieee_flag_is_raised

   a = 1.0_RKIND
   b = 0.0_RKIND
   !
   ! Division by zero allowed
   !
   call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .false.)
   c = a / b
   ieee_flag_is_raised = .false.
   call ieee_get_flag(IEEE_DIVIDE_BY_ZERO, ieee_flag_is_raised)
   print *, "ieee_flag_is_raised=", ieee_flag_is_raised
   print *, "c=", c
   !
   ! Division by zero forbidden
   !
   call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .true.)
   c = a / b
   call ieee_get_flag(IEEE_DIVIDE_BY_ZERO, ieee_flag_is_raised)
   print *, "ieee_flag_is_raised=", ieee_flag_is_raised
   print *, "c=", c

end program test
ndkeen commented 7 years ago

Looks like this is working in real acme code. Will make a PR.

singhbalwinder commented 7 years ago

This is great! Thanks Noel for sorting this out.

I have seen issues like this before where libraries were compiled using different flags than the source code which is using them. In most of those cases, the compiler either won't allow me to use those flags or issue warnings about it. Were there any warnings regarding this in the atm.bldlog.* file?

I must be missing something as it seems like the proposed solution of setting IEEE_DIVIDE_BY_ZERO will cause the compiler to skip legitimate divide by zeros also. Is that the case?

ndkeen commented 7 years ago

Yes, all this is doing is telling that particular LAPACK call (which is part of Intels MKL library) to NOT catch any floating-point issues (in DEBUG mode). We could make the case that the MKL library should behave better, but we don't have control over that code (or how it is built). If you read what Martyn was saying, it makes sense how this could only happen on KNL's and not actually be a real math issue. Note the library still has checks, but would not fail on this.

We could just add this for cori-knl/intel. Or let the tests continue to fail in DEBUG until another solution is found.

singhbalwinder commented 7 years ago

Thanks. I re-read the example above and you are right. I think doing this only for Cori-KNL/Intel makes sense to me. If we see this issue on other platforms, we can look into it further.

ndkeen commented 7 years ago

It also looks like I can use a slightly different formulation and get the same results. I'm asking what the difference is. It may be that this is better.

use, intrinsic :: ieee_arithmetic
…
     call ieee_set_halting_mode(ieee_usual, .false.)  
     call dgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  &   ! ndk line where we leave ACME and get divide-by-zero with intel on KNL
                 rhs, ndim, info )
     call ieee_set_halting_mode(ieee_usual, .true.)  
…

(whoops, i tested incorrectly and trying again -- i turned it off, then off again, instead of back on)

mt5555 commented 7 years ago

should this be inside an #ifdef DEBUG?

in non-debug mode, no reason to call extra functions?

ndkeen commented 7 years ago

Yes, you are right -- it should at least be behind #ifdef DEBUG. But should it be behind more is my question (ie just for certain compilers/machines),

ndkeen commented 7 years ago

How does this sort of code look?

#ifndef NDEBUG
#ifdef ARCH_MIC_KNL
      ! when floating-point exceptions are turned on, this call was failing with a div-by-zero on KNL with Intel/MKL. Solution 
      ! was to turn off exceptions only here at this call (and only for machine with ARCH_MIC_KNL defined) (github 1183)
      call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .false.) ! Turn off stopping on div-by-zero only
      !call ieee_set_halting_mode(ieee_usual, .false.)  ! Turn off stopping on all floating-point exceptions
#endif
#endif

I add -DARCH_MIC_KNL to the CPPDEFS for cori-knl. In this way, it only happens for DEBUG builds on this machine. I could also test for the Intel compiler? I did test that these routines work for GNU and they do. I think GNU 5.1 and above (at least) have the ieee_instrinsics.

I have tested this on acme_developer and all of the affected tests pass.

I also exchanged more emails with Martyn which I will paste here again just to keep it all in one place.


Hi Noel,
            The difference between the examples was that one masked only divide-by-zero exceptions, whereas the other masked also floating-invalid and overflow exceptions in addition. So it would also protect against things like log(-1.)   or 0./0. 

            No, MKL is not misbehaving. It, via the compiler, assumes the default floating-point environment. If you change the default floating-point environment, e.g. by changing the rounding mode or by unmasking exceptions, you may get unexpected results.
This applies to all platforms, not just KNL, even though it only happened on KNL for your particular case.
If MKL had to save, set and then restore the floating-point environment for every call, there could be a significant performance impact. 

        Masking exceptions before a library call and re-enabling afterwards sounds like something you might put under a DEBUG flag or macro?

I replied:

OK, your explanation makes sense.  Thanks.
And I have a fix in place that will turn off this FP checking for that one call, for this one machine/compiler.

However, I still have a nagging concern. Which others on the team have also hinted at.
Is there indeed a “real” division by zero or not?  (that our data being sent to that algorithm caused)
I’m wondering if perhaps there should not be, yet the way in which MKL behaves, it is
saying that there is one.  Now, if there really is a div-by-zero, then by all means we
should try to debug and fix.  But this is a test that is part of the regression tests
and is run by many different machines/compliers, etc.

I’ve looked at the nominal source code for the dgtsv routine and I see checks in place for
some divisions, but not for others.

And he said:

I would say that there is no evidence of a "real" divide by zero inside MKL.
You can't distinguish directly between a "real" or a "speculated" divide by zero without recompiling that part of MKL.
However, speculated divides-by-zero are expected whereas real divides-by-zero are not.
If a real divide-by-zero were occurring as part of the main computation, I would expect that MKL  would return an error flag (such as for a singular matrix) and/or the results would contain infinities or NaNs. Once you have an infinity as an intermediate result, it's hard to make it go away, even multiplying by zero only makes it a NaN.

There are Fortran 2003 intrinsic functions such as ieee_is_nan() and ieee_is_finite() to let you test the returned data.
You should be able to adapt my little test program to try them out.
singhbalwinder commented 7 years ago

It looks good to me! I think ACME already needs GNU 5.0 or later version (https://acme-climate.atlassian.net/wiki/display/SE/Configuration+Management?focusedCommentId=7373315#comment-7373315) so we should be good there.

amametjanov commented 6 years ago

Okay, I see this issue being referenced in the source, so resurrecting it. In non-debug (release) mode, and additional compiler flags

ifeq ($(DEBUG), FALSE)
   FFLAGS +=  -O2 -debug minimal -qno-opt-dynamic-align -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created -check format -check output_conversion
   CFLAGS +=  -O2 -debug minimal
endif

ne120-wcycl run (SMS_Ld5.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.theta_intel.cam-cosplite with COSP and high-frequency IO) got into:

(seq_domain_areafactinit) : min/max drv2mdl   0.999999923354153       1.00000007509469    areafact_o_OCN
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
e3sm.exe           000000000227ACE6  lapack_wrap_mp_tr         282  lapack_wrap.F90
e3sm.exe           000000000210D56E  advance_windm_eds        1141  advance_windm_edsclrm_module.F90
e3sm.exe           000000000210623E  advance_windm_eds         486  advance_windm_edsclrm_module.F90
e3sm.exe           00000000020EBA88  advance_clubb_cor        2019  advance_clubb_core_module.F90
e3sm.exe           00000000016ABC08  clubb_intr_mp_clu        1855  clubb_intr.F90
e3sm.exe           000000000084D44A  physpkg_mp_tphysb        2461  physpkg.F90
e3sm.exe           0000000000846C2B  physpkg_mp_phys_r        1027  physpkg.F90
e3sm.exe           00000000005FC48D  cam_comp_mp_cam_r         250  cam_comp.F90
e3sm.exe           00000000005E8B00  atm_comp_mct_mp_a         348  atm_comp_mct.F90
e3sm.exe           0000000000439581  component_mod_mp_         267  component_mod.F90
e3sm.exe           0000000000423FF5  cime_comp_mod_mp_        1958  cime_comp_mod.F90
e3sm.exe           0000000000430BFC  MAIN__                     92  cime_driver.F90

And the surrounding source is

269 !-----------------------------------------------------------------------
270 !       *** The LAPACK Routine ***
271 !       SUBROUTINE DGTSV( N, NRHS, DL, D, DU, B, LDB, INFO )
272 !-----------------------------------------------------------------------
273 
274     if ( kind( diag(1) ) == dp ) then
275 #ifndef NDEBUG
276 #if defined(ARCH_MIC_KNL) && defined(CPRINTEL)
277       ! when floating-point exceptions are turned on, this call was failing with a div-by-zero on KNL with Intel/MKL. Solution 
278       ! was to turn off exceptions only here at this call (and only for machine with ARCH_MIC_KNL defined) (github 1183)
279       call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .false.) ! Turn off stopping on div-by-zero only
280 #endif
281 #endif
282       call dgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  &
283                   rhs, ndim, info )
284 #ifndef NDEBUG
285 #if defined(ARCH_MIC_KNL) && defined(CPRINTEL)
286       call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .true.) ! Turn back on stopping on div-by-zero only
287 #endif
288 #endif

So we could be getting NaNs, because of this issue. I am commenting out #ifndef NDEBUG and continuing the run to see if NaNs disappear! (NDEBUG is defined in release mode and ARCH_MIC_KNL and CPRINTEL are also defined).

ndkeen commented 6 years ago

I think the changes in the PR that addresses this issue should be put back the way they were. I'm seeing odd behavior when we both add -fpe0, or remove -fpe0 and remove the #ifndef NDEBUG.

ndkeen commented 4 years ago

I think we can close this. Certainly we run DEBUG F cases frequently.