E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Attempt to raise negative value to fraction power in shoc.F90 `forrtl: error (65): floating invalid` with Intel v19 on cori-knl #218

Closed ndkeen closed 4 years ago

ndkeen commented 4 years ago

On cori-knl, the test ERS_D.ne11_ne11.FSCREAM-LR --compiler=intel19 is failing for me. I used the scream repo from Jan 31st.

146: forrtl: error (65): floating invalid
146: Image              PC                Routine            Line        Source             
146: e3sm.exe           000000000A0708D4  Unknown               Unknown  Unknown
146: e3sm.exe           0000000009908640  Unknown               Unknown  Unknown
146: e3sm.exe           000000000A140DEA  Unknown               Unknown  Unknown
146: e3sm.exe           000000000A14091D  Unknown               Unknown  Unknown
146: e3sm.exe           0000000002C20F86  shoc_mp_shoc_leng        1859  shoc.F90
146: e3sm.exe           0000000002BE4275  shoc_mp_shoc_main         262  shoc.F90
146: e3sm.exe           00000000018182A4  shoc_intr_mp_shoc         893  shoc_intr.F90
146: e3sm.exe           00000000017121B3  physpkg_mp_tphysb        2552  physpkg.F90
146: e3sm.exe           00000000016E54C1  physpkg_mp_phys_r        1057  physpkg.F90
146: e3sm.exe           000000000982C033  Unknown               Unknown  Unknown
146: e3sm.exe           00000000097D62DA  Unknown               Unknown  Unknown
146: e3sm.exe           00000000097D7B86  Unknown               Unknown  Unknown
146: e3sm.exe           00000000097A32D5  Unknown               Unknown  Unknown
146: e3sm.exe           00000000016E3936  physpkg_mp_phys_r        1043  physpkg.F90
146: e3sm.exe           000000000085A4B8  cam_comp_mp_cam_r         250  cam_comp.F90
146: e3sm.exe           000000000082673F  atm_comp_mct_mp_a         393  atm_comp_mct.F90
146: e3sm.exe           00000000004537BE  component_mod_mp_         257  component_mod.F90
146: e3sm.exe           0000000000423186  cime_comp_mod_mp_        2196  cime_comp_mod.F90
146: e3sm.exe           000000000044A372  MAIN__                    122  cime_driver.F90

I added the following write statement to see the negative value:

        ! Look for cloud base in this column
        if (cldin(i,k) .gt. cldthresh .and. cldin(i,k+1) .le. cldthresh) then
          ku=k
          write(*,'(a,i10,i10,es20.10)') "ndk shoc.F90 i,k, conv_vel=", i,k,conv_vel(i,k)
          conv_var=conv_vel(i,k)**(1._r8/3._r8)
        endif

146: ndk shoc.F90 i,k, conv_vel=         9        15    0.0000000000E+00
146: ndk shoc.F90 i,k, conv_vel=         9        20    0.0000000000E+00
146: ndk shoc.F90 i,k, conv_vel=         9        25    0.0000000000E+00
146: ndk shoc.F90 i,k, conv_vel=         6        14    2.3499488324E-03
146: ndk shoc.F90 i,k, conv_vel=         9        18   -1.4371556690E-02
146: ndk shoc.F90 i,k, conv_vel=         6         4    7.9753508189E-02
146: ndk shoc.F90 i,k, conv_vel=         6         6    1.3202964678E-01
146: ndk shoc.F90 i,k, conv_vel=         6         7    1.3003523103E-01
146: ndk shoc.F90 i,k, conv_vel=         1        38    6.5953228847E+00
146: ndk shoc.F90 i,k, conv_vel=         6         8    1.1537712919E-01
146: ndk shoc.F90 i,k, conv_vel=         1        39    1.8879990516E+01
146: ndk shoc.F90 i,k, conv_vel=         6         9    1.0447548906E-01

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/s04-jan31/ERS_D.ne11_ne11.FSCREAM-LR.cori-knl_intel19.ri19

The run also fails in the same way with the GNU compiler in DEBUG mode (after a compiler flag modification discussed here: https://github.com/E3SM-Project/E3SM/issues/3270).

This test does not fail with the current E3SM default Intel compiler version (18).

The ne4 resolution tests are all passing, but ne11, ne16, and ne30 fail in this way.

PeterCaldwell commented 4 years ago

Hey Noel, thanks for checking into this. Could you try with the current SCREAM master to make sure this didn't get fixed? I'm confused why you're having this problem since Chris Terai ran this test without it crashing...

crterai commented 4 years ago

Noel and Peter,

The ERS test that ran for me had the form ERS.ne11_ne11.FSCREAM-LR.cori-knl_intel. It ran with a master about a week old (git hash c4c0217ed) and with the most recent version of master (git hash 6e073ad93).

It looks like there are only a couple differences between our tests. I don't know what the '_D' option at the end of 'ERS' does but that's one difference. Noel - is there a reason you had used ERS_D rather than just ERS? I also didn't specify which intel compiler to use, I just set the machine_compiler to cori-knl_intel. I'm not sure whether this could cause the error that Noel saw though.

For your reference, the location of my ne11_ne11 tests are: /global/cscratch1/sd/terai/acme_scratch/cori-knl/ERS.ne11_ne11.FSCREAM-LR.cori-knl_intel.20200203_082409_ko6k4g and /global/cscratch1/sd/terai/acme_scratch/cori-knl/ERS.ne11_ne11.FSCREAM-LR.cori-knl_intel.20200130_141227_4wyg32

singhbalwinder commented 4 years ago

The _D flag turns on all debug flags for that compiler and machine. Errors like the one above are caught when these flags are on (specifically fpe0 in this case). This could easily be one of the reasons the test is blowing up.
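To illustrate why the error only shows up with trapping enabled, here is a minimal standalone sketch (not taken from shoc.F90; the program and variable names are made up for illustration). It raises a negative number to the 1/3 power and then inspects the IEEE invalid flag. Without trapping, the operation would be expected to quietly produce a NaN and the run continues; with a trapping option such as Intel's -fpe0, the same operation aborts the run as in the traceback above.

```fortran
program invalid_flag_demo
  ! Hypothetical demo, not part of SCREAM/E3SM.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  use, intrinsic :: ieee_exceptions, only: ieee_invalid, ieee_get_flag, ieee_set_flag
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  ! volatile keeps the compiler from folding the power at compile time
  real(r8), volatile :: base
  real(r8) :: res
  logical :: raised

  ! Negative value reported for conv_vel at i=9, k=18 in the log above
  base = -1.4371556690e-02_r8

  ! Clear the invalid flag, do the fractional power, then read the flag back
  call ieee_set_flag(ieee_invalid, .false.)
  res = base**(1._r8/3._r8)
  call ieee_get_flag(ieee_invalid, raised)

  write(*,*) 'result is NaN:       ', ieee_is_nan(res)
  write(*,*) 'invalid flag raised: ', raised
end program invalid_flag_demo
```

Building this without any floating-point trapping options should let it run to completion and report the flag; building it with trapping on invalid operations should make it stop at the power operation instead.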

ndkeen commented 4 years ago

I verified the same failure with the scream repo as of today (Feb 3rd).

create_test ERS_D.ne11_ne11.FSCREAM-LR --compiler=intel19

After further debugging, I see that the variable conv_vel(i,k) turns negative when wthv_sec(i,k) is negative, if that helps.
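One possible way to avoid the invalid operation is to clamp the value at zero before taking the cube root. The standalone sketch below only illustrates that idea, using three of the conv_vel values printed earlier; it is an assumption about a workaround, not the fix adopted in shoc.F90, and whether zero is the physically right answer for a negative conv_vel is a question for the SHOC developers.

```fortran
program cube_root_guard
  ! Hypothetical demo, not part of SCREAM/E3SM.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: third = 1._r8/3._r8
  real(r8) :: conv_vel(3), conv_var
  integer :: n

  ! Sample values copied from the debug output: zero, negative, positive
  conv_vel = (/ 0.0_r8, -1.4371556690e-02_r8, 6.5953228847_r8 /)

  do n = 1, size(conv_vel)
    ! Clamp at zero so the base of the fractional power is never negative
    conv_var = max(0._r8, conv_vel(n))**third
    write(*,'(a,es20.10,a,es20.10)') 'conv_vel=', conv_vel(n), '  conv_var=', conv_var
  end do
end program cube_root_guard
```

With the clamp in place, the negative entry yields a conv_var of zero rather than a NaN, so the loop completes even when trapping on invalid operations is enabled.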

PeterCaldwell commented 4 years ago

@bogensch - it looks like Noel is getting negative fractional power errors in SHOC. Can you look into this? I've asked Noel to create a simple test case for you to run. Apparently this issue is only found using the gnu and intel19 compilers - not the standard... so you'll probably have to debug this on NERSC(?).

ndkeen commented 4 years ago

To reproduce:

cd cime/scripts
create_test ERS_D.ne11_ne11.FSCREAM-LR --compiler=intel19

This is an exact-restart test, so I suspect a simple smoke test will also fail the same way, i.e.:

create_test SMS_D.ne11_ne11.FSCREAM-LR --compiler=intel19

PeterCaldwell commented 4 years ago

Ah yes... I should have just seen that in your comment above. It might be easier to debug from a create_case... submission rather than create_test kind of line, but I guess the other Peter's clever enough to make that change if needed.

ndkeen commented 4 years ago

I verified that SMS_D.ne11_ne11.FSCREAM-LR --compiler=intel19 also fails.

As does SMS_D_PMx1.ne11_ne11.FSCREAM-LR --compiler=intel19. (runs with only 1 thread)

I also have a simple launch script (that includes create_newcase command) if that helps.

bogensch commented 4 years ago

I reproduced the error too. I will look into it.
