ndkeen closed this issue 4 years ago
Hey Noel, thanks for checking into this. Could you try with the current SCREAM master to make sure this didn't get fixed? I'm confused why you're having this problem since Chris Terai ran this test without it crashing...
Noel and Peter,
The ERS test that ran for me had the form: ERS.ne11_ne11.FSCREAM-LR.cori-knl_intel. It ran with a master about a week old (git hash c4c0217ed) and with the most recent version of master (git hash 6e073ad93).
It looks like there are only a couple differences between our tests. I don't know what the '_D' option at the end of 'ERS' does but that's one difference. Noel - is there a reason you had used ERS_D rather than just ERS? I also didn't specify which intel compiler to use, I just set the machine_compiler to cori-knl_intel. I'm not sure whether this could cause the error that Noel saw though.
For your reference, the locations of my ne11_ne11 tests are:
/global/cscratch1/sd/terai/acme_scratch/cori-knl/ERS.ne11_ne11.FSCREAM-LR.cori-knl_intel.20200203_082409_ko6k4g
and
/global/cscratch1/sd/terai/acme_scratch/cori-knl/ERS.ne11_ne11.FSCREAM-LR.cori-knl_intel.20200130_141227_4wyg32
The _D flag turns on all debug flags for that compiler and machine. Errors such as the one above are caught when we turn on these flags (specifically fpe0 in this case). This could easily be one of the reasons the test is blowing up.
I verified the same failure with the scream repo as of today (Feb 3rd).
create_test ERS_D.ne11_ne11.FSCREAM-LR --compiler=intel19
On further debugging, I see that the variable conv_vel(i,k) turns negative when wthv_sec(i,k) is negative, if that helps.
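For anyone following along, here is a minimal Python sketch of the failure mode being described: raising a negative value to a fractional power is invalid in real arithmetic, which is what the debug floating-point trap catches. This is not the code in shoc.F90; the function names and the clip-to-zero guard are assumptions made only for illustration, and the right fix in SHOC may well be different.

```python
import math


def cube_root_unguarded(conv_vel):
    # math.pow raises ValueError for a negative base with a non-integer
    # exponent, loosely analogous to the "floating invalid" trap in a debug build.
    return math.pow(conv_vel, 1.0 / 3.0)


def cube_root_guarded(conv_vel):
    # Hypothetical guard (an assumption, not the SHOC fix): clip negative
    # values to zero before taking the fractional power.
    return math.pow(max(conv_vel, 0.0), 1.0 / 3.0)


if __name__ == "__main__":
    for value in (8.0, -8.0):
        try:
            print(value, "unguarded:", cube_root_unguarded(value))
        except ValueError as err:
            print(value, "unguarded raised:", err)
        print(value, "guarded:", cube_root_guarded(value))
```

Whether clipping is physically appropriate here is a separate question; the sketch only shows why a negative conv_vel trips the check.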
@bogensch - it looks like Noel is getting negative fractional power errors in SHOC. Can you look into this? I've asked Noel to create a simple test case for you to run. Apparently this issue is only found using the gnu and intel19 compilers - not the standard... so you'll probably have to debug this on NERSC(?).
To reproduce:
cd cime/scripts
create_test ERS_D.ne11_ne11.FSCREAM-LR --compiler=intel19
This is the exact restart test, so I suspect a simple smoke test will also fail in the same way, i.e.:
create_test SMS_D.ne11_ne11.FSCREAM-LR --compiler=intel19
Ah yes... I should have just seen that in your comment above. It might be easier to debug from a create_case... submission rather than create_test kind of line, but I guess the other Peter's clever enough to make that change if needed.
I verified that SMS_D.ne11_ne11.FSCREAM-LR --compiler=intel19 also fails.
As does SMS_D_PMx1.ne11_ne11.FSCREAM-LR --compiler=intel19 (runs with only 1 thread).
I also have a simple launch script (that includes create_newcase command) if that helps.
I reproduced the error too. I will look into it.
(Sent by email in reply to noel's comment of Monday, February 3, 2020 at 5:56 PM on E3SM-Project/scream issue #218, "Attempt to raise negative value to fraction power in shoc.F90 forrtl: error (65): floating invalid with Intel v19 on cori-knl".)
On cori-knl, the test ERS_D.ne11_ne11.FSCREAM-LR --compiler=intel19 is failing for me. I used the scream repo from Jan 31st. The run also fails in the same way with the GNU compiler in DEBUG (after a compiler flag modification discussed here: https://github.com/E3SM-Project/E3SM/issues/3270).
This test is not failing with the current E3SM default Intel compiler version (18).
The ne4 resolution tests are all passing, but ne11, ne16, and ne30 fail in this way.