E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

maint-2.0 branch appears broken on Perlmutter #6195

Closed beharrop closed 7 months ago

beharrop commented 8 months ago

The maint-2.0 branch (b583f7d) is having problems on Perlmutter (CPU nodes). Any run that restarts will fail with the following PIO error in the e3sm.log file:

PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 180) to file (./20240129.restart_test2.maint_20.elm.h0.0001-02.nc, ncid=143) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (TWS_MONTH_BEGIN, varid=205) failed (Number of subarray requests/regions=1, Size of data local to this process = 433). NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/pscratch/sd/b/beharrop/temp/e3sm_code/20240129/E3SM/externals/scorpio/src/clib/pio_darray_int.c: 395)

I tried to reproduce this with a shorter test using ./create_test ERS.ne4_oQU240.F2010, but that did not reproduce the error: ERS.ne4_oQU240.F2010.pm-cpu_intel (Overall: PASS). I have also been unable to reproduce this error on either Chrysalis or Compy. The smallest setup I've been able to come up with that reproduces the error is the following.

./create_newcase --case ${casename}  --compset F2010 --res ne4_oQU240 --walltime 00:08:00 --script-root ${casedir} --output-root ${outputdir} -mach pm-cpu

cd ${casedir}

./xmlchange STOP_N=1
./xmlchange STOP_OPTION=nmonths
./xmlchange RESUBMIT=1

./case.setup
./case.build
./case.submit

I have been using the intel compiler, and haven't tested any of the others yet. Has anyone else encountered anything like this?

ndkeen commented 8 months ago

Thanks for finding a reproducer. I can also get the same failure using ERS, but running longer: ERS_Lm3.ne4_oQU240.F2010.pm-cpu_intel. Also, maybe related, maybe not, this test completes but fails compare: ERS_Ld31.ne4_oQU240.F2010.pm-cpu_intel

So then the question is: what changed? I looked again at the changes I made to the machine config on maint-2.0 recently and I don't see anything that would cause trouble. I then tried some previous maint-2.0 checkouts -- back to July -- and all of them have this same failure pattern (fail with ERS_Lm3, fail compare with ERS_Ld31).

I then tried the same tests with a recent E3SM master and at least these 2 tests both PASS.

So I'm curious whether this ever worked on pm-cpu with maint-2.0.

ndkeen commented 8 months ago

If I run the ERS test with DEBUG, I get:

 12: forrtl: error (65): floating invalid
 12: Image              PC                Routine            Line        Source
 12: libpthread-2.31.s  000014E799803910  Unknown               Unknown  Unknown
 12: e3sm.exe           0000000004F6368D  subgridavemod_mp_        1045  subgridAveMod.F90
 12: e3sm.exe           000000000663C0DC  waterbudgetmod_mp         719  WaterBudgetMod.F90
 12: e3sm.exe           0000000004A4E29E  elm_driver_mp_elm         576  elm_driver.F90
 12: e3sm.exe           00000000049F29F0  lnd_comp_mct_mp_l         506  lnd_comp_mct.F90
 12: e3sm.exe           0000000000496175  component_mod_mp_         751  component_mod.F90
 12: e3sm.exe           00000000004583F7  cime_comp_mod_mp_        2876  cime_comp_mod.F90
 12: e3sm.exe           000000000047EB62  MAIN__                    153  cime_driver.F90
 12: e3sm.exe           000000000042342D  Unknown               Unknown  Unknown
 12: libc-2.31.so       000014E79923E24D  __libc_start_main     Unknown  Unknown

Which looks familiar -- searching GitHub, I found a similar-looking issue that claims https://github.com/E3SM-Project/E3SM/pull/5341 fixed it.

beharrop commented 8 months ago

Thanks for looking into this, @ndkeen. I can say that as of last October I was successfully using a maint-2.0-like version of the code with some of my own developments living on a branch in my E3SM fork (https://github.com/beharrop/E3SM/commit/f113954b5a986305ede13ff155816b4c51051ee4). It has the most recent SCORPIO updates for maint-2.0 (https://github.com/E3SM-Project/scorpio/commit/de0b1ca2200f62c6eb5e3fd40147965409e97123), so I would think the two would handle IO the same. I was able to run a decade of simulations one year at a time with that code. I will try running your ERS_Lm3.ne4_oQU240.F2010.pm-cpu_intel test with that code to see if it also fails.

I also know a colleague who has been running E3SMv2 (again with modifications) for an RGMA project and is getting the same error as I am. Her issue only just started last week, which tracks with what @milenaveneziani is seeing.

beharrop commented 8 months ago

Running the ERS_Lm3.ne4_oQU240.F2010.pm-cpu_intel test with my fork that was working last October produces the same failure.

ndkeen commented 8 months ago

I don't see the failure with maint-2.1, which may be closer to maint-2.0. Note that I pointed to a PR that fixed some of the NaN issues in master; however, I don't see that PR in maint-2.1 or maint-2.0, so I'm not sure that PR (which touches many files) is really the culprit here.

I have not been able to get the fail with several E3SM master checkouts I have from several months ago.

I also don't see a failure with SMS_D_Lm3.ne4_oQU240.F2010.pm-cpu_intel, but that does not write a restart. If I manually request a restart and re-run, it's also OK -- this suggests that the issue is not in writing the restarts, but likely in reading them. And not just in reading any restart, but only a restart after at least a month.

beharrop commented 8 months ago

Could it be related to the TWS_MONTH_BEGIN variable? I see that was an issue recently https://github.com/E3SM-Project/E3SM/pull/5811. It does appear that variable has NaNs in the restart files I've looked at.

ndkeen commented 8 months ago

It is most certainly having an issue with a NaN.

The routine where it fails is

  subroutine c2g_1d(bounds, carr, garr, c2l_scale_type, l2g_scale_type)
    ! !DESCRIPTION:
    ! Perfrom subgrid-average from columns to gridcells.
    ! Averaging is only done for points that are not equal to "spval".

And it looks like input array carr has a value that is NaN.
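Which would also explain why the spval screening doesn't catch it: a NaN compares as not-equal to spval, so it survives the filter and poisons the average. A minimal sketch of that logic (not the actual Fortran, just an illustration; spval is assumed here to be ELM's usual ~1e36 fill value and the column values are made up):

import numpy as np

spval = 1.0e36                              # assumed ELM special/fill value
carr = np.array([2.0, spval, np.nan, 4.0])  # column values; one is poisoned by a NaN

# Mimic c2g_1d: average only the points that are not equal to spval.
valid = carr != spval                       # NaN != spval is True, so the NaN is kept
print(np.mean(carr[valid]))                 # -> nan: the NaN propagates into the gridcell average

In a DEBUG build with floating-point trapping enabled, doing arithmetic on that NaN is presumably what raises the "floating invalid" error above.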

But that does not explain why something would have been working with maint-2.0 before, but not now. Which maybe doesn't matter, except it might be a clue to help us find the issue. And to clarify, can we devise a reproducer of something that did restart OK before? Because technically, we don't know for sure that the culprit of the error tripped by the ERS test is the same issue you see in your case.

ndkeen commented 8 months ago

I already tried adding the fix noted in PR5811 -- but that only impacts tests using that user_nl_elm (cime_config/testmods_dirs/allactive/wcprodssp/user_nl_elm), which would be those with the allactive modifier (likely fully coupled cases). It did not help the issue here, but may help some of the coupled cases (which are not currently in the e3sm_prod test suite on maint-2.0).

beharrop commented 8 months ago

Yeah, I noticed the fix for #5811 was just a swap for a corrected restart file, but that's what had me thinking that perhaps the problem is related to how the model is writing restart files with the current maint-2.0 branch. Following that chain of reasoning, I set TWS_MONTH_BEGIN to exactly 3 for all indices of the array (replacing the NaNs) to see if that would help anything (no idea if that is a reasonable value for such a variable). Still the same result.
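In case anyone wants to repeat that experiment, a minimal sketch of one way to do the patch (assuming the netCDF4 and numpy Python packages; the restart file name is a placeholder and 3.0 is an arbitrary stand-in value):

import numpy as np
from netCDF4 import Dataset

# Replace NaNs in TWS_MONTH_BEGIN of an ELM restart file (edits the file in place).
with Dataset("CASENAME.elm.r.0001-04-01-00000.nc", "r+") as nc:  # placeholder file name
    nc.set_auto_mask(False)                   # work with raw values, not masked arrays
    tws = nc.variables["TWS_MONTH_BEGIN"][:]
    tws[np.isnan(tws)] = 3.0                  # arbitrary stand-in value for the NaNs
    nc.variables["TWS_MONTH_BEGIN"][:] = tws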

Unfortunately, I haven't had any luck so far coming up with a reproducer of something that did restart OK before.

ndkeen commented 8 months ago

fwiw, I get the same error with ne30: ERS_Lm3.ne30pg2_EC30to60E2r2.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp

I also tried several tests with reduced optimization flags, to no avail.

rljacob commented 8 months ago

The NaN problem was from when we turned on debugging and NaN trapping; it found NaNs in some restart files that had to be removed. But maint-2.0 has the old debug flags, so that shouldn't matter, and Bryce wasn't doing a debug compile. Adding @dqwu since it's a SCORPIO error.

ndkeen commented 8 months ago

Rob: that still doesn't explain why users are reporting this was working before and now it does not. I can try to merge in that PR, but it touches a lot of files and I think maybe someone else should do that.

Also, maint-2.1 seems to work and I do not think it has the land PR either...

ndkeen commented 8 months ago

Trying on Chrysalis: ERS_D_Lm3.ne4_oQU240.F2010.chrysalis_intel also fails, while these two only fail compare:

ERS_Ld31.ne4_oQU240.F2010.chrysalis_intel
ERS_Lm3.ne4_oQU240.F2010.chrysalis_intel

rljacob commented 8 months ago

Try going back to c16b21f on the maint-2.0 branch. That's before the SCORPIO update.

ndkeen commented 8 months ago

I already tried a July 21st checkout of maint-2.0, with the same results.

af7e689913 2022-06-02 11:09:37 -0600 Merge pull request #5005 from E3SM-Project/jgfouca/cime_maint_update_2022_06_01
2463015404 2022-04-21 14:09:46 -0600 Merge pull request #4905 from E3SM-Project/jgfouca/maint-cime-update
4af5914734 2021-12-16 09:10:07 -0700 Merge pull request #4708 from E3SM-Project/wadeburgess-patch-1
146cedfc8f 2021-03-26 12:09:43 -0500 Merge pull request #832 from akturner/seaice/mpas_seaice_paper
f7ab2117cb 2021-03-19 13:43:23 -0600 Merge pull request #3879 from ESMCI/azamat/mprof/replace-indef-fmt-xlf
b4d6e4677b 2021-03-04 15:44:26 -0600 Merge pull request #16 from akturner/seaice/error_analysis

I do have one idea that I'm trying now -- nope, that wasn't it.

ndkeen commented 8 months ago

With GNU, I don't get the NaN crash, but I do see both tests fail compare:

ERS_D_Lm3.ne4_oQU240.F2010.pm-cpu_gnu
ERS_Lm3.ne4_oQU240.F2010.pm-cpu_gnu

milenaveneziani commented 8 months ago

I don't know if this is helpful at all, but last night I restarted my job (going back to pnetcdf for all components) and first got this error while writing the mpas-si restart file:

  0: PIO: ERROR: Changing the define mode for file (E3SM-Arcticv2.1_historical0201.mpassi.rst.2003-11-01_00000.nc) 
failed. Low-level I/O library API failed. NetCDF: Operation not allowed in data mode (error num=-38), 
(/global/cfs/cdirs/m1199/milena/E3SM-Model/E3SMv2.1-Arctic_Nov2023/externals/scorpio/src/clib/pioc_support.c:4355)
  0: PIO: ERROR: Writing variable (iceAreaCategory, varid=34) to file (E3SM-Arcticv2.1_historical0201.mpassi.rst.2003-11-01_00000.nc, ncid=90) failed. 
The file was not opened for writing, try reopening the file in write mode (use the PIO_WRITE flag). 
NetCDF: Write to read only (error num=-37), (/global/cfs/cdirs/m1199/milena/E3SM-Model/E3SMv2.1-Arctic_Nov2023/externals/scorpio/src/clib/pio_darray.c:2045)

After a series of those (for a bunch of sea ice variables), I got the usual error:

6016: PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 180) to file 
(./E3SM-Arcticv2.1_historical0201.elm.h0.2003-10.nc, ncid=127) using PIO_IOTYPE_PNETCDF iotype failed. 
Non blocking write for variable (TWS_MONTH_BEGIN, varid=206) failed (Number of subarray requests/regions=1, 
Size of data local to this process = 48600). NetCDF: Numeric conversion not representable (err=-60). 
Aborting since the error handler was set to PIO_INTERNAL_ERROR... 
(/global/cfs/cdirs/m1199/milena/E3SM-Model/E3SMv2.1-Arctic_Nov2023/externals/scorpio/src/clib/pio_darray_int.c: 395)

dqwu commented 8 months ago

For PIO_IOTYPE_PNETCDF, "NetCDF: Numeric conversion not representable" usually means that the write buffer passed to PnetCDF write APIs contains invalid floating point values. @jayeshkrishna I remember a similar issue was reported before.

dqwu commented 8 months ago

A similar issue (though it is using scorpio classic) is #3122 (related to #3110). Need to confirm whether incorrect floating-point values are being used to write variable TWS_MONTH_BEGIN in the failed tests.

rljacob commented 8 months ago

Those are old issues, but Bryce's case was working back in October, so what changed since then?

milenaveneziani commented 8 months ago

Let me reiterate here that the branch I'm working with (close to v2.0, with changes to use intel on pm-cpu) worked just fine on Jan 15, but it started throwing those errors on the 19th.

ndkeen commented 8 months ago

Bryce, can you show us your case that was working in October?

dqwu commented 8 months ago

Peter Schwartz reported a similar issue for a recently failed CDash test:

FATES diffs are expected, but one test fails on mappy and pm-cpu (it doesn't occur on chrysalis) and needs to be looked into. The backtrace points to when PIO tries to close the history file:

PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 3780) to file (./ERP_Ld15.ne4pg2_ne4pg2.IELMFATES.pm-cpu_intel.elm-fates_cold_allvars.C.JNextIntegration20240130_205133.elm.h0.0001-01-01-00000.nc, ncid=53) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (CMASS_BALANCE_ERROR, varid=76) failed (Number of subarray requests/regions=1, Size of data local to this process = 192). NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u2/e/e3smtest/jenkins/workspace/ACME_Perlmutter_next_intel/E3SM/externals/scorpio/src/clib/pio_darray_int.c: 395)

jayeshkrishna commented 8 months ago

For PIO_IOTYPE_PNETCDF, "NetCDF: Numeric conversion not representable" usually means that the write buffer passed to PnetCDF write APIs contains invalid floating point values. @jayeshkrishna I remember a similar issue was reported before.

Yeah, usually NaNs in the user buffer (the array being written out)

milenaveneziani commented 8 months ago

So, is it possible that the intel compiler and PIO were more 'forgiving' with writing these nans prior to the Jan 17 downtime, for some reason?

dqwu commented 8 months ago

So, is it possible that the intel compiler and PIO were more 'forgiving' with writing these nans prior to the Jan 17 downtime, for some reason?

My guess is that there were no NaNs in the write buffer prior to the Jan 17 downtime. Machine environment changes on pm might now produce NaNs for the same test, triggering "NetCDF: Numeric conversion not representable (err=-60)".

dqwu commented 8 months ago

Jim Foucar confirmed a similar issue for a new CDash test on mappy with the GNU compiler:

PIO: FATAL ERROR: Aborting... An error occured, Writing variable CMASS_BALANCE_ERROR, varid=76, (total number of variables = 3780) to file ./ERP_Ld15.ne4pg2_ne4pg2.IELMFATES.mappy_gnu.elm-fates_cold_allvars.C.JNextDeveloper20240131_030810.elm.h0.0001-01-01-00000.nc (ncid=53) using serial I/O failed.. NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/home/e3sm-jenkins/jenkins-ws/workspace/mappy_e3sm_next/E3SM/externals/scorpio/src/clib/pio_darray_int.c: 882)

ndkeen commented 8 months ago

Is it the case that no error is given if the code writes a NaN to netcdf? I'm not seeing an error in DEBUG tests when writing restarts -- only when reading them back in. Is it easy to look for NaNs in a given netcdf file? If so, one could count the number of garbage values in files written/read before and after the Jan 17th maintenance.

beharrop commented 8 months ago

As I mentioned above, I was using a version of maint-2.0 with my own modifications on top (https://github.com/beharrop/E3SM/commit/f113954b5a986305ede13ff155816b4c51051ee4) for the runs I did last October.

The run I was doing lives here: /pscratch/sd/b/beharrop/E3SMv2/v2.LR.F2010.control_original/ and I've set the group ownership to e3sm so anyone here can browse its contents. I've looked back at some of the ELM restarts in that case, and there are NaNs in those restarts written last October. I look for NaNs with a simple:

ncdump -v TWS_MONTH_BEGIN CASENAME.elm.r.YYYY-01-01-00000.nc | grep TWS
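For counting NaNs rather than eyeballing the dump, a minimal programmatic sketch (assuming the netCDF4 and numpy Python packages; the file and variable names are passed as arguments):

import sys
import numpy as np
from netCDF4 import Dataset

# Usage: python count_nans.py <file.nc> <variable>
fname, varname = sys.argv[1], sys.argv[2]
with Dataset(fname) as nc:
    nc.set_auto_mask(False)   # raw values; fill values are expected here, NaNs are not
    data = nc.variables[varname][:].astype(float)
    print(f"{varname}: {np.isnan(data).sum()} NaN values out of {data.size}")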

milenaveneziani commented 8 months ago

Since in my case that particular error happens when writing elm history files, I checked some files that were written prior to Jan 17 (using ncdump -v TWS_MONTH_BEGIN E3SM-Arcticv2.1_historical0201.elm.h0.2003-10.nc for example) and I see plenty of missing values in them (they appear like this: 6135.854, _, 6086.887, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,). But if I check the last attempted history file (elm.h0.2003-11), then I just get a bunch of zeroes for that variable (so perhaps the file was just created but never updated?).

jayeshkrishna commented 8 months ago

It could also be due to uninitialized buffers (after updates these buffers might not be initialized to 0 etc) used for the variable (being written out to the file)

milenaveneziani commented 8 months ago

@jayeshkrishna, could you please elaborate on this:

It could also be due to uninitialized buffers (after updates these buffers might not be initialized to 0 etc) used for the variable (being written out to the file)

Are you saying that the uninitialized buffers were not a problem before the update but are a problem now?

jayeshkrishna commented 8 months ago

I am just speculating here, but updates to compilers etc can cause compiler strategies to change (for perf etc) regarding uninitialized buffers (all buffers written out need to be initialized to valid/fill values).

beharrop commented 8 months ago

I have been toying with some things to better understand what is happening, and thought I'd share. First, running with do_budgets = .false. in user_nl_elm gets around this error -- presumably because the code that tries to write TWS_MONTH_BEGIN gets skipped. Second, setting the model to run for 30 days and restart a bunch, it only fails when a restart happens on the first of the month. In other words, the model successfully runs when restarting on Jan 31 and on Mar 2, but fails when restarting on Apr 1. Anyone know why this would be a problem only for restarting at the start of the month? There are NaN values for TWS_MONTH_BEGIN in all of the restart files, so that can't be the only factor at play.

peterdschwartz commented 8 months ago

I have been toying with some things to better understand what is happening, and thought I'd share. First, running with do_budgets = .false. in user_nl_elm gets around this error -- presumably because the code that tries to write TWS_MONTH_BEGIN gets skipped. Second, setting the model to run for 30 days and restart a bunch, it only fails when a restart happens on the first of the month. In other words, the model successfully runs when restarting on Jan 31 and on Mar 2, but fails when restarting on Apr 1. Anyone know why this would be a problem only for restarting at the start of the month? There are NaN values for TWS_MONTH_BEGIN in all of the restart files, so that can't be the only factor at play.

@beharrop Yes, I believe I resolved that issue in master with this PR (https://github.com/E3SM-Project/E3SM/pull/5311), so adding it to maint-2.0 may help.

ndkeen commented 8 months ago

OK, after looking at a different error on master with @peterdschwartz (oh, I see he already posted above), he noted that his change in https://github.com/E3SM-Project/E3SM/pull/5311 might impact this issue. After I added those changes, both of these pass:

ERS_Lm3.ne4_oQU240.F2010.pm-cpu_intel
ERS_D_Lm3.ne4_oQU240.F2010.pm-cpu_intel

But that still does not explain why you say it was working before the Jan 17th PM maintenance and not after. It seems like it should never have worked...

beharrop commented 8 months ago

Thanks @peterdschwartz! I tried merging in the changes from #5311 into my code and it can now run without error. I agree with @ndkeen that I don't understand why this only suddenly became a problem, but I am happy to have a fix available. Can the powers that be simply merge #5311 into maint-2.0 (and maint-2.1 if it needs it) or do we need a new PR?

milenaveneziani commented 8 months ago

Thank you all for pointing us to this! I just merged #5311 into my branch as well and am waiting for the job to start, hopefully successfully this time. My guess is that, as @jayeshkrishna mentioned, something that changed during maintenance affected the way the compiler deals with the uninitialized buffers. In other words, we were getting away with it before but not now.

rljacob commented 8 months ago

@peterdschwartz please put that fix on the maint-2.0 and maint-2.1 branches. There's probably no point in trying to figure out more about what changed. It's annoying when sysadmins change things under a module, but it happens.

dqwu commented 8 months ago

@peterdschwartz please put that fix on the maint-2.0 and maint-2.1 branches. There's probably no point in trying to figure out more about what changed. It's annoying when sysadmins change things under a module, but it happens.

On the other hand, system/compiler changes might help us find previously unnoticed issues like uninitialized buffers.

jingtao-lbl commented 7 months ago

I got similar errors for other variables when writing files, including AR, CWDN, TCS_MONTH_BEGIN, etc. @rljacob @ndkeen I'm using the current master version, not a maint-2 branch, though.

384: PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 474) to file (...elm.h0.1985-03.nc, ncid=4347) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (CWDN, varid=72) failed (Number of subarray requests/regions=1, Size of data local to this process = 432). NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR...

rljacob commented 7 months ago

If it's on the latest master, you should make a new issue for it.

rljacob commented 7 months ago

Fixed by #6206

liyptardis commented 7 months ago

I have been toying with some things to better understand what is happening, and thought I'd share. First, running with do_budgets = .false. in user_nl_elm gets around this error -- presumably because the code that tries to write TWS_MONTH_BEGIN gets skipped. Second, setting the model to run for 30 days and restart a bunch, it only fails when a restart happens on the first of the month. In other words, the model successfully runs when restarting on Jan 31 and on Mar 2, but fails when restarting on Apr 1. Anyone know why this would be a problem only for restarting at the start of the month? There are NaN values for TWS_MONTH_BEGIN in all of the restart files, so that can't be the only factor at play.

This solution works for me. Thank you.