E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Apparent SCORPIO failure in long production segments #3684

Closed · golaz closed this issue 4 years ago

golaz commented 4 years ago

I'm running a long (100+ years) low-res coupled simulation on compy with a very recent version of master (92d0e8ef8014a892edb53613411baaee582afba6). This is my first long simulation with Scorpio (PIO_VERSION=2). I configured the simulation to run in segments of 20 years each. The first three segments so far have all failed during IO late in year 16 of the segment. One job crashed with a segmentation fault and the other two jobs hung without advancing until I killed and resubmitted them. They all restarted fine from previous restart files.

Given the systematic nature of the failures (3 out of 3 and all in year 16), there might be some lingering issues with Scorpio (PIO_VERSION=2) that are causing these failures.

For details of the simulation, see

https://acme-climate.atlassian.net/wiki/spaces/EWCG/pages/1572570150/20200702.alpha3+0.piControl.ne30pg2+r05+oECv3+ICG.compy

rljacob commented 4 years ago

Sounds like a memory leak. You have to run 16 years to see it? How long does that take (in wallclock)?

PeterCaldwell commented 4 years ago

Definitely sounds like a memory leak. May be quicker to reproduce by switching to daily rather than monthly h0 files? Also, I wonder if we can reproduce it as easily with ne4 rather than whatever higher-res run you're using?
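(For reference, a minimal sketch of that change, assuming the same output streams as the run script posted later in this thread: append to user_nl_cam in the case directory so the first history stream (h0) writes daily instead of monthly, then regenerate the namelists. The mfilt value of 30 daily samples per h0 file is an assumption.)

cat <<EOF >> user_nl_cam
nhtfrq = -24,-24,-6,-6,-3,-24  ! first entry -24 = daily h0 instead of 0 = monthly
mfilt  = 30,30,120,120,240,30  ! 30 daily samples per h0 file (assumed)
EOF
./preview_namelists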

jayeshkrishna commented 4 years ago

@golaz: Can you also include the contents of the run directory (ls -lt)?

The hang/crash seems to be while creating a new file. Can you also ensure that you are not running out of disk/quota space?
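(A minimal sketch of the checks being asked for here, run from the case run directory; the exact quota command on compy is site-specific and not shown.)

ls -lt | head -n 40   # newest files first, without listing everything
df -h .               # free space on the underlying file system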

golaz commented 4 years ago

This configuration runs at approximately 20 SYPD, so it takes less than 24 hours. Presumably this could be reproduced faster by having more frequent output.

golaz commented 4 years ago

@jayeshkrishna: there are over 1600 files currently in the run directory, so it would not be practical to list them here. The directory is

/compyfs/gola749/E3SM_simulations/20200702.alpha3_0.piControl.ne30pg2_r05_oECv3_ICG.compy/run

While the disk on compy is quite full, I have always been able to restart the simulation, so that's probably not the root cause.

ambrad commented 4 years ago

@rljacob would any of the rss output in e3sm.log be of use? I looked at the three rss output sets, for ranks 0, 2160, and 2720, e.g., with grep "^2160.* rss" e3sm.log.114981.200705-070555. While all three series suggest small leaks are possible, none of the three point to a large one, and arithmetic suggests none is close to causing the 192 GB on a Compy node to be exhausted.
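(For reference, a sketch that repeats the same check for all three ranks, using the log file name from this comment; it prints the last few rss samples per rank so any growth is easy to eyeball.)

for rank in 0 2160 2720; do
  echo "rank ${rank}:"
  grep "^${rank}.* rss" e3sm.log.114981.200705-070555 | tail -n 5
done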

rljacob commented 4 years ago

I can never remember how the memory diags work. But yes it seems hard to exhaust 192 GB without something obvious.

mt5555 commented 4 years ago

Could it be related to this issue:

https://github.com/E3SM-Project/scorpio/issues/315

Exceeding the 4K IO IDs? The fact that it takes a long time to trigger made me think of that issue, which would only occur after writing many files.

jayeshkrishna commented 4 years ago

There are 3 cases that @golaz has listed out in Confluence; the first one is a crash, the other two are hangs. The crash occurs while creating a file, and the hangs seem to be (from the logs) while writing the contents of the history file. It's unlikely that E3SM-Project/scorpio#315 is the cause. @dqwu, any thoughts?

dqwu commented 4 years ago

> There are 3 cases that @golaz has listed out in Confluence; the first one is a crash, the other two are hangs. The crash occurs while creating a file, and the hangs seem to be (from the logs) while writing the contents of the history file. It's unlikely that E3SM-Project/scorpio#315 is the cause. @dqwu, any thoughts?

Maybe we should first try latest scorpio master branch to see if this issue is still reproducible.

cd externals/scorpio
git checkout master

tangq commented 4 years ago

Some additional information that may be helpful.

I am running the RRM test with the same settings as @golaz on cori-knl. The simulation completed 5 years successfully at /global/cscratch1/sd/tang30/E3SM_simulations/20200701.v1like.f2010.northamericax4v1pg2_r0125_northamericax4v1pg2.cori-knl.

I checked that my run uses the PIO_TYPENAME of pnetcdf. Is that PIO2?

dqwu commented 4 years ago

> Some additional information that may be helpful.
>
> I am running the RRM test with the same settings as @golaz on cori-knl. The simulation completed 5 years successfully at /global/cscratch1/sd/tang30/E3SM_simulations/20200701.v1like.f2010.northamericax4v1pg2_r0125_northamericax4v1pg2.cori-knl.
>
> I checked that my run uses the PIO_TYPENAME of pnetcdf. Is that PIO2?

E3SM uses Scorpio by default now. You can double-check bld/pio.bldlog.*.gz to see if Scorpio is being used to build e3sm.exe.
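(A quick way to do that check from the case directory; a sketch, assuming the build log is gzipped as usual.)

zgrep -i scorpio bld/pio.bldlog.*.gz | head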

tangq commented 4 years ago

@dqwu, thanks for the information. The PIO build log shows my run uses Scorpio.

singhbalwinder commented 4 years ago

env_build.xml has the following field which tells us whether it is PIO1 or PIO2 (Scorpio): <entry id="PIO_VERSION" value="2">
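(A sketch of two ways to read that field from the case directory; xmlquery returns the same value without opening the file.)

./xmlquery PIO_VERSION
grep 'PIO_VERSION' env_build.xml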

tangq commented 4 years ago

@singhbalwinder's method also confirms the run uses PIO2.

jayeshkrishna commented 4 years ago

@golaz / @tangq : Can I run the simulation with ne4 resolution (I want to try a run for 20 years with minimal resources to check if I can recreate the issue)?

(PIO_VERSION = 2 implies you are using Scorpio, which is derived from PIO2. PIO_TYPENAME indicates the low-level library used by Scorpio, and pnetcdf is the recommended library to be used with Scorpio)

tangq commented 4 years ago

@jayeshkrishna, you can try ne4 to reproduce the issue. If successful, it will help identify the cause.

Thanks for the explanation of the Scorpio settings.

jayeshkrishna commented 4 years ago

Does anyone know how I can run a configuration similar to A_WCYCL1850S_CMIP6 + ne30pg2_r05_oECv3_ICG using the ne4 resolution? Simple attempts (adding a new grid config: ne4pg2 + r05 + oQU240, etc.) all fail due to missing mapping files, etc.

Would ne4_oQU240 + A_WCYCL1850 work?

jayeshkrishna commented 4 years ago

@jonbob has an ne4 resolution for pg2 in PR #3606. I will try that and see how it goes. Thanks @jonbob!

mt5555 commented 4 years ago

As an FYI, in PR #3606 I think the surfdata files for some configurations are missing. So, for example, F2010SC5-CMIP6 won't build out of the box, but FC5AV1C-L will. (I've been running FC5AV1C-L with ne4pg2_ne4pg2.)

jayeshkrishna commented 4 years ago

I was able to complete a simple 5-day run with FC5AV1C-L + ne4pg2_ne4pg2 with minor modifications to the user_nl_cam that @golaz used for his runs. Since the issue seems to be related to CAM outputs, I am going to do a 2-year run using this config (updating the output settings based on @golaz's run script) and then do a 20-year run. I will update the issue with my run script soon. Thanks @mt5555!

Unfortunately, even with the branch in PR #3606, I wasn't able to get the ne4 runs to work with F2010SC5-CMIP6. There were still some missing mapping files (I downloaded the ones that were available and also created the MPAS partition files for compy, but some mapping files remained missing; once I get the 20-year run going with FC5AV1C-L + ne4pg2_ne4pg2 I will look into it in more detail). I will also try the ne16 configs that were added in the PR.

jayeshkrishna commented 4 years ago

The 2-year run was successful with FC5AV1C-L + ne4pg2_ne4pg2, and I have submitted a run for 20 years. My run script is given below for reference:

#!/bin/bash
module unload python
module load python/2.7.9
./create_newcase -case FC5AV1C-L_ne4pg2_ne4pg2_20yrs -compset FC5AV1C-L -res ne4pg2_ne4pg2 --handle-preexisting-dirs r
cd FC5AV1C-L_ne4pg2_ne4pg2_20yrs

# batch wall-clock limit
./xmlchange JOB_WALLCLOCK_TIME='08:00:00'

# theta-l dycore with COSP simulator output enabled
./xmlchange CAM_TARGET=theta-l
./xmlchange --id CAM_CONFIG_OPTS --append --val='-cosp'

# total run length: 20 years
./xmlchange --id STOP_OPTION --val 'nyears'
./xmlchange --id STOP_N      --val 20

# write restart files every 5 years
./xmlchange --id REST_OPTION --val 'nyears'
./xmlchange --id REST_N      --val 5

# enable coupler budget diagnostics
./xmlchange --id BUDGETS     --val 'true'

# coupler history output every 5 years
./xmlchange --id HIST_OPTION --val 'nyears'
./xmlchange --id HIST_N      --val 5

# FIXME : Add the variable below back to user_nl_cam
#ncdata = '../init/20171228.beta3rc13_1850.ne30_oECv3_ICG.edison.cam.i.0331-01-01-00000.nc'

cat <<EOF >> user_nl_cam
nhtfrq =   0,-24,-6,-6,-3,-24
mfilt  = 1,30,120,120,240,30
avgflag_pertape = 'A','A','I','A','A','A'
fexcl1 = 'CFAD_SR532_CAL'
fincl1 = 'extinct_sw_inp','extinct_lw_bnd7','extinct_lw_inp','CLD_CAL', 'TREFMNAV', 'TREFMXAV'
fincl2 = 'FLUT','PRECT','U200','V200','U850','V850','Z500','OMEGA500','UBOT','VBOT','TREFHT','TREFHTMN','TREFHTMX','QREFHT','TS','PS','TMQ','TUQ','TVQ','TOZ'
fincl3 = 'PSL','T200','T500','U850','V850','UBOT','VBOT','TREFHT'
fincl4 = 'FLUT','U200','U850','PRECT','OMEGA500'
fincl5 = 'PRECT','PRECC'
fincl6 = 'CLDTOT_ISCCP','MEANCLDALB_ISCCP','MEANTAU_ISCCP','MEANPTOP_ISCCP','MEANTB_ISCCP','CLDTOT_CAL','CLDTOT_CAL_LIQ','CLDTOT_CAL_ICE','CLDTOT_CAL_UN','CLDHGH_CAL','CLDHGH_CAL_LIQ','CLDHGH_CAL_ICE','CLDHGH_CAL_UN','CLDMED_CAL','CLDMED_CAL_LIQ','CLDMED_CAL_ICE','CLDMED_CAL_UN','CLDLOW_CAL','CLDLOW_CAL_LIQ','CLDLOW_CAL_ICE','CLDLOW_CAL_UN'

ieflx_opt = 2 ! =0 AMIP simulations, = 2 for coupled

clubb_c_K10h = 0.30
clubb_c14 = 1.06D0
dust_emis_fact =  1.50D0
linoz_psc_T = 197.5
EOF

./case.setup
./case.build
./case.submit

jayeshkrishna commented 4 years ago

I am able to recreate the issue with the ne4 simulations above. The error seems to occur in year 16 (crashes or hangs as described in the issue above) and seems to be related to some memory corruption while writing out files (year 15?). I am able to run the ne4 case successfully for 20 years after applying the fix in E3SM-Project/scorpio#326 and using the latest Scorpio master.

I am currently doing some debug runs with the ne4 case to narrow down the issue. We will be integrating PR 326 into Scorpio master and bringing the latest version of Scorpio to E3SM this week.
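(In the meantime, a sketch of how one might try the latest Scorpio master by hand in an existing E3SM checkout, along the lines of the suggestion earlier in this thread; the clean rebuild from the case directory is an assumed extra step so the updated library is actually picked up.)

cd externals/scorpio
git fetch origin
git checkout master
git pull                  # bring the submodule up to the latest Scorpio master
cd ../..

# from the case directory, force a full rebuild so the new Scorpio is used
./case.build --clean-all
./case.build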