E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

IO error on Edison #948

Closed golaz closed 6 years ago

golaz commented 8 years ago

I ran a short test run with the high-res coupled model on Edison today and encountered a strange IO error:

0013: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/acme/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_0000-01-ne120np4_L72_c160318.nc'  error='Input/output error'  errno=5  PE=00013  R_rec=00007  off=7235174400  len=0004194304  See MPICH_MPIIO_ABORT_ON_RW_ERROR.
0001:  pio_support::pio_die:: myrank=          -1 : ERROR: pionfread_mod.F90.in:
0001:          204 : Unknow error occurs in reading file

@polunma also reported seeing a similar error with a low-res AV1C compset.

cameronsmith1 commented 8 years ago

Has NERSC been contacted to see if it is a system issue?

polunma commented 8 years ago

Yes, today I am seeing exactly the same error message with the ne30 model on Edison!

rljacob commented 8 years ago

There haven't been any recent changes to PIO so likely a system issue. @helenhe40

jayeshkrishna commented 8 years ago

@golaz: Can you also include information here (a script to run, or the ./create_newcase details) on how to recreate this issue? It looks like a system issue, but it might be worth trying on another system.

amametjanov commented 8 years ago

The /project file system was down at NERSC yesterday. Should be back up today. Please re-try.

We can ping Stephen Leak (@sleak-lbl) with system issues too, while Helen is out-of-office until July 26th.

golaz commented 8 years ago

@amametjanov, that would explain the error. I have a job in the queue to try again.

polunma commented 8 years ago

I still see the same error (10:20am today).

polunma commented 8 years ago

The problem is gone (July 13, 14:26)!!

ndkeen commented 8 years ago

I feel like we should be able to be independent of the GPFS /project filesystem if we want or need to be. It's possible that this filesystem is a factor in the slowdowns on Edison that have been plaguing us in other issues. Either way, it would be nice to have an easy way to: 1) populate all of the required input data for the run in scratch (instead of project), 2) write timing data elsewhere, and 3) avoid any other use of /project that I don't know about yet.

mt5555 commented 8 years ago

For the record, I think we already have item 1). If you point your inputdata directory somewhere other than /project, the build scripts will download all the necessary files from the inputdata server (some files that we still get from NCAR come from the CESM inputdata server).

As NERSC POC, if you think we should change the location, I say go ahead and point it to a new directory, and then the build scripts will slowly populate it as we start running.

ndkeen commented 8 years ago

I have tried changing the location of inputdata in the past and had trouble getting the required files. I'm sure we can fix that -- I have access to the first place to look for files, but not the next higher up. The only thing I've been able to do is simply copy all of the input data from /project to my own /scratch. Your suggestion to just move where we get data to /scratch may not be a bad idea. I can think of several methods of managing this; ultimately, having options is best. I will think about it.
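
The manual copy from /project to scratch described above could be scripted roughly as follows. This is only a sketch: `mirror_inputdata` is a hypothetical helper, not part of the ACME scripts, and the source and destination directories are whatever you pass in. It keeps the relative layout of the tree and skips files already present with the same size, so it can be re-run after an interruption.

```python
import shutil
from pathlib import Path


def mirror_inputdata(src_root, dst_root):
    """Copy every file under src_root into dst_root, keeping relative paths.

    Files that already exist at the destination with the same size are
    skipped. Returns the relative paths (POSIX style) that were copied.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        # Size-only comparison: cheap, but it will not detect corrupted copies.
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(rel.as_posix())
    return copied
```

Calling it with the project inputdata directory as `src_root` and a directory under your own scratch space as `dst_root` would reproduce the manual copy; the case would then still need to be pointed at the new location.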

mt5555 commented 8 years ago

Everyone running the model should have access to both repositories. See the ACME quick start guide: (step 4)

https://acme-climate.atlassian.net/wiki/display/Docs/Development+Quick+Guide

polunma commented 8 years ago

All my runs crashed again because of this issue today:

112: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/PNNL-PJR/csm/inputdata/atm/cam/chem/trop_mam/marine_BGC//monthly_macromolecules_0.1deg_bilinear_latlon_year01_merge_date.nc' error='Input/output error' errno=5 PE=00112 R_rec=00176 off=2777677824 len=0001048576 See MPICH_MPIIO_ABORT_ON_RW_ERROR.
223: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/PNNL-PJR/csm/inputdata/atm/cam/chem/trop_mam/marine_BGC//monthly_macromolecules_0.1deg_bilinear_latlon_year01_merge_date.nc' error='Input/output error' errno=5 PE=00223 R_rec=00206 off=2778726400 len=0001048576 See MPICH_MPIIO_ABORT_ON_RW_ERROR.
334: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/PNNL-PJR/csm/inputdata/atm/cam/chem/trop_mam/marine_BGC//monthly_macromolecules_0.1deg_bilinear_latlon_year01_merge_date.nc' error='Input/output error' errno=5 PE=00334 R_rec=00178 off=2779774976 len=0001048576 See MPICH_MPIIO_ABORT_ON_RW_ERROR.
001: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/PNNL-PJR/csm/inputdata/atm/cam/chem/trop_mam/marine_BGC//monthly_macromolecules_0.1deg_bilinear_latlon_year01_merge_date.nc' error='Input/output error' errno=5 PE=00001 R_rec=00714 off=2776629248 len=0001048576 See MPICH_MPIIO_ABORT_ON_RW_ERROR.
001: pio_support::pio_die:: myrank= -1 : ERROR: pionfget_mod.F90:
001: 289 : Unknow error occurs in reading file
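
When many ranks report the same failure, it can help to group the messages by file and offset before contacting the system staff. The sketch below parses lines in the format shown in the log above; the regular expression is inferred from these messages only, not from any Cray documentation, and `summarize_read_errors` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Pattern inferred from the ADIOI_CRAY_READCONTIG lines in this thread.
READ_ERROR = re.compile(
    r"ADIOI_CRAY_READCONTIG\(\d+\):\s*filename='(?P<filename>[^']+)'\s+"
    r"error='(?P<error>[^']+)'\s+errno=(?P<errno>\d+)\s+PE=(?P<pe>\d+)\s+"
    r"R_rec=\d+\s+off=(?P<off>\d+)\s+len=(?P<len>\d+)"
)


def summarize_read_errors(log_text):
    """Group failed reads by file.

    Returns {filename: [(offset, length, pe), ...]} with each list sorted
    by offset, so adjacent failing blocks are easy to spot.
    """
    by_file = defaultdict(list)
    for m in READ_ERROR.finditer(log_text):
        by_file[m["filename"]].append(
            (int(m["off"]), int(m["len"]), int(m["pe"]))
        )
    return {name: sorted(hits) for name, hits in by_file.items()}
```

Applied to the four messages above, the sorted offsets are 2776629248, 2777677824, 2778726400 and 2779774976, each with length 1048576: four consecutive 1 MiB blocks of the same file.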

helenhe40 commented 8 years ago

The NERSC project directory is in degraded mode. Some reads will get an input/output error. Writes always succeed.

Please see MOTD for update.

Helen
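
While the filesystem is degraded, one way to find out which input files are affected before submitting a job is to read them end to end and record the failures. The following is a minimal sketch under assumptions: `find_unreadable` and `io_errors_only` are hypothetical helpers, and it is assumed (not confirmed in this thread) that a plain POSIX read hits the same error that the MPI-IO layer reports as errno=5.

```python
import errno


def find_unreadable(paths, chunk_size=4 * 1024 * 1024):
    """Read each file end to end; return {path: errno} for files that fail.

    Any OSError is recorded, so missing files and permission problems
    show up alongside genuine I/O errors.
    """
    failures = {}
    for path in paths:
        try:
            with open(path, "rb") as f:
                # Discard the data; only the success of each read matters.
                while f.read(chunk_size):
                    pass
        except OSError as exc:
            failures[str(path)] = exc.errno
    return failures


def io_errors_only(failures):
    """Keep only entries whose errno is EIO ("Input/output error")."""
    return {p: e for p, e in failures.items() if e == errno.EIO}
```

Reading a full inputdata set this way can take a long time, so it is probably best limited to the files a given case actually needs.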

(Sent by email in reply to Po-Lun Ma's comment of Jul 16, 2016, 5:44 AM.)

ndkeen commented 6 years ago

I think we can close this now as it looks like it was an issue for a short time during NERSC filesystem problems and we haven't seen (this exact) issue again.

golaz commented 6 years ago

Yes, we can close.