Closed golaz closed 6 years ago
Has NERSC been contacted to see if it is a system issue?
Yes, today I am seeing exactly the same error message with the ne30 model on Edison!
There haven't been any recent changes to PIO so likely a system issue. @helenhe40
@golaz: Can you also include information here (the script to run OR the ./create_newcase details) on how to recreate this issue? It looks like a system issue, but it might be worth trying on another system.
The /project file system was down at NERSC yesterday. It should be back up today. Please retry.
We can ping Stephen Leak (@sleak-lbl) with system issues too, while Helen is out-of-office until July 26th.
@amametjanov, that would explain the error. I have a job in the queue to try again.
I still see the same error (10:20am today).
The problem is gone (July 13, 14:26)!!
I feel like we should be able to be independent of the GPFS /project filesystem if we desire/need to be. It's possible that this filesystem is a factor in the slowdowns on Edison that have been plaguing us in other issues. Either way, it would be nice to have an easy way to:
1) populate all of the required input data for the run in scratch (instead of project)
2) write timing data elsewhere
3) avoid any other use of /project that I don't know about yet
For the record, I think we already have #1. If you point your input data directory somewhere other than /project, the build scripts will download all the necessary files from the inputdata server (some files that we still get from NCAR come from the CESM inputdata server).
As NERSC POC, if you think we should change the location, I say go ahead and point it to a new directory, and then the build scripts will slowly populate it as we start running.
I have tried changing the location of inputdata in the past and had trouble getting the required files. I'm sure we can fix that -- I have access to the first place to look for files, but not the next higher up. The only thing I've been able to do is simply copying all of the input data from /project to my own /scratch. Your suggestion to just move where we get data to /scratch may not be a bad idea. I can think of several methods of managing -- ultimately having options is best. I will think about it.
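For what it's worth, the "copy the input data from /project to /scratch" workaround could be scripted roughly like this. This is only a sketch: `mirror_inputdata` is a made-up helper name, the paths in the usage line are placeholders, and it does not touch the case configuration that tells the model where inputdata lives. It skips files that already exist at the destination with the same size, and it records files that fail with an OSError (such as the Input/output error seen on /project) instead of aborting.

```python
import os
import shutil


def mirror_inputdata(src_root, dst_root):
    """Copy files under src_root into dst_root (hypothetical helper).

    Files already present at the destination with the same size are skipped.
    Returns (copied, failed): copied is a list of destination paths, failed
    is a list of (source_path, OSError) pairs for files that could not be
    read or written.
    """
    copied, failed = [], []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            try:
                if (os.path.exists(dst)
                        and os.path.getsize(dst) == os.path.getsize(src)):
                    continue  # already mirrored
                shutil.copy2(src, dst)
                copied.append(dst)
            except OSError as exc:
                # e.g. errno 5 (Input/output error) on a degraded filesystem
                failed.append((src, exc))
    return copied, failed
```

Usage would be something like `copied, failed = mirror_inputdata("/path/to/project/inputdata", "/path/to/scratch/inputdata")`, with both paths replaced by real locations. Re-running it only copies what is still missing, so it can be repeated after a filesystem hiccup.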
Everyone running the model should have access to both repositories. See the ACME quick start guide: (step 4)
https://acme-climate.atlassian.net/wiki/display/Docs/Development+Quick+Guide
All my runs crashed again because of this issue today:
112: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/PNNL-PJR/csm/inputdata/atm/cam/chem/trop_mam/marine_BGC//monthly_macromolecules_0.1deg_bilinear_latlon_year01_merge_date.nc' error='Input/output error' errno=5 PE=00112 R_rec=00176 off=2777677824 len=0001048576 See MPICH_MPIIO_ABORT_ON_RW_ERROR.
223: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/PNNL-PJR/csm/inputdata/atm/cam/chem/trop_mam/marine_BGC//monthly_macromolecules_0.1deg_bilinear_latlon_year01_merge_date.nc' error='Input/output error' errno=5 PE=00223 R_rec=00206 off=2778726400 len=0001048576 See MPICH_MPIIO_ABORT_ON_RW_ERROR.
334: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/PNNL-PJR/csm/inputdata/atm/cam/chem/trop_mam/marine_BGC//monthly_macromolecules_0.1deg_bilinear_latlon_year01_merge_date.nc' error='Input/output error' errno=5 PE=00334 R_rec=00178 off=2779774976 len=0001048576 See MPICH_MPIIO_ABORT_ON_RW_ERROR.
001: ADIOI_CRAY_READCONTIG(288): filename='/project/projectdirs/PNNL-PJR/csm/inputdata/atm/cam/chem/trop_mam/marine_BGC//monthly_macromolecules_0.1deg_bilinear_latlon_year01_merge_date.nc' error='Input/output error' errno=5 PE=00001 R_rec=00714 off=2776629248 len=0001048576 See MPICH_MPIIO_ABORT_ON_RW_ERROR.
001: pio_support::pio_die:: myrank= -1 : ERROR: pionfget_mod.F90:
001: 289 : Unknow error occurs in reading file
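In case it helps with triage, here is a small sketch for pulling the fields out of log lines like the ones above, so you can see at a glance which files and offsets are failing. The regular expression is my own guess based only on the lines quoted in this comment, and `parse_read_errors` and `failures_per_file` are illustrative names, not part of PIO or MPICH.

```python
import re
from collections import Counter

# Matches the ADIOI_CRAY_READCONTIG failure lines quoted above. The field
# layout is inferred from those lines only (an assumption, not a spec).
READ_ERROR = re.compile(
    r"(?P<task>\d+): ADIOI_CRAY_READCONTIG\(\d+\): "
    r"filename='(?P<filename>[^']*)' "
    r"error='(?P<error>[^']*)' "
    r"errno=(?P<errno>\d+) "
    r"PE=(?P<pe>\d+) "
    r"R_rec=(?P<r_rec>\d+) "
    r"off=(?P<off>\d+) "
    r"len=(?P<len>\d+)"
)


def parse_read_errors(log_text):
    """Return one dict per read failure found in log_text."""
    records = []
    for match in READ_ERROR.finditer(log_text):
        rec = match.groupdict()
        # Numeric fields are zero-padded in the log; store them as ints.
        for key in ("errno", "pe", "r_rec", "off", "len"):
            rec[key] = int(rec[key])
        records.append(rec)
    return records


def failures_per_file(records):
    """Count read failures per input file."""
    return Counter(rec["filename"] for rec in records)
```

Feeding the job log text to `parse_read_errors` and then to `failures_per_file` shows whether the failures cluster on one input file, as they do here.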
The NERSC project directory is in degraded mode. Some reads will get an input/output error. Writes always succeed.
Please see the MOTD for updates.
Helen
I think we can close this now as it looks like it was an issue for a short time during NERSC filesystem problems and we haven't seen (this exact) issue again.
Yes, we can close.
I ran a short test run with the high-res coupled model on Edison today and encountered a strange IO error:
@polunma also reported seeing a similar error with a low-res AV1C compset.