Open penguian opened 7 years ago
@martin.dix@anu.edu.au edited the issue description
@martin.dix@anu.edu.au changed title from "CICE error handling" to "OASIS error handling"
@martin.dix@anu.edu.au commented
OASIS psmile/src/mod_oasis_io.F90 has several nf90_open calls with the same structure:
    inquire(file=trim(filename),exist=exists)
    if (exists) then
       status = nf90_open(trim(filename),NF90_NOWRITE,ncid)
       IF (status /= nf90_noerr) WRITE(nulprt,*) subname,' model :',compid,' proc :',&
            mpi_rank_local,':',TRIM(nf90_strerror(status))
    else
       write(nulprt,*) subname,' ERROR: file missing ',trim(filename)
       WRITE(nulprt,*) subname,' abort by model :',compid,' proc :',mpi_rank_local
       CALL oasis_flush(nulprt)
       call oasis_abort_noarg()
    endif
If the file is missing, this calls oasis_abort_noarg, which has
    CALL MPI_ABORT (mpi_comm_global, 0, ierror)
as does oasis_abort. There's also an oasis_mpi_abort routine which does set a non-zero error status.
The file missing message is written to ATM_RUNDIR/debug.root.01 (when it fails on PE0).
In OASIS3-MCT3.0 the same file handling code calls oasis_abort(), which calls MPI_ABORT with a default errorcode of 1 if it's not set otherwise as an argument.
@martin.dix@anu.edu.au changed status from assigned to accepted
@martin.dix@anu.edu.au set owner to mrd599
@martin.dix@anu.edu.au edited the issue description
@martin.dix@anu.edu.au commented
I initially incorrectly blamed CICE and noticed that the CICE routine mpi/ice_exit.F90 has
    write (ice_stderr,*) error_message
    call flush_fileunit(ice_stderr)
    call MPI_ABORT(MPI_COMM_WORLD, ierr)
The MPI_ABORT call is missing the errorcode argument, so ierr is taken as the errorcode and supplies a value of 0. This is used as the final program exit status, so cylc thinks it succeeded.
Routine definition at https://www.open-mpi.org/doc/v1.10/man3/MPI_Abort.3.php
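To illustrate why a zero errorcode hides the failure, here is a minimal Python sketch (not taken from CICE, OASIS or cylc). It assumes the abort errorcode ends up as the process exit status, as described above. The helper names run_child and looks_successful are hypothetical; looks_successful is only a stand-in for the kind of status check a scheduler performs.

```python
import subprocess
import sys


def run_child(exit_code):
    """Run a child process that reports an error on stderr, then exits with exit_code."""
    child_src = (
        "import sys\n"
        "sys.stderr.write('ERROR: file missing a2i.nc\\n')\n"
        f"sys.exit({exit_code})\n"
    )
    return subprocess.run(
        [sys.executable, "-c", child_src],
        capture_output=True,
        text=True,
    )


def looks_successful(result):
    # Hypothetical stand-in for a scheduler's check: only the exit status
    # is consulted, not the text of any error message.
    return result.returncode == 0


aborted_with_zero = run_child(0)  # analogous to an abort with errorcode 0
aborted_with_one = run_child(1)   # analogous to an abort with errorcode 1

print(looks_successful(aborted_with_zero))  # True: the failure goes unnoticed
print(looks_successful(aborted_with_one))   # False: the failure is detected
```

Both children write the same error message; only the exit status differs, and that is all the parent looks at.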
Also ice_fileunits.F90 has
    ice_stderr = 6 ! reserved unit for standard error
so the error message gets written to job.out, not job.err (unit 6 is conventionally standard output). As an example, this happens when running CICE with a decomposition inconsistent with the executable.
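The effect of writing to the wrong stream can be sketched in Python as follows. This is an illustration only, assuming job.out and job.err are simply the captured standard output and standard error of the job; the variable names job_out and job_err are hypothetical.

```python
import subprocess
import sys

# The child writes one message to stdout (the analogue of Fortran unit 6)
# and one to stderr.
child_src = (
    "import sys\n"
    "print('ERROR: written to stdout, like Fortran unit 6')\n"
    "print('ERROR: written to stderr', file=sys.stderr)\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_src],
    capture_output=True,
    text=True,
)

# Assumption: the job wrapper stores the two streams in separate files.
job_out, job_err = result.stdout, result.stderr

print("job.out:", job_out.strip())
print("job.err:", job_err.strip())
```

A message sent to the standard output unit never reaches the captured stderr, so anyone checking only job.err would miss it.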
@martin.dix@anu.edu.au commented
OASIS branch https://access-svn.nci.org.au/trac/oasis/browser/branches/dev/mrd599/oasis3-mct-errorhandling duplicates the error messages to stderr and also ensures a non-zero exit status from an abort. A run without a2i.nc now has this message in job.err:
    oasis_io_read_avfile ERROR: file missing a2i.nc
    oasis_io_read_avfile model : 1 proc : 0
    oasis_io_read_avfile abort by model : 1 proc : 0
and cylc recognises that it's failed.
@martin.dix@anu.edu.au commented
The CICE changes mentioned in comment 4 above are now included in the standard CMIP6 branch https://access-svn.nci.org.au/trac/cice/changeset/403
| by mrd599@nci.org.au
By running individual components of the coupled suite explicitly from the cylc GUI I accidentally started a model where some of the oasis initial files weren't present (e.g. a2i.nc).
The job.out file reported
job.err also had stack traces from the UM abort, but cylc reported that the run had succeeded. In this case PE576 is the first PE used by CICE. Running on a smaller atmospheric decomposition gave the ABORT from PE0, so it depends which component finds that its required file is missing first.
Issue migrated from trac:318 at 2024-01-31 18:29:14 +1100