ACCESS-NRI / accessdev-Trac-archive

Archive accessdev Trac contents as issues

OASIS error handling #318

Open penguian opened 7 years ago

penguian commented 7 years ago

by mrd599@nci.org.au


While running individual components of the coupled suite explicitly from the cylc GUI, I accidentally started a model run in which some of the OASIS initial files (e.g. a2i.nc) were missing.

The job.out file reported

MPI_ABORT was invoked on rank 576 in communicator MPI_COMM_WORLD 
with errorcode 0.

job.err also contained stack traces from the UM abort, but cylc reported that the run had succeeded. In this case PE576 is the first PE used by CICE. Running with a smaller atmospheric decomposition gave the abort from PE0, so the aborting rank depends on which component first finds that a required file is missing.


Issue migrated from trac:318 at 2024-01-31 18:29:14 +1100

penguian commented 7 years ago

@martin.dix@anu.edu.au edited the issue description

penguian commented 7 years ago

@martin.dix@anu.edu.au changed title from CICE error handling to OASIS error handling

penguian commented 7 years ago

@martin.dix@anu.edu.au commented


In OASIS, psmile/src/mod_oasis_io.F90 has several nf90_open calls with the same structure:

      inquire(file=trim(filename),exist=exists)
      if (exists) then
         status = nf90_open(trim(filename),NF90_NOWRITE,ncid)
         IF (status /= nf90_noerr) WRITE(nulprt,*) subname,' model :',compid,' proc :',&
                                                   mpi_rank_local,':',TRIM(nf90_strerror(status))
      else
         write(nulprt,*) subname,' ERROR: file missing ',trim(filename)
         WRITE(nulprt,*) subname,' abort by  model :',compid,' proc :',mpi_rank_local
         CALL oasis_flush(nulprt)
         call oasis_abort_noarg()
      endif

If the file is missing, this calls oasis_abort_noarg, which contains

CALL MPI_ABORT (mpi_comm_global, 0, ierror)

as does oasis_abort. There's also an oasis_mpi_abort routine which does set a non-zero error status.

The file missing message is written to ATM_RUNDIR/debug.root.01 (when it fails on PE0).

In OASIS3-MCT3.0 the same file-handling code calls oasis_abort(), which calls MPI_ABORT with an errorcode that defaults to 1 unless a different value is passed as an argument.

penguian commented 7 years ago

@martin.dix@anu.edu.au changed status from assigned to accepted

penguian commented 7 years ago

@martin.dix@anu.edu.au set owner to mrd599

penguian commented 7 years ago

@martin.dix@anu.edu.au edited the issue description

penguian commented 7 years ago

@martin.dix@anu.edu.au commented


I initially blamed CICE, incorrectly, and in the process noticed that the CICE routine in mpi/ice_exit.F90 has

      write (ice_stderr,*) error_message
      call flush_fileunit(ice_stderr)

      call MPI_ABORT(MPI_COMM_WORLD, ierr)

The MPI_ABORT call is missing the errorcode argument, so ierr ends up in the errorcode position and supplies a value of 0. That value is used as the final program exit status, so cylc thinks the run succeeded.

Routine definition at https://www.open-mpi.org/doc/v1.10/man3/MPI_Abort.3.php
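
The effect on the scheduler can be reproduced without MPI. The following is a minimal Python sketch, not cylc's actual logic: it assumes only that the job runner judges success by the process exit status. The child process is a stand-in for an aborting model, and the helper names `run_child` and `job_succeeded` are hypothetical.

```python
import subprocess
import sys


def run_child(exit_code):
    """Run a child process that reports an error, then exits with exit_code.

    The child stands in for a model that aborts; it is not an MPI program.
    """
    child_source = (
        "import sys\n"
        "print('ERROR: file missing a2i.nc', file=sys.stderr)\n"
        f"sys.exit({exit_code})\n"
    )
    return subprocess.run(
        [sys.executable, "-c", child_source],
        capture_output=True,
        text=True,
    )


def job_succeeded(result):
    """Assumed scheduler rule: exit status 0 is success, anything else is failure."""
    return result.returncode == 0


if __name__ == "__main__":
    # The error message is identical in both runs; only the exit status differs.
    for code in (0, 1):
        result = run_child(code)
        print(f"exit status {result.returncode}: succeeded={job_succeeded(result)}")
```

With exit status 0 the error message is still printed, but the assumed success test passes, which mirrors the behaviour reported here.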

Also ice_fileunits.F90 has

       ice_stderr =  6    ! reserved unit for standard error

so the error message is written to job.out rather than job.err, because unit 6 is conventionally standard output. As an example, this happens when running CICE with a decomposition inconsistent with the executable.
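
Why the choice of stream matters can be shown with a small Python sketch. It assumes that the job's standard output and standard error are captured separately, as with job.out and job.err. The child process, its message text, and the helper `run_and_capture` are hypothetical stand-ins, not CICE code.

```python
import subprocess
import sys


def run_and_capture(stream_name):
    """Run a child that writes one error line to 'stdout' or 'stderr'.

    Returns the captured (stdout, stderr) text, analogous to job.out and job.err.
    """
    if stream_name not in ("stdout", "stderr"):
        raise ValueError("stream_name must be 'stdout' or 'stderr'")
    child_source = (
        "import sys\n"
        f"print('ERROR: example abort message', file=sys.{stream_name})\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_source],
        capture_output=True,
        text=True,
    )
    return result.stdout, result.stderr


if __name__ == "__main__":
    for name in ("stdout", "stderr"):
        out, err = run_and_capture(name)
        print(f"written to {name}: in out={'ERROR' in out}, in err={'ERROR' in err}")
```

Anyone who checks only the error file for failures will miss a message that was written to standard output.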

penguian commented 7 years ago

@martin.dix@anu.edu.au commented


OASIS branch https://access-svn.nci.org.au/trac/oasis/browser/branches/dev/mrd599/oasis3-mct-errorhandling duplicates the error messages to stderr and also ensures a non-zero exit status from an abort. A run without a2i.nc now has this message in job.err

 oasis_io_read_avfile ERROR: file missing a2i.nc
 oasis_io_read_avfile model :           1  proc :           0
 oasis_io_read_avfile abort by model :           1  proc :           0

and cylc recognises that it's failed.

penguian commented 5 years ago

@martin.dix@anu.edu.au commented


The CICE changes mentioned in comment 4 above are now included in the standard CMIP6 branch https://access-svn.nci.org.au/trac/cice/changeset/403