COSIMA / access-om3

ACCESS-OM3 global ocean-sea ice-wave coupled model

Failing WW3 Restart File Read in WW3 #97

Closed by ezhilsabareesh8 5 months ago

ezhilsabareesh8 commented 8 months ago

The MOM6-CICE6-WW3 IAF and RYF configs run well for the initial run with a cold start. However, on attempting to restart the simulation, an MPI error occurs here, caused by a failing MPI_waitall call.

anton-seaice commented 8 months ago

I noticed some of the processes were failing here and calling EXTCDE ( 5 ). I wonder whether changing the switches means it is now looking for ice thickness data in the restart, which it wasn't looking for previously?

    ! 1.e Ice thickness interval
    !
    IF ( FLIC1 ) THEN
      IF ( TIC1(1) .GE. 0 ) THEN
        DTI10   = DSEC21 ( TIC1 , TI1 )
      ELSE
        DTI10   = 1.
      END IF
#ifdef W3_T
      WRITE (NDST,9015) DTI10
#endif
      IF ( DTI10 .LT. 0. ) THEN
        IF ( IAPROC .EQ. NAPERR ) WRITE (NDSE,1005)
        CALL EXTCDE ( 5 )
      END IF
    ELSE
      DTI10   = 0.
    END IF
    !

I guess there is something weird going on with TIC1 or TI1
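
To make the failure path concrete, here is a rough Python sketch of the check in the 1.e block above. It is not WW3 code: `dsec21` and `ice_thickness_interval` are hypothetical stand-ins, times are assumed to be (YYYYMMDD, HHMMSS) integer pairs, and a negative date is assumed to mean "not yet set".

```python
from datetime import datetime


def dsec21(time1, time2):
    """Seconds from time1 to time2; each is a (YYYYMMDD, HHMMSS) integer pair.

    Hypothetical stand-in for the Fortran DSEC21 used in the snippet above.
    """
    def to_datetime(t):
        return datetime.strptime(f"{t[0]:08d}{t[1]:06d}", "%Y%m%d%H%M%S")

    return (to_datetime(time2) - to_datetime(time1)).total_seconds()


def ice_thickness_interval(flic1, tic1, ti1):
    """Mirror the 1.e block: return DTI10, or raise where WW3 calls EXTCDE ( 5 )."""
    if not flic1:
        return 0.0
    # A negative date is assumed to mean the IC1 time has not been set yet.
    dti10 = dsec21(tic1, ti1) if tic1[0] >= 0 else 1.0
    if dti10 < 0.0:
        raise RuntimeError("NEW IC1 FIELD BEFORE OLD IC1 FIELD")
    return dti10


# Next field one hour after the current one: a valid interval.
print(ice_thickness_interval(True, (19000101, 0), (19000101, 10000)))
```

If TIC1 comes back from the restart later than TI1 (or holds a value that was never read properly), the interval goes negative and the sketch raises at the same point where the model exits.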

    Image              PC                Routine            Line        Source
    access-om3-MOM6-C  00000000049615EB  Unknown               Unknown  Unknown
    libpthread-2.28.s  00001479C5AA7CF0  Unknown               Unknown  Unknown
    libpthread-2.28.s  00001479C5AA345A  pthread_cond_wait     Unknown  Unknown
    libopen-pal.so.40  00001479BFBDC40D  PMIx_Abort            Unknown  Unknown
    libopen-pal.so.40  00001479BFC49250  pmix3x_abort          Unknown  Unknown
    libopen-rte.so.40  00001479C099EE27  Unknown               Unknown  Unknown
    libopen-rte.so.40  00001479C09B6BB6  orte_errmgr_base_     Unknown  Unknown
    libopen-rte.so.40  00001479C09A695B  Unknown               Unknown  Unknown
    libmpi.so.40.30.4  00001479C655181A  ompi_mpi_abort        Unknown  Unknown
    libmpi_mpifh.so    00001479C68842DE  Unknown               Unknown  Unknown
    access-om3-MOM6-C  0000000004529637  w3servmd_mp_extcd         865  w3servmd.F90
    access-om3-MOM6-C  0000000004548F9C  w3wavemd_mp_w3wav         874  w3wavemd.F90
    access-om3-MOM6-C  00000000043AECE1  wav_comp_nuopc_mp        1140  wav_comp_nuopc.F90
    access-om3-MOM6-C  000000000200211F  _ZNK5ESMCI13Metho         377  ESMCI_MethodTable.C
    access-om3-MOM6-C  0000000002002098  _ZN5ESMCI11Method         563  ESMCI_MethodTable.C
    access-om3-MOM6-C  0000000002000B1B  c_esmc_methodtabl         317  ESMCI_MethodTable.C

ezhilsabareesh8 commented 7 months ago

Thanks for pointing out the error, @anton-seaice. There seems to be a discrepancy between the time stamps TIC1 and TI1, which makes DTI10 negative and triggers IF ( IAPROC .EQ. NAPERR ) WRITE (NDSE,1005). Format 1005 prints the WAVEWATCH III message "*** WAVEWATCH III ERROR IN W3WAVE : NEW IC1 FIELD BEFORE OLD IC1 FIELD".

Upon further investigation, it appears that the restart files lack IC1 (ice thickness) and IC5 (floe diameter), potentially causing this issue. I attempted to rectify this by configuring the extra fields to be written in the restart files using type%restart%extra = 'IC1 IC5' in the ww3_shel.nml file, but the error persists.

Additionally, diagnostics revealed that the ice thickness interval is invalid, as shown by the output:

    TEST W3WAVE : DT IC1  =************
    TEST W3WAVE : DT IC1  =************

In contrast, the ice concentration interval, which is read correctly from the restart file, is reported as TEST W3WAVE : DT ICE = 3600.0.

I'm currently delving deeper into this issue.
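
One plausible reading of the asterisks in the DT IC1 diagnostic: Fortran fills a fixed-width numeric field with `*` when the value does not fit, so the interval is probably very large in magnitude rather than merely negative. The Python sketch below imitates that behaviour. `fortran_f` is a hypothetical helper, and the width of 12 with one decimal is an assumption chosen to resemble the output above, not the actual format 9015.

```python
def fortran_f(value, width, decimals):
    """Approximate Fortran Fw.d output: right-justified, or all '*' on overflow."""
    text = f"{value:.{decimals}f}"
    if len(text) > width:
        # Fortran prints asterisks when the number is wider than the field.
        return "*" * width
    return text.rjust(width)


# A sane interval fits in the field; a garbage interval overflows it.
print("DT ICE =" + fortran_f(3600.0, 12, 1))
print("DT IC1 =" + fortran_f(-1.0e20, 12, 1))
```

The second call stands in for an interval computed from a time stamp that was never read from the restart file.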

aekiss commented 5 months ago

From today's meeting with @mvertens: are these relevant to this issue? They are fixed in the NorESM codebase.

  1. https://github.com/NorESMhub/CICE/issues/1
  2. https://github.com/NorESMhub/WW3/issues/6
  3. https://github.com/NorESMhub/WW3/issues/7 (same as the one above?)

ezhilsabareesh8 commented 5 months ago

Thanks @aekiss and @mvertens. This fix has resolved the failing restart read in WW3 when the wave/ice coupling is enabled. I have created a patch for w3iorsmd.f90 and will create a PR in access-om3.