geoschem / HEMCO

The Harmonized Emissions Component (HEMCO), developed by the GEOS-Chem Support Team.
https://hemco.readthedocs.io
Other
16 stars 32 forks source link

Error messages not printing to HEMCO.log #236

Open cbutenhoff opened 1 year ago

cbutenhoff commented 1 year ago

Name and Institution (Required)

Name: Chris Institution: Butenhoff

Confirm you have reviewed the following documentation

Description of your issue or question

I'm running HEMCO 3.0.0 within GCHP 13.1.2.

Apologies if this issue has been addressed by a newer version of HEMCO, but I believe I found a case where an HEMCO error message should have been written to the HEMCO.log file but wasn't.

My GCHP job failed with the following messages in the GCHP output file:

HEMCO: array pointer vertically flipped relative to MAPL Import ENERGY_LEVS
 HEMCO: array pointer vertically flipped relative to MAPL Import INDUSTRY_LEVS
  --- Do convection now
===============================================================================
GEOS-Chem ERROR: Error encountered in "HCO_Run"!
 -> at HCOI_GC_Run (in module GeosCore/hco_interface_gc_mod.F90)

THIS ERROR ORIGINATED IN HEMCO!  Please check the HEMCO log file for
additional error messages!
===============================================================================

===============================================================================
GEOS-Chem ERROR: Error encountered in "HCOI_GC_Run"!
 -> at Emissions_Run (in module GeosCore/emissions_mod.F90)
===============================================================================

pe=00002 FAIL at line=01201    gchp_chunk_mod.F90                       <Error calling EMISSIONS_RUN>
pe=00002 FAIL at line=03619    Chem_GridCompMod.F90                     <status=1>
pe=00002 FAIL at line=02672    Chem_GridCompMod.F90                     <status=1>
pe=00002 FAIL at line=01839    MAPL_Generic.F90                         <needs informative message>
pe=00002 FAIL at line=00573    GCHP_GridCompMod.F90                     <status=1>
pe=00002 FAIL at line=01839    MAPL_Generic.F90                         <needs informative message>
pe=00002 FAIL at line=01162    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=01086    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00792    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00922    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00247    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00211    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00154    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00030    GCHPctm.F90                              <status=1>
===============================================================================
GEOS-Chem ERROR: Error encountered in "HCO_Run"!
 -> at HCOI_GC_Run (in module GeosCore/hco_interface_gc_mod.F90)

THIS ERROR ORIGINATED IN HEMCO!  Please check the HEMCO log file for
additional error messages!

As you can see, the message states that the error occurred in HEMCO and to check for error messages in the HEMCO log file. However, even when I set the Verbose and Warning flags to the maximum value of 3, there were no error messages in HEMCO.log,

I eventually tracked the error down to an erroneous negative flux in one of my custom emissions inventories that caused the array TmpFlux in subroutine Hco_CalcEmis to be negative and it entered this IF-clause:

 ! Check for negative values according to the corresponding setting
       ! in the configuration file: 2 means allow negative values, 1 means
       ! set to zero and prompt a warning, else return with error.
       IF ( HcoState%Options%NegFlag /= 2 ) THEN
          ! CLB 08/21/2023
          PRINT*, 'CLB in hco_calc_mod.F90: CalcEmis4.1'

          IF ( ANY(TmpFlx < 0.0_hp) ) THEN

             ! CLB 08/21/2023
             PRINT*, 'CLB in hco_calc_mod.F90: CalcEmis4.2'

             ! Set to zero and prompt warning
             IF ( HcoState%Options%NegFlag == 1 ) THEN
                ! CLB 08/21/2023
                PRINT*, 'CLB in hco_calc_mod.F90: CalcEmis4.3'
                WHERE ( TmpFlx < 0.0_hp ) TmpFlx = 0.0_hp
                MSG = 'Negative emissions set to zero: '// TRIM(Dct%cName)
                CALL HCO_WARNING( HcoState%Config%Err, MSG, RC )

             ! Return with error
             ELSE
                ! CLB 08/21/2023
                PRINT*, 'CLB in hco_calc_mod.F90: CalcEmis4.4', Dct%HcoID
                MSG = 'Negative emissions in: '// TRIM(Dct%cName) // '. ' // &
                'To allow negatives, edit settings in the configuration file.'
                CALL HCO_ERROR( HcoState%Config%Err, MSG, RC )
                RETURN
             ENDIF
          ENDIF
       ENDIF

Since I didn't have the Negative values flag set to 1 in HEMCO_Config.rc, HEMCO should have written the error message in the ELSE clause.

The routine HCO_ERROR calls HCO_MSGErr:

SUBROUTINE HCO_MSGErr( Err, Msg, Sep1, Sep2, Verb )
!
! !INPUT PARAMETERS:
!
    TYPE(HcoErr),     POINTER                  :: Err
    CHARACTER(LEN=*), INTENT(IN   ), OPTIONAL  :: Msg
    CHARACTER(LEN=1), INTENT(IN   ), OPTIONAL  :: Sep1
    CHARACTER(LEN=1), INTENT(IN   ), OPTIONAL  :: Sep2
    INTEGER,          INTENT(IN   ), OPTIONAL  :: Verb
!
! !REVISION HISTORY:
!  23 Sep 2013 - C. Keller   - Initialization
!  See https://github.com/geoschem/hemco for complete history
!EOP
!------------------------------------------------------------------------------
!BOC
    LOGICAL  :: IsOpen
    INTEGER  :: LUN

    !======================================================================
    ! HCO_MSG begins here
    !======================================================================

    ! Check if Err object is indeed defined
    IF ( .NOT. ASSOCIATED(Err) ) THEN
       IsOpen = .FALSE.
    ELSE
       IsOpen = Err%LogIsOpen

    ! CLB 08/21/2023
       PRINT*, 'CLB In hco_error_mod.F90 l434', Msg

       ! Don't print if this is not the root CPU
       IF ( .NOT. Err%IsRoot ) RETURN

       ! CLB 08/21/2023
       PRINT*, 'CLB In hco_error_mod.F90 l440', Msg

       ! Don't print if verbose level is smaller than verbose level of this
       ! CPU.
       IF ( PRESENT( Verb ) ) THEN
          IF ( Verb < Err%Verbose ) RETURN
       ENDIF
    ENDIF

You can see the MSG is not printed if this is not the root CPU. I was running GCHP using multiple nodes/cores so apparently the process running this was not on the root CPU. When I comment this line out, the MSG is written to GCHP.out (or standard output) but not HEMCO.log.

I'm not sure how often this issue comes up, but fixing it could save considerable debugging time.

Thanks.

jimmielin commented 1 year ago

Hi, thanks for reporting this bug. I think that a lot of the HEMCO error handling is still reliant on the root thread and if something crashes at a non-root process it is very hard to debug.

I think to make HEMCO errors more MPI-safe it might be better to output HEMCO.log into separate files in the MPI environment, i.e., HEMCO.log.0000, HEMCO.log.0001, ... but there would be some performance impacts of I/O on non-root threads. Alternatively maybe the modeling framework would provide an I/O handle to print errors in (CESM captures standard output into one single file in cesm.log.... for example) but this is not universal (WRF still writes into separate files despite capturing STDOUT and STDERR). Not sure about MAPL so I am pinging @lizziel who knows much more about MAPL than I do. But we may have to work together to find a solution that works for all models coupling with HEMCO.