Open cbutenhoff opened 1 year ago
Hi, thanks for reporting this bug. I think a lot of the HEMCO error handling still relies on the root process, so if something crashes on a non-root process it is very hard to debug.
I think to make HEMCO errors more MPI-safe it might be better to write HEMCO.log to separate files in the MPI environment, i.e., HEMCO.log.0000, HEMCO.log.0001, ..., but there would be some performance impact from I/O on non-root processes. Alternatively, the modeling framework might provide an I/O handle to print errors to (CESM captures standard output into one single file in cesm.log.... for example), but this is not universal (WRF still writes into separate files despite capturing STDOUT and STDERR). I'm not sure about MAPL, so I am pinging @lizziel, who knows much more about MAPL than I do. But we may have to work together to find a solution that works for all models coupling with HEMCO.
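To make the per-rank log idea concrete, here is a minimal sketch in Python (not HEMCO's Fortran, and not an existing HEMCO or MAPL API). The helper names rank_log_path and log_error, the four-digit padding, and the message format are all assumptions for illustration; the rank is passed in by hand rather than queried from MPI.

```python
import os
import tempfile


def rank_log_path(base: str, rank: int, width: int = 4) -> str:
    """Return a per-rank log file name such as HEMCO.log.0003 (hypothetical helper)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    # Zero-pad the rank so the files sort in rank order.
    return f"{base}.{rank:0{width}d}"


def log_error(base: str, rank: int, message: str) -> str:
    """Append an error line to this rank's own log file and return the file path."""
    path = rank_log_path(base, rank)
    # Append mode, so repeated errors from the same rank accumulate in one file.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"[rank {rank}] ERROR: {message}\n")
    return path


if __name__ == "__main__":
    # The ranks are made up here; in a real run they would come from the MPI library.
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "HEMCO.log")
        for rank in (0, 1):
            print(os.path.basename(log_error(base, rank, "example error message")))
```

Since each rank opens a file only when it actually has an error to report, the extra I/O on non-root processes in this sketch is limited to the failure path.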
Name and Institution (Required)
Name: Chris Institution: Butenhoff
Description of your issue or question
I'm running HEMCO 3.0.0 within GCHP 13.1.2.
Apologies if this issue has been addressed by a newer version of HEMCO, but I believe I found a case where a HEMCO error message should have been written to the HEMCO.log file but wasn't.
My GCHP job failed with the following messages in the GCHP output file:
As you can see, the message states that the error occurred in HEMCO and to check for error messages in the HEMCO log file. However, even when I set the Verbose and Warning flags to the maximum value of 3, there were no error messages in HEMCO.log.
I eventually tracked the error down to an erroneous negative flux in one of my custom emissions inventories that caused the array TmpFlux in subroutine Hco_CalcEmis to be negative and it entered this IF-clause:
Since I didn't have the Negative values flag set to 1 in HEMCO_Config.rc, HEMCO should have written the error message in the ELSE clause.
The routine HCO_ERROR calls HCO_MSGErr:
You can see the MSG is not printed if this is not the root CPU. I was running GCHP using multiple nodes/cores so apparently the process running this was not on the root CPU. When I comment this line out, the MSG is written to GCHP.out (or standard output) but not HEMCO.log.
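The behaviour described above can be reproduced with a toy model. This is a sketch in Python, not the actual HEMCO Fortran source; both function names, the is_root flag, and the message format are hypothetical and only illustrate how a root-only guard drops an error raised on another process.

```python
import io


def report_error_root_only(msg: str, is_root: bool, log) -> bool:
    """Toy version of the reported behaviour: non-root processes drop the message."""
    if not is_root:
        return False  # the error text is silently lost
    log.write(f"ERROR: {msg}\n")
    return True


def report_error_any_rank(msg: str, rank: int, log) -> bool:
    """One possible alternative: every process reports errors, tagged with its rank."""
    log.write(f"ERROR (rank {rank}): {msg}\n")
    return True


if __name__ == "__main__":
    gated, ungated = io.StringIO(), io.StringIO()
    # The same error raised on a non-root process, with and without the guard.
    report_error_root_only("negative emissions found", is_root=False, log=gated)
    report_error_any_rank("negative emissions found", rank=7, log=ungated)
    print(repr(gated.getvalue()))
    print(repr(ungated.getvalue()))
```

With the guard in place the first buffer stays empty, which matches the empty HEMCO.log I saw; without it the message survives and says which process raised it.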
I'm not sure how often this issue comes up, but fixing it could save considerable debugging time.
Thanks.