Closed JiaweiZhuang closed 5 years ago
Got the exact same error if the entire HEMCO directory (MainDataDir in rundir) is set to an empty directory.
The HEMCO log write appears to be interrupted during initialization of DICE_KEROSENE_OCPI. HEMCO is in the process of iterating over lines in HEMCO_Config.rc, in reverse order. Have you changed your HEMCO_Config.rc in any way?
Do you get a different entry in the HEMCO log if MainDataDir is set to an empty directory?
I should clarify, HEMCO_Config.rc is already read, but now the linked list is being iterated over in reverse order to initialize.
Have you changed your HEMCO_Config.rc in any way?
No, except the verbose level.
Do you get a different entry in the HEMCO log if MainDataDir is set to an empty directory?
The ending entry is different each time and seems completely random, regardless of whether an empty directory or the HEMCO-small directory is used. Two tries with identical configuration:
Container DICE_GASFLARE_ALK4
Container DICE_AGBURNING_HCOOH
I guess another MPI process is causing the crash, and the main process can be at a random stage when that happens.
I recently had an issue in dev/12.7 where I did not have an updated HEMCO_Config.rc for a chemistry update that required new files, and I also got random location crashes, although not in this specific stage of HEMCO. It seems that there needs to be more error handling in HEMCO regarding missing files, not just in 12.3.2 but in the current version too.
Since the location of the problem isn't easy to pinpoint, the best way forward may be to compare files side-by-side, or to brute-force add them all back in and reduce with some kind of algorithm to minimize runs until you home in on the missing ones.
brute force add them all back in and reduce with some kind of algorithm to minimize runs until you hone on the missing ones.
I can't believe that I just did a manual binary search and found the missing file 😂! It's PARANOX.
Here are all the 55 files that exist in HEMCO-big but not HEMCO-small:
A binary search will find the proper file in log2(55) ~ 6 steps. The search procedure:
1. Put half of the candidate files into a list file, files_selected.
2. In the HEMCO_small directory, symlink those files from HEMCO_big:

    cat files_selected | while read -r line
    do
        ln -s "../HEMCO_big/$line" ./
    done

3. Run the model to see whether it crashes or succeeds.
4. Whether it crashes or not, remove the symlinks using the current files_selected before modifying it:

    cat files_selected | while read -r line
    do
        rm "$line"
    done

5. If the run succeeded, the missing file is within files_selected, so keep only half of files_selected. If it failed, change files_selected to the other half of the file list. Iterate (to step 2) until reaching a single file.

I still have no idea why PARANOX is causing the problem. It does not exist in either ExtData.rc or the output log.
The only place it appears is HEMCO_Config.rc:
102 ParaNOx : on NO/NO2/O3/HNO3
--> LUT data format : txt
--> LUT source dir : $ROOT/PARANOX/v2015-02
If GCHP is using this data, how does it get read? Should it actually appear in ExtData.rc? Maybe that's a bug in ExtData.rc?
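The manual bisection described above could be scripted roughly as follows. This is only a sketch under several assumptions that are not from the thread: the candidate list is a text file with one top-level entry per line (ending in a newline), exactly one missing entry causes the crash, the big directory is given as an absolute path, and `run_cmd` is a hypothetical user-supplied command that returns 0 when the model run succeeds.

```shell
# bisect_missing CANDIDATES_FILE SMALL_DIR BIG_DIR RUN_CMD
# Prints the single candidate whose presence makes RUN_CMD succeed.
bisect_missing() {
    list=$1; small=$2; big=$3; run_cmd=$4
    work=$(mktemp -d)
    cp "$list" "$work/cand"

    while [ "$(( $(wc -l < "$work/cand") ))" -gt 1 ]; do
        total=$(( $(wc -l < "$work/cand") ))
        n=$(( total / 2 ))
        # Split the remaining candidates into two halves.
        head -n "$n" "$work/cand" > "$work/half"
        tail -n +"$(( n + 1 ))" "$work/cand" > "$work/rest"

        # Symlink the selected half from the big directory.
        while IFS= read -r f; do
            ln -s "$big/$f" "$small/$f"
        done < "$work/half"

        # Run the model (hypothetical command) and record the outcome.
        if "$run_cmd"; then ok=1; else ok=0; fi

        # Always remove the symlinks before changing the selection.
        while IFS= read -r f; do
            rm "$small/$f"
        done < "$work/half"

        # Success: the culprit is in the linked half. Failure: it is in the rest.
        if [ "$ok" -eq 1 ]; then
            mv "$work/half" "$work/cand"
        else
            mv "$work/rest" "$work/cand"
        fi
    done

    cat "$work/cand"
    rm -rf "$work"
}
```

With 55 candidates this needs about 6 model runs, matching the log2(55) estimate. If more than one entry is missing, the simple halving no longer works and a delta-debugging style reduction would be needed instead.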
It seems that there needs to be more error handling in HEMCO regarding missing files, not just in 12.3.2 but in the current version too.
As a minimum requirement, the model should probably print a meaningful error message if the entire HEMCO directory is empty.
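Until the model itself reports this, a run script could do a pre-flight check before launching. The function below is a minimal sketch, not part of GCHP or HEMCO; the name `check_data_dir` and the way the directory path is passed in are assumptions made for illustration.

```shell
# check_data_dir DIR
# Returns non-zero with a readable message if DIR is missing or empty.
check_data_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "ERROR: data directory '$dir' does not exist" >&2
        return 1
    fi
    # ls -A lists everything except . and .. so empty output means empty dir.
    if [ -z "$(ls -A "$dir")" ]; then
        echo "ERROR: data directory '$dir' is empty" >&2
        return 1
    fi
    return 0
}
```

A wrapper could call this on the directory that MainDataDir points to and abort before the model starts. It only catches the fully empty case, not an individual missing file such as PARANOX.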
I can't believe that I just did a manual binary search and found the missing file
Great!
As a minimum requirement, the model should probably print a meaningful error message if the entire HEMCO directory is empty.
I think everyone will agree on that.
The issue you are having is occurring in HEMCO and not ExtData. Typically when a run fails so early and has to do with emissions it is a HEMCO issue, often a typo in the file but possibly other issues too with the setup. I'm just guessing here, but maybe it hit a problem reading the file on one of the threads. The code to read the text file is in hcox_paranox_mod.F90, and the call to paranox init is in hcox_driver_mod. All the error handling looks like this:
IF ( RC /= HCO_SUCCESS ) RETURN
So you aren't going to get any helpful messages about where it fails. Like GMAO is doing in MAPL and core GEOS-Chem, we should replace all of these with a message that prints prior to returning.
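To get a rough sense of how many such silent returns there are, one could search the source tree for the pattern quoted above. This is a sketch only: the function name is hypothetical, the regular expression assumes the statement sits on one line in the form shown, and the directory to search is whatever path holds the HEMCO source in a given checkout.

```shell
# find_silent_returns SRC_DIR
# Lists file:line:text for every bare "IF ( RC /= HCO_SUCCESS ) RETURN".
find_silent_returns() {
    grep -rniE \
      'IF[[:space:]]*\([[:space:]]*RC[[:space:]]*/=[[:space:]]*HCO_SUCCESS[[:space:]]*\)[[:space:]]*RETURN' \
      "$1"
}
```

Each hit is a place where a message could be printed before returning.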
@JiaweiZhuang, the PARANOX extension reads lookup tables that cannot be parsed by HEMCO's normal I/O.
In 12.6.1, we now should print all files that are read by GEOS-Chem and HEMCO to either the log file or the HEMCO.log file. I know you are using 12.3.2 but you can look at what I did in 12.6.1 to add the extra file printouts.
Also, yes -- many of the error trappings in HEMCO were originally added to just return but not throw an error. We need to fix those.
@yantosca are those prints for files read only printed by root? If yes, this might still not catch the problematic file if running with MPI.
@lizziel Yes, I think they are printed to root.
I created https://github.com/geoschem/geos-chem/issues/119 to address the HEMCO error handling improvements needed to prevent this issue in the future.
Description
I am testing two different scripts to download HEMCO data for GCHP 12.3.2, for my on-going paper https://github.com/JiaweiZhuang/cloud-gchp-paper.
The original HEMCO.sh downloads most of HEMCO directories, and only uses --exclude to skip very large ones. The new HEMCO_small.sh only downloads the minimum, exact files, using the log parser script at https://github.com/geoschem/geos-chem-cloud/issues/25#issuecomment-548188720.

The model runs successfully with the original, large dataset, but crashes with the new streamlined, smaller dataset, without printing relevant error messages. I set DEBUG_LEVEL: 5 in CAP.rc and Verbose: 3 in HEMCO_Config.rc, but the model still doesn't print which data file is missing. DEBUG_LEVEL: 20 doesn't dump more useful error messages.

Log files and error messages
Complete log files: Success run, with original script / large HEMCO data:
Failed run, with new script / small HEMCO data:
The error occurs at:
The successful one will proceed to:
File list
Here's the HEMCO-small directory, with the total size of 50G:
Here's the HEMCO-large directory, with the total size of 168G:
Is there a way to find out which directory is missing in the HEMCO-small one?
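One way to at least enumerate the candidates is to compare the two directory trees. The following is a sketch using standard `find`, `sort`, and `comm`; the function name and the two directory arguments are placeholders, not paths from the original setup.

```shell
# list_missing BIG_DIR SMALL_DIR
# Prints paths (relative to each root) present in BIG_DIR but not in SMALL_DIR.
list_missing() {
    big=$1; small=$2
    tmp_big=$(mktemp); tmp_small=$(mktemp)
    # List regular files and symlinks relative to each root, sorted for comm.
    (cd "$big" && find . \( -type f -o -type l \) | LC_ALL=C sort) > "$tmp_big"
    (cd "$small" && find . \( -type f -o -type l \) | LC_ALL=C sort) > "$tmp_small"
    # comm -23 keeps lines that appear only in the first file.
    LC_ALL=C comm -23 "$tmp_big" "$tmp_small"
    rm -f "$tmp_big" "$tmp_small"
}
```

Piping the output through `cut -d/ -f2 | sort -u` would reduce it to the top-level directories that contain at least one missing file. This only shows what differs, not which difference causes the crash.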