geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] Not printing the missing HEMCO data file that causes model crash #52

Closed JiaweiZhuang closed 4 years ago

JiaweiZhuang commented 4 years ago

Description

I am testing two different scripts to download HEMCO data for GCHP 12.3.2, for my on-going paper https://github.com/JiaweiZhuang/cloud-gchp-paper.

The original HEMCO.sh downloads most of the HEMCO directories and only uses --exclude to skip the very large ones. The new HEMCO_small.sh downloads only the minimum set of exact files, using the log parser script at https://github.com/geoschem/geos-chem-cloud/issues/25#issuecomment-548188720.

The model runs successfully with the original, large dataset, but crashes with the new streamlined, smaller dataset, without printing relevant error messages. I set DEBUG_LEVEL: 5 in CAP.rc and Verbose: 3 in HEMCO_Config.rc, but the model still doesn't print which data file is missing. DEBUG_LEVEL: 20 doesn't dump more useful error messages.

Log files and error messages

Complete log files: Success run, with original script / large HEMCO data:

Failed run, with new script / small HEMCO data:

The error occurs at:

 Opened shortcut bracket: GFAS
   - Skip content of this bracket:  T
 Closed shortcut bracket: GFAS
   - Skip following lines:  F
===============================================================================
GEOS-Chem ERROR: Error encountered in "HCOX_Init"!
 -> at HCOI_GC_Init (in module GeosCore/hcoi_gc_main_mod.F90)

THIS ERROR ORIGINATED IN HEMCO!  Please check the HEMCO log file for 
additional error messages!
===============================================================================

The successful one will proceed to:

 Opened shortcut bracket: GFAS
   - Skip content of this bracket:  T
 Closed shortcut bracket: GFAS
   - Skip following lines:  F
   --> Isoprene to SOA-Precursor  1.500000000000000E-002
   --> Isoprene direct to SOA (Simple)  1.500000000000000E-002
   --> Monoterpene to SOA-Precursor  4.409171294611898E-002
   --> Monoterpene direct to SOA (Simple)  4.409171294611898E-002
   --> Othrterpene to SOA-Precursor  5.000000000000000E-002
   --> Othrterpene direct to SOA (Simple)  5.000000000000000E-002

File list

Here's the HEMCO-small directory, with the total size of 50G:

ACET   ALD2          BCOC_BOND  C2H6_2010    DUST_DEAD  EMEP   GMI     MEGAN    NEI2011   OFFLINE_LNOX   POET   SOILNOX  TIMEZONES  VOLCANO
AEIC   AnnualScalar  BIOFUEL    DICE_Africa  EDGARv42   GEIA   IODINE  MIX      NH3       OFFLINE_SSALT  RETRO  STRAT    TOMS_SBUV
AFCID  APEI          BROMINE    DMS          EDGARv43   GFED4  MASKS   NEI2005  NOAA_GMD  OMOC           SOA    STREETS  UVALBEDO

Here's the HEMCO-large directory, with the total size of 168G:

ACET          BIOBURN    CORBETT_SHIP  FINN         kgyr_to_kgm2s.sh  NEI2005            OFFLINE_LNOX   raw_data    STRAT      VISTAS
AEIC          BIOFUEL    COUNTRY_ID    GEIA         LIGHTNOX          NEI2011            OFFLINE_SFLUX  RCP         STREETS    VOLCANO
AFCID         BRAVO      DICE_Africa   GFED2        MACCITY           NEI2011_ag_only    OFFLINE_SSALT  README      TAGGED_CO  WEEKSCALE
ALD2          BROMINE    DMS           GFED3        MAP_A2A           NEI2011ek          OH             RETRO       TAGGED_O3  XIAO
AnnualScalar  C2H6_2010  DUST_DEAD     GFED4        MASAGE_NH3        NEI99              OLSON_MAP      RONO2       TIMEZONES  Yuan_XLAI
APEI          CAC        DUST_GINOUX   GMI          MASKS             NH3                OMOC           RRTMG       TNO
ARCTAS_SHIP   CEDS       EDGAR         grids        MEGAN             NOAA_GMD           OXIDANTS       SAMPLE_BCs  TOMS_SBUV
BB4CMIP6      CH3I       EDGARv42      HTAP         MERCURY           O3                 PARANOX        SF6         TrashEmis
BCOC_BOND     CHLA       EDGARv43      ICOADS_SHIP  MIX               OFFLINE_AEROSOL    POET           SOA         UCX
BCOC_COOKE    CO2        EMEP          IODINE       MODIS_XLAI        OFFLINE_LIGHTNING  POPs           SOILNOX     UVALBEDO

Is there a way to find out which directory is missing in the HEMCO-small one?

JiaweiZhuang commented 4 years ago

Got the exact same error if the entire HEMCO directory (MainDataDir in rundir) is set to an empty directory.

lizziel commented 4 years ago

The HEMCO log write appears to be interrupted during initialization of DICE_KEROSENE_OCPI. HEMCO is in the process of iterating over lines in HEMCO_Config.rc, in reverse order. Have you changed your HEMCO_Config.rc in any way?

Do you get a different entry in the HEMCO log if MainDataDir is set to an empty directory?

lizziel commented 4 years ago

I should clarify, HEMCO_Config.rc is already read, but now the linked list is being iterated over in reverse order to initialize.

JiaweiZhuang commented 4 years ago

Have you changed your HEMCO_Config.rc in any way?

No, except the verbose level.

Do you get a different entry in the HEMCO log if MainDataDir is set to an empty directory?

The ending entry is different each time and seems completely random, regardless of whether an empty directory or the HEMCO-small directory is used. Two tries with identical configuration:

I guess another MPI process is causing the crash, while the main process can be at a random stage when it happens.

lizziel commented 4 years ago

I recently had an issue in dev/12.7 where I did not have an updated HEMCO_Config.rc for a chemistry update that required new files, and I also got random location crashes, although not in this specific stage of HEMCO. It seems that there needs to be more error handling in HEMCO regarding missing files, not just in 12.3.2 but in the current version too.

Since the location of the problem isn't easy to pinpoint, the best way forward may be to compare files side-by-side, or brute force add them all back in and reduce with some kind of algorithm to minimize runs until you hone on the missing ones.

JiaweiZhuang commented 4 years ago

brute force add them all back in and reduce with some kind of algorithm to minimize runs until you hone on the missing ones.

I can't believe that I just did a manual binary search and found the missing file 😂! It's PARANOX.

Here are all the 55 files that exist in HEMCO-big but not HEMCO-small:

```
SAMPLE_BCs TAGGED_O3 OH SF6 WEEKSCALE HTAP CEDS NEI2011_ag_only
GFED2 BRAVO RRTMG CH3I DUST_GINOUX MERCURY COUNTRY_ID OLSON_MAP
NEI2011ek RCP CORBETT_SHIP O3 OXIDANTS CHLA MASAGE_NH3 OFFLINE_AEROSOL
FINN PARANOX TrashEmis MODIS_XLAI grids BIOBURN MACCITY README
CO2 ICOADS_SHIP MAP_A2A TNO ARCTAS_SHIP GFED3 BCOC_COOKE RONO2
POPs VISTAS TAGGED_CO LIGHTNOX EDGAR raw_data kgyr_to_kgm2s.sh NEI99
OFFLINE_SFLUX OFFLINE_LIGHTNING BB4CMIP6 XIAO UCX Yuan_XLAI CAC
```
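A listing like the one above can be generated rather than assembled by hand. The following is a minimal sketch, not the command actually used in this thread: `missing_entries` is a hypothetical helper name, and it compares only the top-level names of the two directories.

```shell
# List top-level entries present in BIG_DIR but absent from SMALL_DIR.
# Usage: missing_entries BIG_DIR SMALL_DIR
# (missing_entries is an illustrative name, not part of GCHP or HEMCO.)
missing_entries() {
    big_dir=$1
    small_dir=$2
    tmp_big=$(mktemp) || return 1
    tmp_small=$(mktemp) || { rm -f "$tmp_big"; return 1; }
    # comm needs both inputs sorted with the same collation, hence LC_ALL=C.
    ls -1 "$big_dir" | LC_ALL=C sort > "$tmp_big"
    ls -1 "$small_dir" | LC_ALL=C sort > "$tmp_small"
    # -23 suppresses lines unique to the second file and lines common to both,
    # leaving only the names that exist in BIG_DIR alone.
    LC_ALL=C comm -23 "$tmp_big" "$tmp_small"
    rm -f "$tmp_big" "$tmp_small"
}
```

For example, `missing_entries HEMCO_big HEMCO_small > candidates` would write one name per line, which is a convenient starting list for a search over the candidates.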

A binary search will find the proper file in log2(55) ~ 6 steps. The search procedure:

  1. Select half (~27) of the files and put them into a text file files_selected.
  2. Inside the HEMCO_small directory, symlink those files from HEMCO_big:

         # Link each selected entry from HEMCO_big into the current directory
         while read -r line
         do
            ln -s "../HEMCO_big/$line" ./
         done < files_selected

  3. Run the model to see whether it crashes or succeeds.
  4. Whether it crashes or not, remove the symlinks using the current files_selected before modifying it:

         # Remove only the symlinks created in step 2
         while read -r line
         do
            rm "$line"
         done < files_selected

  5. If the simulation succeeded, halve the current files_selected. If it failed, change files_selected to the other half of the file list. Iterate (to step 2) until reaching a single file.
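The manual procedure above could be automated along these lines. This is a sketch under stated assumptions, not a tested GCHP workflow: `bisect_missing` is a hypothetical helper, the run command passed to it is assumed to exit with status 0 only when the simulation succeeds, the candidate file is assumed to hold one newline-terminated name per line, and exactly one missing entry is assumed to cause the crash.

```shell
# Usage: bisect_missing CANDIDATES_FILE BIG_DIR SMALL_DIR RUN_CMD [ARGS...]
# BIG_DIR should be an absolute path so the symlinks resolve correctly.
# (bisect_missing and RUN_CMD are illustrative; RUN_CMD must exit 0 on success.)
bisect_missing() {
    candidates=$1
    big_dir=$2
    small_dir=$3
    shift 3
    work=$(mktemp -d) || return 1
    cp "$candidates" "$work/remaining"
    n=$(($(wc -l < "$work/remaining")))
    while [ "$n" -gt 1 ]; do
        # Split the remaining candidates into a selected half and the rest.
        half=$(( (n + 1) / 2 ))
        head -n "$half" "$work/remaining" > "$work/selected"
        tail -n +"$(( half + 1 ))" "$work/remaining" > "$work/rest"
        # Temporarily link the selected entries into the small directory.
        while IFS= read -r name; do
            ln -s "$big_dir/$name" "$small_dir/$name"
        done < "$work/selected"
        if "$@"; then ok=1; else ok=0; fi
        # Always remove the symlinks before changing the selection.
        while IFS= read -r name; do
            rm -f "$small_dir/$name"
        done < "$work/selected"
        # Success means the needed entry was linked; failure means it is in the rest.
        if [ "$ok" -eq 1 ]; then
            cp "$work/selected" "$work/remaining"
        else
            cp "$work/rest" "$work/remaining"
        fi
        n=$(($(wc -l < "$work/remaining")))
    done
    cat "$work/remaining"
    rm -rf "$work"
}
```

If more than one entry is actually required, this simple halving will not isolate them, because no single half restores a working run; in that case the candidates would have to be added back cumulatively instead.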

JiaweiZhuang commented 4 years ago

I still have no idea why PARANOX is causing the problem. It does not exist in either ExtData.rc or the output log.

The only place it appears is HEMCO_Config.rc:

102     ParaNOx                : on    NO/NO2/O3/HNO3
    --> LUT data format        :       txt
    --> LUT source dir         :       $ROOT/PARANOX/v2015-02

If GCHP is using this data, how does it get read? Should it actually appear in ExtData.rc? Maybe that's a bug in ExtData.rc?

JiaweiZhuang commented 4 years ago

It seems that there needs to be more error handling in HEMCO regarding missing files, not just in 12.3.2 but in the current version too.

As a minimum requirement, the model should probably print a meaningful error message if the entire HEMCO directory is empty.
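Until the model reports this itself, a pre-flight check outside the model can catch the most obvious cases. The sketch below is an illustration only: `check_hemco_dirs` is a hypothetical helper, and it assumes that data paths in HEMCO_Config.rc are written with a literal `$ROOT/` prefix, as in the ParaNOx entry quoted earlier in this thread. It does not know which extensions are switched on, so it may also report directories that a given run never reads.

```shell
# Usage: check_hemco_dirs HEMCO_CONFIG DATA_ROOT
# Prints every top-level directory referenced as $ROOT/<name>/... in the
# config that does not exist under DATA_ROOT; returns 1 if any is missing.
# (check_hemco_dirs is an illustrative name, not part of HEMCO.)
check_hemco_dirs() {
    config=$1
    data_root=$2
    status=0
    # Keep only the first path component after each literal "$ROOT/" token.
    for name in $(grep -o '\$ROOT/[^/[:space:]]*' "$config" \
                      | cut -d/ -f2 | LC_ALL=C sort -u); do
        if [ ! -e "$data_root/$name" ]; then
            echo "missing: $data_root/$name"
            status=1
        fi
    done
    return "$status"
}
```

Run against an empty data directory, this would list every referenced top-level directory, which is the kind of message the model itself currently fails to give.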

lizziel commented 4 years ago

I can't believe that I just did a manual binary search and found the missing file

Great!

As a minimum requirement, the model should probably print a meaningful error message if the entire HEMCO directory is empty.

I think everyone will agree on that.

The issue you are having is occurring in HEMCO and not ExtData. Typically when a run fails so early and has to do with emissions it is a HEMCO issue, often a typo in the file but possibly other issues too with the setup. I'm just guessing here, but maybe it hit a problem reading the file on one of the threads. The code to read the text file is in hcox_paranox_mod.F90, and the call to paranox init is in hcox_driver_mod. All the error handling looks like this:

IF ( RC /= HCO_SUCCESS ) RETURN

So you aren't going to get any helpful messages about where it fails. Like GMAO is doing in MAPL and core GEOS-Chem, we should replace all of these with a message that prints prior to returning.

yantosca commented 4 years ago

@JiaweiZhuang, the PARANOX extension reads lookup tables that cannot be parsed by HEMCO's normal I/O.

In 12.6.1, we now should print all files that are read by GEOS-Chem and HEMCO to either the log file or the HEMCO.log file. I know you are using 12.3.2 but you can look at what I did in 12.6.1 to add the extra file printouts.

Also, yes -- many of the error trappings in HEMCO were originally added to just return but not throw an error. We need to fix those.

lizziel commented 4 years ago

@yantosca are those prints for files read only printed by root? If yes, this might still not catch the problematic file if running with MPI.

yantosca commented 4 years ago

@lizziel Yes, I think they are printed to root.

lizziel commented 4 years ago

I created https://github.com/geoschem/geos-chem/issues/119 to address the HEMCO error handling improvements needed to prevent this issue in the future.