GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

Feature request: Better error messages when simulations die due to restart file issues #2416

Open yantosca opened 8 months ago

yantosca commented 8 months ago

I was doing some work in GCHP and found the following MAPL-related issues:

  1. If the initial restart file is missing, the simulation crashes but does not print a message such as "Restart/checkpoint file not found". The last thing the logging output shows is that Dynamics is being initialized.

  2. If the initial restart file is present but bootstrapping of missing species is disabled, the same thing happens: the run dies without an appropriate error message.

It would be great if there could be a graceful exit with an error message to let users know that the issue is in the restart file. This would save a lot of time and energy in debugging.

Tagging @lizziel

bena-nasa commented 8 months ago

@yantosca I can clean this up and provide a better message.

tclune commented 8 months ago

Yes - we generally want any failures related to files to identify the file involved. But ... we have been lazy in that we wait for users to identify specific scenarios.

tclune commented 8 months ago

Bob - do you need a backport to a specific MAPL version?

yantosca commented 8 months ago

@tclune: I think we are using MAPL 2.26 for now (@lizziel can confirm). We will update to MAPL 3 when it is ready.

lizziel commented 8 months ago

Yes, we are using 2.26.

mathomp4 commented 8 months ago

> @tclune: I think we are using MAPL 2.26 for now (@lizziel can confirm). We will update to MAPL 3 when it is ready.

Well, that will be a while in the future and a big change. :)

But dang. MAPL 2.26. I haven't heard that in a long time! (As I get ready to release MAPL 2.42...)

bena-nasa commented 8 months ago

@yantosca @lizziel For the first part, the case where the file simply does not exist: I just did an experiment and removed the dynamics restart (in our case it is named fvcore_internal_rst). The code does appear to print a message saying the restart is not there. This is what I see:

 ERROR: Required restart fvcore_internal_rst does not exist!
pe=00000 FAIL at line=06173    MAPL_Generic.F90                         <unknown error>
pe=00008 FAIL at line=06173    MAPL_Generic.F90                         <unknown error>
pe=00008 FAIL at line=01673    MAPL_Generic.F90                         <unknown error>
pe=00008 FAIL at line=01107    MAPL_Generic.F90                         <status=-1>

The reason the log shows dynamics being initialized is that reading the restart is part of initialization. This message is in an if block: if the file does not exist, the code prints the message and returns a failure. That failure then propagates up the stack and, in this case, ends up in the initialize of dynamics. https://github.com/GEOS-ESM/MAPL/blob/v2.26.0/generic/MAPL_Generic.F90#L5983

Did you perhaps see this message and miss it? MAPL 2.26 has this message, so it is not something added after that version.
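The cascade described above can be illustrated with a small, language-agnostic sketch in Python. This is not MAPL code: the function names, the `report_fail` helper, and the status values are hypothetical stand-ins for a pattern in which each caller checks the returned status, logs its own failure line, and passes the status up.

```python
import os

SUCCESS = 0
FAILURE = -1  # hypothetical status value, chosen only for this sketch


def report_fail(where, log):
    """Record one failure line, loosely imitating the 'FAIL at' lines above."""
    log.append(f"FAIL in {where}")


def read_restart(path, log):
    # Lowest level: the only place that knows which file is missing.
    if not os.path.exists(path):
        log.append(f"ERROR: Required restart {path} does not exist!")
        report_fail("read_restart", log)
        return FAILURE
    return SUCCESS


def generic_initialize(path, log):
    # Middle level: checks the status and passes the failure up.
    status = read_restart(path, log)
    if status != SUCCESS:
        report_fail("generic_initialize", log)
        return status
    return SUCCESS


def dynamics_initialize(path, log):
    # Top level: this is where the failure finally surfaces.
    status = generic_initialize(path, log)
    if status != SUCCESS:
        report_fail("dynamics_initialize", log)
        return status
    return SUCCESS


log = []
status = dynamics_initialize("no_such_dir/fvcore_internal_rst", log)
print("\n".join(log))
```

In this sketch only the first line names the file; every later line just records that a caller saw a failure, which is why the file name is easy to miss if the first line is lost or filtered out.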

yantosca commented 8 months ago

gchp.20190101_0000z.log:1:Restart symlink gchp_restart.nc4 set to ./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4
gchp.20190101_0000z.log:17: NOT using buffer I/O for file: cap_restart
gchp.20190101_0000z.log:18:            CAP: INFO: Read CAP restart properly, Current Date =   2019/01/01
gchp.20190101_0000z.log:275: Character Resource Parameter: GCHPchem_INTERNAL_RESTART_FILE:gchp_restart.nc4
gchp.20190101_0000z.log:277: Using parallel NetCDF for file: gchp_restart.nc4
gchp.20190101_0000z.log:278:   Bootstrapping Variable: ARCHV_DRY_TOTN in gchp_restart.nc4
gchp.20190101_0000z.log:279:   Bootstrapping Variable: ARCHV_WET_TOTN in gchp_restart.nc4
gchp.20190101_0000z.log:280:   Bootstrapping Variable: AREA in gchp_restart.nc4
gchp.20190101_0000z.log:281:   Bootstrapping Variable: AeroH2O_SNA in gchp_restart.nc4
gchp.20190101_0000z.log:282:   Bootstrapping Variable: DEP_RESERVOIR in gchp_restart.nc4
gchp.20190101_0000z.log:283:   Bootstrapping Variable: DRYPERIOD in gchp_restart.nc4
gchp.20190101_0000z.log:284:   Bootstrapping Variable: DryDepNitrogen in gchp_restart.nc4
gchp.20190101_0000z.log:285:   Bootstrapping Variable: GCCTROPP in gchp_restart.nc4
gchp.20190101_0000z.log:286:   Bootstrapping Variable: GWET_PREV in gchp_restart.nc4
gchp.20190101_0000z.log:287:   Bootstrapping Variable: H2O2AfterChem in gchp_restart.nc4
gchp.20190101_0000z.log:288:   Bootstrapping Variable: JNO2 in gchp_restart.nc4
gchp.20190101_0000z.log:289:   Bootstrapping Variable: JOH in gchp_restart.nc4
gchp.20190101_0000z.log:290:   Bootstrapping Variable: KPPHvalue in gchp_restart.nc4
gchp.20190101_0000z.log:291:   Bootstrapping Variable: LAI_PREVDAY in gchp_restart.nc4
gchp.20190101_0000z.log:292:   Bootstrapping Variable: ORVCSESQ in gchp_restart.nc4
gchp.20190101_0000z.log:293:   Bootstrapping Variable: PARDF_DAVG in gchp_restart.nc4
gchp.20190101_0000z.log:294:   Bootstrapping Variable: PARDR_DAVG in gchp_restart.nc4
gchp.20190101_0000z.log:295:   Bootstrapping Variable: PFACTOR in gchp_restart.nc4
gchp.20190101_0000z.log:296:   Bootstrapping Variable: SO2AfterChem in gchp_restart.nc4
gchp.20190101_0000z.log:297:   Bootstrapping Variable: STATE_PSC in gchp_restart.nc4
gchp.20190101_0000z.log:298:   Bootstrapping Variable: T_DAVG in gchp_restart.nc4
gchp.20190101_0000z.log:299:   Bootstrapping Variable: T_PREVDAY in gchp_restart.nc4
gchp.20190101_0000z.log:300:   Bootstrapping Variable: WetDepNitrogen in gchp_restart.nc4
slurm-6471802.out:1:++ sed 's/ /_/g' cap_restart
slurm-6471802.out:31:++ Require_Species_in_Restart=0
slurm-6471802.out:122:++ print_msg 'WARNING: write restarts by o-server is disabled since <1000 cores'
slurm-6471802.out:124:++ replace_val WRITE_RESTART_BY_OSERVER NO GCHP.rc
slurm-6471802.out:125:++ KEY=WRITE_RESTART_BY_OSERVER
slurm-6471802.out:131:++ sed 's|^\([\t ]*WRITE_RESTART_BY_OSERVER[\t ]*:[\t ]*\).*|\1NO|' GCHP.rc
slurm-6471802.out:511:++ print_msg 'Initial restart settings:'
slurm-6471802.out:515:++ replace_val INITIAL_RESTART_SPECIES_REQUIRED 0 GCHP.rc
slurm-6471802.out:516:++ KEY=INITIAL_RESTART_SPECIES_REQUIRED
slurm-6471802.out:522:++ sed 's|^\([\t ]*INITIAL_RESTART_SPECIES_REQUIRED[\t ]*:[\t ]*\).*|\10|' GCHP.rc
slurm-6471802.out:551:+ source setRestartLink.sh
slurm-6471802.out:552:++ rst_link_name=gchp_restart.nc4
slurm-6471802.out:553:++ '[' -f cap_restart ']'
slurm-6471802.out:554:+++ sed 's/ /_/g' cap_restart
slurm-6471802.out:560:++ rst_target=./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4
slurm-6471802.out:561:++ [[ -f ./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4 ]]
slurm-6471802.out:562:++ ln -nsf ./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4 gchp_restart.nc4
slurm-6471802.out:563:++ echo 'Restart symlink gchp_restart.nc4 set to ./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4'
slurm-6471802.out:573:Shell debugging restarted
slurm-6471802.out:586:Shell debugging restarted
slurm-6471802.out:596:Shell debugging restarted
slurm-6471802.out:606:Shell debugging restarted
slurm-6471802.out:616:Shell debugging restarted
slurm-6471802.out:626:Shell debugging restarted
slurm-6471802.out:636:Shell debugging restarted
slurm-6471802.out:682:Shell debugging restarted

@bena-nasa: I've grepped for "restart" in the *.log and *.out files and haven't seen this message. Maybe it isn't in this version.

yantosca commented 8 months ago

Log files from the simulation: allPEs.log.txt.log gchp.20190101_0000z.log.txt.log logfile.000000.out.txt slurm-6471802.out.txt

lizziel commented 8 months ago

@yantosca, the slurm log file you posted contains the line below, which indicates a problem with imports. I wonder if the logs you posted are from a different run?

pe=00000 FAIL at line=00803 ExtDataGridCompMod.F90 <Found 157 unfulfilled imports in extdata>

lizziel commented 8 months ago

I looked into this and see we have the code in MAPL that would print a file not found message: https://github.com/geoschem/MAPL/blob/277e83f60878a7c896757d8d41eab90a0cb2bab3/generic/MAPL_Generic.F90#L5983-L5990

I think what is happening is that we don't meet the criteria to trigger it. I haven't tested this, but I think the reason is that the ESMF state does not have the attribute MAPL_RestartRequired set. If this is the case, and bootstrapping is enabled, then a missing file would be allowed. I am digging around to see where we can set this attribute for the ESMF state.
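The hypothesis above can be written down as a tiny decision function. This is a sketch of the reasoning in this comment only, not of the actual MAPL logic: the function name and return strings are hypothetical, and the one case the comment does not cover is returned as "unspecified" rather than guessed.

```python
def missing_restart_outcome(restart_required: bool, bootstrap_enabled: bool) -> str:
    """Outcome when the restart file is absent, per the untested hypothesis."""
    if restart_required:
        # MAPL_RestartRequired set: the "does not exist" message would fire.
        return "error message"
    if bootstrap_enabled:
        # Attribute not set and bootstrapping on: missing file is tolerated.
        return "missing file allowed"
    # Not described in the comment, so deliberately left open here.
    return "unspecified"


# The GCHP situation suspected above: attribute unset, bootstrapping enabled.
print(missing_restart_outcome(restart_required=False, bootstrap_enabled=True))
```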

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

mathomp4 commented 6 months ago

Hmm. I'm not sure if this has been fixed... or still needs to be? @lizziel Is this still an issue/desire for you?

lizziel commented 6 months ago

@mathomp4, I need to test again to see if I can report more info for you. Stay tuned.

stale[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

lizziel commented 4 months ago

I still want to keep this open as a feature request for MAPL 3. I will test using MAPL 2.26 (what we use in GCHP) and report on whether the current error message can be improved. The criteria are the following:

  1. If a restart file is missing, print an error message that makes it clear the file was not found and which file was being looked for.
  2. If bootstrapping is disabled and a species is not found in the restart file, exit with an error message that clearly states which restart variable was not found.
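As a rough illustration of these two criteria, here is a minimal Python sketch. It is not MAPL code: the function name, its arguments, and the message wording are all hypothetical, and the variable names are borrowed from the bootstrapping log earlier in the thread purely as examples.

```python
def restart_errors(path, file_exists, file_vars, required_vars, bootstrap_enabled):
    """Return error messages for the two criteria; an empty list means OK."""
    if not file_exists:
        # Criterion 1: say the file was not found and name the file.
        return [f"Restart file not found: {path}"]
    missing = sorted(set(required_vars) - set(file_vars))
    if missing and not bootstrap_enabled:
        # Criterion 2: name every restart variable that is absent.
        return [f"Restart variable not found in {path}: {name}" for name in missing]
    # File present, and any missing variables may be bootstrapped.
    return []


errors = restart_errors(
    "gchp_restart.nc4",
    file_exists=True,
    file_vars={"AREA"},
    required_vars={"AREA", "JNO2"},
    bootstrap_enabled=False,
)
for message in errors:
    print(message)
```

Collecting all missing variables before exiting, rather than stopping at the first one, is a design choice made here so that a user can fix the restart file in one pass.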

I'll report back soon on this (famous last words, I know).

tclune commented 4 months ago

Definitely reasonable expectations ...

mathomp4 commented 4 months ago

I'll assign/ping @bena-nasa and @atrayano. They have the best shot.