Open yantosca opened 8 months ago
@yantosca I can clean this up and provide a better message.
Yes - we generally want any failures related to files to identify the file involved. But ... we have been lazy in that we wait for users to identify specific scenarios.
Bob - do you need a backport to a specific MAPL version?
@tclune: I think we are using MAPL 2.26 for now (@lizziel can confirm). We will update to MAPL 3 when it is ready.
Yes, we are using 2.26.
Well, that will be a while in the future and a big change. :)
But dang. MAPL 2.26. I haven't heard that in a long time! (As I get ready to release MAPL 2.42...)
@yantosca @lizziel For the first part, the case where the file simply does not exist: I just ran an experiment and removed the dynamics restart (in our case it is named fvcore_internal_rst). The code does appear to print a message saying the restart is not there. This is what I see:
ERROR: Required restart fvcore_internal_rst does not exist!
pe=00000 FAIL at line=06173 MAPL_Generic.F90 <unknown error>
pe=00008 FAIL at line=06173 MAPL_Generic.F90 <unknown error>
pe=00008 FAIL at line=01673 MAPL_Generic.F90 <unknown error>
pe=00008 FAIL at line=01107 MAPL_Generic.F90 <status=-1>
You see it saying that dynamics is trying to be initialized because reading the restart is part of initialization. This message is in an if block: if the file does not exist, the code prints that message and returns a failure. That failure then cascades up the stack and, in this case, ends up in the initialize of dynamics. https://github.com/GEOS-ESM/MAPL/blob/v2.26.0/generic/MAPL_Generic.F90#L5983
Did you perhaps see this message and miss it? MAPL 2.26 already has this message, so it is not something added after that version.
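To make the failure cascade described above concrete, here is a minimal Python sketch of the same pattern. It is not MAPL code (MAPL is Fortran); the names `RestartError`, `read_restart`, `initialize_component`, and `demo` are hypothetical and exist only for illustration. It shows why an error raised while reading the restart ends up being reported from the component's initialize step.

```python
import os
import tempfile


class RestartError(Exception):
    """Raised when a required restart file cannot be found (hypothetical)."""


def read_restart(path):
    # Analogue of the "if block": check for the file first and fail with
    # a message that names it.
    if not os.path.isfile(path):
        raise RestartError(f"Required restart {path} does not exist!")
    with open(path, "rb") as handle:
        return handle.read()


def initialize_component(name, restart_path):
    # Reading the restart is part of initialization, so a missing file
    # surfaces here, one level up the stack.
    try:
        return read_restart(restart_path)
    except RestartError as err:
        raise RuntimeError(f"{name} initialize failed: {err}") from err


def demo():
    # Use a fresh temporary directory so the restart is guaranteed missing.
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "fvcore_internal_rst")
        try:
            initialize_component("DYN", missing)
        except RuntimeError as err:
            return str(err)
    return "initialized"
```

The point of the sketch is that the outermost message only stays useful if the innermost one (the file name) is carried along with it.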
gchp.20190101_0000z.log:1:Restart symlink gchp_restart.nc4 set to ./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4
gchp.20190101_0000z.log:17: NOT using buffer I/O for file: cap_restart
gchp.20190101_0000z.log:18: CAP: INFO: Read CAP restart properly, Current Date = 2019/01/01
gchp.20190101_0000z.log:275: Character Resource Parameter: GCHPchem_INTERNAL_RESTART_FILE:gchp_restart.nc4
gchp.20190101_0000z.log:277: Using parallel NetCDF for file: gchp_restart.nc4
gchp.20190101_0000z.log:278: Bootstrapping Variable: ARCHV_DRY_TOTN in gchp_restart.nc4
gchp.20190101_0000z.log:279: Bootstrapping Variable: ARCHV_WET_TOTN in gchp_restart.nc4
gchp.20190101_0000z.log:280: Bootstrapping Variable: AREA in gchp_restart.nc4
gchp.20190101_0000z.log:281: Bootstrapping Variable: AeroH2O_SNA in gchp_restart.nc4
gchp.20190101_0000z.log:282: Bootstrapping Variable: DEP_RESERVOIR in gchp_restart.nc4
gchp.20190101_0000z.log:283: Bootstrapping Variable: DRYPERIOD in gchp_restart.nc4
gchp.20190101_0000z.log:284: Bootstrapping Variable: DryDepNitrogen in gchp_restart.nc4
gchp.20190101_0000z.log:285: Bootstrapping Variable: GCCTROPP in gchp_restart.nc4
gchp.20190101_0000z.log:286: Bootstrapping Variable: GWET_PREV in gchp_restart.nc4
gchp.20190101_0000z.log:287: Bootstrapping Variable: H2O2AfterChem in gchp_restart.nc4
gchp.20190101_0000z.log:288: Bootstrapping Variable: JNO2 in gchp_restart.nc4
gchp.20190101_0000z.log:289: Bootstrapping Variable: JOH in gchp_restart.nc4
gchp.20190101_0000z.log:290: Bootstrapping Variable: KPPHvalue in gchp_restart.nc4
gchp.20190101_0000z.log:291: Bootstrapping Variable: LAI_PREVDAY in gchp_restart.nc4
gchp.20190101_0000z.log:292: Bootstrapping Variable: ORVCSESQ in gchp_restart.nc4
gchp.20190101_0000z.log:293: Bootstrapping Variable: PARDF_DAVG in gchp_restart.nc4
gchp.20190101_0000z.log:294: Bootstrapping Variable: PARDR_DAVG in gchp_restart.nc4
gchp.20190101_0000z.log:295: Bootstrapping Variable: PFACTOR in gchp_restart.nc4
gchp.20190101_0000z.log:296: Bootstrapping Variable: SO2AfterChem in gchp_restart.nc4
gchp.20190101_0000z.log:297: Bootstrapping Variable: STATE_PSC in gchp_restart.nc4
gchp.20190101_0000z.log:298: Bootstrapping Variable: T_DAVG in gchp_restart.nc4
gchp.20190101_0000z.log:299: Bootstrapping Variable: T_PREVDAY in gchp_restart.nc4
gchp.20190101_0000z.log:300: Bootstrapping Variable: WetDepNitrogen in gchp_restart.nc4
slurm-6471802.out:1:++ sed 's/ /_/g' cap_restart
slurm-6471802.out:31:++ Require_Species_in_Restart=0
slurm-6471802.out:122:++ print_msg 'WARNING: write restarts by o-server is disabled since <1000 cores'
slurm-6471802.out:124:++ replace_val WRITE_RESTART_BY_OSERVER NO GCHP.rc
slurm-6471802.out:125:++ KEY=WRITE_RESTART_BY_OSERVER
slurm-6471802.out:131:++ sed 's|^\([\t ]*WRITE_RESTART_BY_OSERVER[\t ]*:[\t ]*\).*|\1NO|' GCHP.rc
slurm-6471802.out:511:++ print_msg 'Initial restart settings:'
slurm-6471802.out:515:++ replace_val INITIAL_RESTART_SPECIES_REQUIRED 0 GCHP.rc
slurm-6471802.out:516:++ KEY=INITIAL_RESTART_SPECIES_REQUIRED
slurm-6471802.out:522:++ sed 's|^\([\t ]*INITIAL_RESTART_SPECIES_REQUIRED[\t ]*:[\t ]*\).*|\10|' GCHP.rc
slurm-6471802.out:551:+ source setRestartLink.sh
slurm-6471802.out:552:++ rst_link_name=gchp_restart.nc4
slurm-6471802.out:553:++ '[' -f cap_restart ']'
slurm-6471802.out:554:+++ sed 's/ /_/g' cap_restart
slurm-6471802.out:560:++ rst_target=./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4
slurm-6471802.out:561:++ [[ -f ./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4 ]]
slurm-6471802.out:562:++ ln -nsf ./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4 gchp_restart.nc4
slurm-6471802.out:563:++ echo 'Restart symlink gchp_restart.nc4 set to ./Restarts/GEOSChem.Restart.20190101_0000z.c24.nc4'
slurm-6471802.out:573:Shell debugging restarted
slurm-6471802.out:586:Shell debugging restarted
slurm-6471802.out:596:Shell debugging restarted
slurm-6471802.out:606:Shell debugging restarted
slurm-6471802.out:616:Shell debugging restarted
slurm-6471802.out:626:Shell debugging restarted
slurm-6471802.out:636:Shell debugging restarted
slurm-6471802.out:682:Shell debugging restarted
@bena-nasa. I've grepped for "restart" in the *.log and *.out files and haven't seen this message. Maybe it isn't in this version.
Log files from the simulation: allPEs.log.txt.log gchp.20190101_0000z.log.txt.log logfile.000000.out.txt slurm-6471802.out.txt
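For anyone who wants to repeat the search above without a shell, here is a small Python sketch that produces the same `file:line:text` layout as the listing. The function name `grep_logs` is hypothetical. It assumes a case-insensitive match, which the listing suggests since lines containing only "Restart" were also reported.

```python
import re
from pathlib import Path


def grep_logs(directory, pattern, globs=("*.log", "*.out")):
    """Return 'file:lineno:text' for each matching line, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for glob in globs:
        for path in sorted(Path(directory).glob(glob)):
            # errors="replace" keeps the scan going if a log has odd bytes.
            with open(path, errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if regex.search(line):
                        hits.append(f"{path.name}:{lineno}:{line.rstrip()}")
    return hits
```

Calling `grep_logs(".", "does not exist")` in the run directory would be a quick way to check whether the MAPL message quoted earlier appears in any of the logs.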
@yantosca, the slurm log file you posted contains this which indicates it is a problem with imports. I wonder if the logs you posted are from a different run?
pe=00000 FAIL at line=00803 ExtDataGridCompMod.F90 <Found 157 unfulfilled imports in extdata>
I looked into this and see we have the code in MAPL that would print a file not found message: https://github.com/geoschem/MAPL/blob/277e83f60878a7c896757d8d41eab90a0cb2bab3/generic/MAPL_Generic.F90#L5983-L5990
I think what is happening is that we don't meet the criteria to trigger it. I haven't tested this, but I think it is because the ESMF state does not have the attribute MAPL_RestartRequired set. If this is the case, and bootstrapping is enabled, then a missing file would be allowed. I am digging around to see where we can set this attribute for the ESMF state.
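The hypothesis above can be written down as a small decision rule. This is only a sketch of the suspected behavior, not a transcription of MAPL_Generic.F90: the function name `missing_restart_is_fatal` and its two boolean parameters are hypothetical, and the branch for "attribute not set, bootstrapping disabled" is an assumption, since the comment only covers the case where bootstrapping is enabled.

```python
def missing_restart_is_fatal(restart_required, bootstrap_enabled):
    """Hypothetical rule for how a missing restart file is treated.

    restart_required  -- stands in for the MAPL_RestartRequired attribute
    bootstrap_enabled -- stands in for the bootstrapping setting
    """
    if restart_required:
        # Attribute set: a missing file should always stop the run.
        return True
    # Attribute not set: the file may be skipped when bootstrapping is on.
    # Treating "bootstrapping off" as fatal is an assumption of this sketch.
    return not bootstrap_enabled
```

Under this reading, `missing_restart_is_fatal(False, True)` is the combination that would let a run proceed silently without its restart file, which matches the symptom reported in this issue.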
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.
Hmm. I'm not sure if this has been fixed...or still needs to be? @lizziel Is this still an issue/desire for you?
@mathomp4, I need to test again to see if I can report more info for you. Stay tuned.
I still want to keep this open as a feature request for MAPL 3. I will test using MAPL 2.26 (what we use in GCHP) and report on whether the current error message can be improved. The criteria are the following:
I'll report back soon on this (famous last words, I know).
Definitely reasonable expectations ...
I'll assign/ping @bena-nasa and @atrayano . They have the best shot.
I was doing some work in GCHP and found the following MAPL-related issues:
If the initial restart file is missing, the simulation will crash but not print out a message such as "Restart/checkpoint file not found". The last thing the logging output shows is that the Dynamics is trying to be initialized.
If the initial restart file is present but bootstrapping of missing species is disabled, the same thing happens: the run dies without an appropriate error message.
It would be great if there could be a graceful exit with an error message to let users know that the issue is in the restart file. This would save a lot of time and energy in debugging.
Tagging @lizziel
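Until the model itself reports the problem, one possible stopgap is a pre-flight check in the run script. The sketch below is a hypothetical helper, not part of GCHP or MAPL: `check_restart_link` is an invented name, and `gchp_restart.nc4` is simply the link name that appears in the logs in this thread. It distinguishes a dangling symlink from a file that is absent altogether and names the file in both cases.

```python
import os


def check_restart_link(link_name):
    """Return None if link_name resolves to a regular file, else an error message."""
    if os.path.isfile(link_name):  # isfile follows symlinks
        return None
    if os.path.islink(link_name):
        # The link exists but its target does not.
        target = os.readlink(link_name)
        return f"ERROR: restart link {link_name} points to missing file {target}"
    return f"ERROR: restart file {link_name} not found"
```

A run script could call `msg = check_restart_link("gchp_restart.nc4")` and, if `msg` is not `None`, pass it to `sys.exit(msg)`, which prints the message to stderr and exits with status 1 before the model is launched.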