NOAA-EMC / EVS


global_ens: add restart capability to atmos grid2grid and precip stats jobs #604

Open GwenChen-NOAA opened 1 week ago

GwenChen-NOAA commented 1 week ago

Description of Changes

This PR adds restart capability to atmos grid2grid and precip stats jobs for the global_ens component. This PR addresses Issue #532.

Developer Questions and Checklist

Testing Instructions

(1) Set up jobs
a. Symlink the EVS_fix directory locally as "fix".
b. Copy the exec directory from the EVS prod package:
cp -r /lfs/h1/ops/prod/packages/evs.v1.0.13/exec $HOMEevs

c. In the driver scripts, edit the following environment variables:

HOMEevs - set to your test EVS directory
COMIN - set to /lfs/h2/emc/vpppg/noscrub/emc.vpppg/${NET}/$evs_ver_2d
COMOUT - set to your test output directory

(2) Run jobs
Run the following jobs in EVS/dev/drivers/scripts/stats/global_ens:

qsub jevs_global_ens_cmce_atmos_grid2grid_stats.sh
qsub jevs_global_ens_ecme_atmos_grid2grid_stats.sh
qsub jevs_global_ens_gefs_atmos_grid2grid_stats.sh
qsub jevs_global_ens_naefs_atmos_grid2grid_stats.sh
qsub jevs_global_ens_cmce_atmos_precip_stats.sh
qsub jevs_global_ens_ecme_atmos_precip_stats.sh
qsub jevs_global_ens_gefs_atmos_precip_stats.sh
qsub jevs_global_ens_naefs_atmos_precip_stats.sh

Check that the log files are free of errors.

(3) Test restart capability
After a successful full run, save the final stat file for comparison. Run the job again and stop it at a random point using qdel. Then resubmit the job using qsub. The resubmitted job should take less time than the full run, and its final stat file should be identical to the one from the full run. The log file from the resubmitted job should also be free of errors.
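The stat-file comparison in step (3) can be scripted. The following is a minimal sketch, assuming the final stat files from the full run and from the resubmitted run have each been saved to a known path. The function name `stat_files_match` is hypothetical and not part of EVS, and byte-for-byte equality is an assumed (strict) definition of "the same".

```python
import filecmp


def stat_files_match(full_run_file, restart_file):
    """Return True when the two files have identical contents."""
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting the file size and modification time alone.
    return filecmp.cmp(full_run_file, restart_file, shallow=False)
```

Call it with the saved full-run file and the file produced by the resubmitted job; a `False` result means the restart did not reproduce the full-run output exactly.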

malloryprow commented 1 week ago

Full Test

Jobs for the full test have been submitted. COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.

jevs_global_ens_cmce_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o206752404
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.206752404.dbqs01

jevs_global_ens_ecme_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o206752480
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.206752480.dbqs01

jevs_global_ens_gefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206752548
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206752548.dbqs01

jevs_global_ens_naefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206752662
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.206752662.dbqs01

jevs_global_ens_cmce_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o206752755
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.206752755.dbqs01

jevs_global_ens_ecme_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o206752988
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.206752988.dbqs01

jevs_global_ens_gefs_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o206753216
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.206753216.dbqs01

jevs_global_ens_naefs_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o206753242
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.206753242.dbqs01

GwenChen-NOAA commented 1 week ago

@malloryprow, please copy the exec directory from EVS prod package (cp -r /lfs/h1/ops/prod/packages/evs.v1.0.13/exec $HOMEevs) and then rerun the following two jobs:

jevs_global_ens_gefs_atmos_grid2grid_stats.sh (may need to increase memory)
jevs_global_ens_naefs_atmos_grid2grid_stats.sh

These two jobs need the exec/evs_g2g_adjustCMC.x file to run. All other jobs have successfully completed.

malloryprow commented 6 days ago

jevs_global_ens_gefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206855987
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206855987.dbqs01

jevs_global_ens_naefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206856015
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206855987.dbqs01

GwenChen-NOAA commented 6 days ago

All jobs have completed successfully. Please do the restart test.

malloryprow commented 6 days ago

Will do!

The memory for jevs_global_ens_cmce_atmos_precip_stats.sh is quite high (mem=100GB) for what it is using (update_job_usage: Memory usage: mem=2568832kb). I think it can be 10GB like the ecme and naefs jobs. Please update the dev driver and ecf script!

malloryprow commented 6 days ago

Restart

I moved /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs into /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/full_test. COMOUT for the restart testing will be /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.

I believe a few restart grid2grid jobs are still running.

jevs_global_ens_cmce_atmos_grid2grid_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o206871666
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.206871666.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o206872712
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.206872712.dbqs01

jevs_global_ens_cmce_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o206872251
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.206872251.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o206872739
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.206872739.dbqs01

jevs_global_ens_ecme_atmos_grid2grid_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o206873673
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.206873673.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o206875508
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.206875508.dbqs01

jevs_global_ens_ecme_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o206873677
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.206873677.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o206873949
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.206873949.dbqs01

jevs_global_ens_gefs_atmos_grid2grid_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206874214
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206874214.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206875487
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206875487.dbqs01

jevs_global_ens_gefs_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o206874217
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.206874217.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o206874402
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.206874402.dbqs01

jevs_global_ens_naefs_atmos_grid2grid_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206874568
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.206874568.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206875290
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.206875290.dbqs01

jevs_global_ens_naefs_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o206874484
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.206874484.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o206875001
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.206875001.dbqs01

malloryprow commented 6 days ago

Noticed one more thing: dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.sh should use select=1:ncpus=1. The ecf script is fine. Sorry I didn't catch that earlier with the memory.

malloryprow commented 6 days ago

I'm looking at the jobs some more now that the restart testing is complete. I'm noticing that the walltime for the restart runs is very close to that of the full runs (the full run times are from today's parallel logs), particularly for the grid2grid runs.

jevs_global_ens_cmce_atmos_grid2grid_stats.sh (full run): resources_used.walltime = 00:29:12
jevs_global_ens_cmce_atmos_grid2grid_stats.sh (interrupted run): resources_used.walltime = 00:12:06
jevs_global_ens_cmce_atmos_grid2grid_stats.sh (restart run): resources_used.walltime = 00:27:07

jevs_global_ens_ecme_atmos_grid2grid_stats.sh (full run): resources_used.walltime = 01:05:23
jevs_global_ens_ecme_atmos_grid2grid_stats.sh (interrupted run): resources_used.walltime = 00:29:47
jevs_global_ens_ecme_atmos_grid2grid_stats.sh (restart run): resources_used.walltime = 00:51:19

jevs_global_ens_gefs_atmos_grid2grid_stats.sh (full run): resources_used.walltime = 00:54:14
jevs_global_ens_gefs_atmos_grid2grid_stats.sh (interrupted run): resources_used.walltime = 00:18:41
jevs_global_ens_gefs_atmos_grid2grid_stats.sh (restart run): resources_used.walltime = 00:56:17

jevs_global_ens_naefs_atmos_grid2grid_stats.sh (full run): resources_used.walltime = 00:19:08
jevs_global_ens_naefs_atmos_grid2grid_stats.sh (interrupted run): resources_used.walltime = 00:07:55
jevs_global_ens_naefs_atmos_grid2grid_stats.sh (restart run): resources_used.walltime = 00:19:08

jevs_global_ens_cmce_atmos_precip_stats.sh (full run): resources_used.walltime = 00:03:33
jevs_global_ens_cmce_atmos_precip_stats.sh (interrupted run): resources_used.walltime = 00:02:26
jevs_global_ens_cmce_atmos_precip_stats.sh (restart run): resources_used.walltime = 00:02:51

jevs_global_ens_ecme_atmos_precip_stats.sh (full run): resources_used.walltime = 00:08:15
jevs_global_ens_ecme_atmos_precip_stats.sh (interrupted run): resources_used.walltime = 00:06:03
jevs_global_ens_ecme_atmos_precip_stats.sh (restart run): resources_used.walltime = 00:07:21

jevs_global_ens_gefs_atmos_precip_stats.sh (full run): resources_used.walltime = 00:05:48
jevs_global_ens_gefs_atmos_precip_stats.sh (interrupted run): resources_used.walltime = 00:04:21
jevs_global_ens_gefs_atmos_precip_stats.sh (restart run): resources_used.walltime = 00:04:57

jevs_global_ens_naefs_atmos_precip_stats.sh (full run): resources_used.walltime = 00:03:32
jevs_global_ens_naefs_atmos_precip_stats.sh (interrupted run): resources_used.walltime = 00:03:22
jevs_global_ens_naefs_atmos_precip_stats.sh (restart run): resources_used.walltime = 00:02:51
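To put the grid2grid numbers on a common scale, here is a small sketch that converts the walltimes quoted above to seconds and reports the restart run as a fraction of the full run. The helper names are illustrative only; the walltime values are the ones listed in this comment.

```python
def walltime_to_seconds(walltime):
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def restart_fraction(full, restart):
    """Restart walltime as a fraction of the full-run walltime."""
    return walltime_to_seconds(restart) / walltime_to_seconds(full)


# (full run, restart run) walltimes for the grid2grid jobs listed above
grid2grid = {
    "cmce": ("00:29:12", "00:27:07"),
    "ecme": ("01:05:23", "00:51:19"),
    "gefs": ("00:54:14", "00:56:17"),
    "naefs": ("00:19:08", "00:19:08"),
}

for model, (full, restart) in grid2grid.items():
    fraction = restart_fraction(full, restart)
    print(f"{model}: restart took {fraction:.0%} of the full run")
```

With these values the restart runs come out at roughly 93%, 78%, 104%, and 100% of the full-run walltime for cmce, ecme, gefs, and naefs, respectively.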

AliciaBentley-NOAA commented 6 days ago

@malloryprow @GwenChen-NOAA I agree. These runtimes indicate that the restart capability is not working as intended. The restart runtimes are often just as long (or longer) than the full run runtimes. Are the restart runs possibly not using the restart files being produced?

GwenChen-NOAA commented 6 days ago

> These runtimes indicate that the restart capability is not working as intended. The restart runtimes are often just as long (or longer) than the full run runtimes. Are the restart runs possibly not using the restart files being produced?

@AliciaBentley-NOAA, the restart runs do use the restart files as intended. @malloryprow made a mistake: the full-run resources_used.walltime for jevs_global_ens_gefs_atmos_grid2grid_stats.sh is 00:57:20 in /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/full_test/jevs_global_ens_gefs_atmos_grid2grid_stats.o206855987, which is longer than the restart run. So the restart runtime of every job is less than or equal to its full runtime.

The benefit of restart is small for the grid2grid jobs, which run the GenEnsProd, EnsembleStat, and GridStat tasks. The GenEnsProd task runs very fast, but the EnsembleStat and GridStat tasks account for most of the runtime. So after any interruption during the EnsembleStat or GridStat tasks, rerunning these two tasks takes about as long as the full run.

GwenChen-NOAA commented 6 days ago

The restart test looks successful to me.

malloryprow commented 5 days ago

So suppose the run has gotten through forecast hour 168 in EnsembleStat and then gets interrupted. When the job is restarted, will it start again at forecast hour 0, or at forecast hour 168, where it was interrupted during EnsembleStat?

I feel NCO is going to keep a close eye on this since it got a waiver for EVS v1.0.

AliciaBentley-NOAA commented 5 days ago

@malloryprow @GwenChen-NOAA Thanks for the discussion.

In operations, the purpose of having restart capabilities is to considerably reduce a job's runtime when that job got part way through running and unexpectedly crashed. When jobs crash part way through running, NCO needs to rerun/finish the job as quickly as possible in order for the ops supercomputer to catch back up to where it should be. For example, if a job that typically takes 1 hour to run crashes at 45 minutes, restart capabilities should allow the job to complete in ~15 minutes when it is rerun. My worry is that NCO will not be satisfied with the restart runtimes in these examples and may even send EVS v2.0 back to us to fix. We'd like to avoid that.
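The 1-hour example above depends on how finely the job is split into restartable units. The sketch below makes that explicit under a simple assumption: a unit that was cut off part way through is rerun from its beginning, and only fully finished units are skipped. The function name and all unit durations are hypothetical, chosen only to total 60 minutes.

```python
def restart_runtime(unit_minutes, interrupted_at):
    """Minutes a restart needs when only fully finished units are skipped.

    `unit_minutes` lists the duration of each restartable unit in run order.
    """
    start = 0
    remaining = 0
    for minutes in unit_minutes:
        end = start + minutes
        if end > interrupted_at:
            # Unit was unfinished (or not yet started) at the interruption,
            # so the restart has to run it in full.
            remaining += minutes
        start = end
    return remaining


# Hypothetical 60-minute job interrupted at minute 45.
coarse = [2, 29, 29]   # three large units, e.g. one per task
fine = [1] * 60        # sixty one-minute units, e.g. one per forecast hour

print(restart_runtime(coarse, 45))
print(restart_runtime(fine, 45))
```

With three large units the restart still needs 29 minutes, while one-minute units bring it down to the 15 minutes that were left, which is the behavior described in the example.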

Do either of you know how the restart capabilities that Gwen added to global_ens differ from the other restart capabilities in EVS that do considerably reduce runtimes? For example, which component of EVS did Gwen use as an example to add these restart capability updates? Examining that EVS component and the code in this PR might reveal where things differ and allow us to get the reduced runtimes that NCO expects. Thanks!

GwenChen-NOAA commented 5 days ago

The forecast hour loop is set within the METplus job (EnsembleStat or GridStat) by setting the VALID_BEG, VALID_END, and VALID_INCREMENT options in the config file. So, if the METplus job is interrupted and then restarted, it will start from VALID_BEG again. This is a limitation of METplus, since the METplus tools are not designed with restart capability.

The restart setup in this PR mimics the restart setup in NARRE restart (PR #465) that @AliciaBentley-NOAA provided to me. I think this is a common setup for all ensemble stats jobs restart.

malloryprow commented 5 days ago

VALID_BEG, VALID_END, and VALID_INCREMENT set the valid times METplus is looping over. LEAD_SEQ sets the forecast hours.
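To make the distinction concrete, here is a small sketch of the nested loop those settings describe: an outer loop over valid times and an inner loop over forecast hours. This is not METplus code. The function name, the YYYYMMDDHH date format, and passing the increment in seconds are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def expand_run_times(valid_beg, valid_end, valid_increment_s, lead_seq_hours):
    """List every (valid time, forecast hour, init time) combination."""
    fmt = "%Y%m%d%H"
    valid = datetime.strptime(valid_beg, fmt)
    end = datetime.strptime(valid_end, fmt)
    step = timedelta(seconds=valid_increment_s)  # must be positive
    run_times = []
    while valid <= end:
        for fhr in lead_seq_hours:
            # The forecast verifying at `valid` with lead `fhr` was
            # initialized fhr hours earlier.
            init = valid - timedelta(hours=fhr)
            run_times.append((valid.strftime(fmt), fhr, init.strftime(fmt)))
        valid += step
    return run_times


for valid, fhr, init in expand_run_times("2024112000", "2024112012", 43200, [0, 12, 24]):
    print(f"valid={valid} fhr={fhr:03d} init={init}")
```

Each printed line is one unit of work, so a restart that tracks only the valid time still repeats every forecast hour for that valid time.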

GwenChen-NOAA commented 5 days ago

> VALID_BEG, VALID_END, and VALID_INCREMENT set the valid times METplus is looping over. LEAD_SEQ sets the forecast hours.

Right, forgot about LEAD_SEQ.

BinbinZhou-NOAA commented 5 days ago

The restart in SREF and HREF could be more complete than that in NARRE. For stats jobs, the restart point is determined by both VALID and FHR. So for global_ens, it should be possible to restart from VALID and FHR too.

Binbin

--

Binbin Zhou

Physical Scientist

Lynker at NOAA/NWS/NCEP/EMC

5830 University Research Ct.

College Park, MD 20740

@.***

301-683-3683

malloryprow commented 5 days ago

Thanks @BinbinZhou-NOAA for that info! I can confirm NARRE is showing similar things.

First is the run from the parallel, second is my test killed after 10 min, and third is the restart run.

jevs_narre_stats.o206930696: resources_used.walltime = 00:18:08
jevs_narre_stats.o206965219: resources_used.walltime = 00:10:08
jevs_narre_stats.o206966160: resources_used.walltime = 00:18:06

AliciaBentley-NOAA commented 5 days ago

Thanks, @GwenChen-NOAA and @malloryprow, for identifying that the merged NARRE restart code also needs to be fixed. I've added NARRE restart back into the EVS v2.0 Fixes and Additions document. CC @BinbinZhou-NOAA @AndrewBenjamin-NOAA

In order to fix restart in this PR, is there a way to make the valid date arrays and the forecast hour arrays dynamic (i.e., variables that can be passed into MET/METplus)? If the arrays are variables that can be set when the job runs, the existing restart files could be read in when a job is restarted, and the valid times and forecast hours that were already completed could be skipped.
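One way to picture this idea is as a filter over the full list of work. The sketch below assumes the completed (valid time, forecast hour) pairs can somehow be recovered from the restart files; how they are recovered is left open, and the function name and sample values are hypothetical.

```python
from itertools import product


def remaining_work(valid_times, forecast_hours, completed):
    """Return the (valid time, forecast hour) pairs still to be run.

    `completed` is a set of pairs recovered from the restart files.
    """
    return [
        (valid, fhr)
        for valid, fhr in product(valid_times, forecast_hours)
        if (valid, fhr) not in completed
    ]


# Hypothetical state after an interruption: hours 0 and 12 already finished.
completed = {("2024112000", 0), ("2024112000", 12)}
todo = remaining_work(["2024112000"], [0, 12, 24, 36], completed)
print(todo)
```

Only the pairs in `todo` would then be handed to MET/METplus on the restarted run.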

GwenChen-NOAA commented 5 days ago

> The restart in SREF and HREF could be more complete than that in NARRE. For stats jobs, the restart point is determined by both VALID and FHR. So for global_ens, it should be possible to restart from VALID and FHR too.

Thanks, @BinbinZhou-NOAA! Could you point me to the restart PR for SREF or HREF?

BinbinZhou-NOAA commented 5 days ago

Gwen,

The old version of SREF restart only starts from VALID, but the new version of SREF restarts from both VALID and FHR (updated in the new sref_fixes PR #607, https://github.com/NOAA-EMC/EVS/pull/607). You can refer to the file:

https://github.com/NOAA-EMC/EVS/blob/develop/ush/mesoscale/evs_sref_grid2obs.sh

The restart for both VALID and FHR in HREF will be added in the next HREF PR.

Binbin

BinbinZhou-NOAA commented 5 days ago

Alicia,

There is no need to pass the lead time array into the METplus conf files. Just add an additional loop over lead time (FHR) in the ush file.

Binbin

AliciaBentley-NOAA commented 5 days ago

@BinbinZhou-NOAA Thank you! I don't mind how we do the specifics of it as long as the runtimes of the restarted jobs are considerably shorter than the runtimes of the full jobs! :)

@GwenChen-NOAA Here is the previous SREF restart PR, if you want to isolate the changes that Binbin made: https://github.com/NOAA-EMC/EVS/pull/475/files

malloryprow commented 5 days ago

Looking at the job files that get run, the fhr looping could also be done there.

Whichever way you want to do it @GwenChen-NOAA. If you want to meet to talk through ideas or brainstorm, let me know!

BinbinZhou-NOAA commented 5 days ago

Alicia, Gwen,

Just split the lead time array into individual fhrs, that is, run METplus once for each fhr. The old version sets the lead time as a list of fhrs (takes longer); the new version sets the lead time to just one fhr (takes less time). In this case, the total number of jobs increases significantly, but the total running time does not increase (the CPU and memory settings do not change). For regional ensembles, not only is the lead time split; within each fhr, the different processes (GenEnsProd, EnsembleStat, and GridStat, or PointStat) are also specified separately. But for global_ens, since the total number of lead times is too large, the processes within each fhr can be combined under one "completed" mark file. This is my suggestion.
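A minimal sketch of this suggestion, assuming one "completed" mark file per forecast hour: each hour is run on its own, and an hour whose mark file already exists is skipped on restart. This is not the EVS implementation. `run_with_restart`, the marker file naming, and the `run_one_fhr` callable (a stand-in for one METplus invocation with a single lead time) are all hypothetical.

```python
from pathlib import Path


def run_with_restart(forecast_hours, restart_dir, run_one_fhr):
    """Run each forecast hour once, skipping hours already marked complete.

    `run_one_fhr` is a hypothetical callable standing in for all processes
    of one forecast hour (GenEnsProd, EnsembleStat, and GridStat).
    """
    restart_dir = Path(restart_dir)
    restart_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for fhr in forecast_hours:
        marker = restart_dir / f"fhr{fhr:03d}.completed"
        if marker.exists():
            continue  # this hour finished before the interruption
        run_one_fhr(fhr)
        # Write the marker only after every process for this hour succeeded,
        # so an interrupted hour is rerun in full on restart.
        marker.touch()
        ran.append(fhr)
    return ran
```

If a first call is interrupted while running hour 24, a second call with the same `restart_dir` skips the hours that already have mark files and returns only the hours it actually ran. Using one mark file per hour, rather than one per process, follows the point above about the large number of lead times in global_ens.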

Binbin

GwenChen-NOAA commented 5 days ago

@BinbinZhou-NOAA, thanks for the suggestion! I will give it a try.