NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

GFS evaluation parallel in WCOSS2 #830

Closed · lgannoaa closed this issue 2 years ago

lgannoaa commented 2 years ago

This is to document the setup and running of the GFS evaluation parallel on WCOSS2. On May 31st, 2022, @arunchawla-NOAA and @aerorahul sent the following notification:

NCEP management has been concerned with the differences that we are seeing at day 10 on wcoss2 vs wcoss1. We need to start a parallel that can match the wcoss2 parallel but have initial conditions from wcoss1. They would like us to start a parallel where the initial conditions come from wcoss1 and the forecast runs on wcoss2.

@aerorahul Please note the configuration of this parallel, documented in the comments below.

KateFriedman-NOAA commented 2 years ago

@lgannoaa To set up this parallel:

1) Use a freshly built copy of feature/ops-wcoss2 for setup, but change HOMEgfs to use the prod install for running (see path below).

2) Make sure config.resources is copied from config.resources.nco.static and config.fv3 is copied from config.fv3.nco.static so you use the same operational resource settings. The emc.dyn versions of those configs aren't useful for this test.

cd into your clone of feature/ops-wcoss2
cd parm/config
cp config.resources.nco.static config.resources
cp config.fv3.nco.static config.fv3

3) Run setup_expt_fcstonly.py with:

./setup_expt_fcstonly.py --pslot EXP_NAME --res 768 --comrot YOUR_COMROT --expdir YOUR_EXPDIR --idate 2022052100 --edate 2022061500 --configdir ../../parm/config --gfs_cyc 4 --start warm

4) Set the following in your newly generated EXPDIR config.base before running setup_workflow_fcstonly.py:

export HOMEgfs=/lfs/h1/ops/prod/packages/gfs.v16.2.0

export DO_BUFRSND="YES"     # BUFR sounding products
export DO_GEMPAK="YES"      # GEMPAK products
export DO_AWIPS="YES"       # AWIPS products
export WAFSF="NO"          # WAFS products

export FHMAX_HF_GFS=120

export DO_METP="NO"

5) Open your config.vrfy and update the tracker section to set this on WCOSS2:

export HOMEens_tracker=$BASE_GIT/TC_tracker/v1.1.15.5

6) Then run setup_workflow_fcstonly.py:

./setup_workflow_fcstonly.py --expdir PATH_TO_GENERATED_EXPDIR

7) Update the gfsgetic job to use the specialized job script built by @aerorahul:

Copy new script into your EXPDIR:

cd $EXPDIR
mkdir jobs
cd jobs
cp /lfs/h2/emc/eib/noscrub/Rahul.Mahajan/gfsv16.2/EXPDIR/retro_eval/jobs/getic_tarball.sh .

Open your xml and update the <command> line in the gfsgetic job:

<task name="gfsgetic" cycledefs="gfs" maxtries="&MAXTRIES;">

        <command>&EXPDIR;/jobs/getic_tarball.sh</command>

Note: how the ICs are obtained will change once the parallel gets closer to real time and the ICs are not yet available on HPSS. Speak with @aerorahul at that time.

Let me know when your parallel is set up and I can give it a quick glance. Thanks!

lgannoaa commented 2 years ago

Modification made to set:

export DO_BUFRSND="NO"     # BUFR sounding products
export DO_GEMPAK="NO"      # GEMPAK products
export DO_AWIPS="NO"       # AWIPS products
export WAFSF="NO"          # WAFS products

lgannoaa commented 2 years ago

This forecast-only parallel is started. The configuration information is as follows:

HOMEgfs: /lfs/h1/ops/prod/packages/gfs.v16.2.0
pslot: gfseval_a
expdir: /lfs/h2/emc/global/noscrub/Lin.Gan/expdir/gfseval_a
COM: /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a
log: /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/logs
on-line archive: /lfs/h2/emc/global/noscrub/Lin.Gan/archive/gfseval_a

lgannoaa commented 2 years ago

@aerorahul @KateFriedman-NOAA, this parallel has started. Please check it and let me know if you see any issues.

KateFriedman-NOAA commented 2 years ago

@lgannoaa @aerorahul We likely want to disable scrubbing of the output for a while so MEG can use the output in evaluations. Suggest setting DELETE_COM_IN_ARCHIVE_JOB=NO in config.base. The first round of scrubbing based on RMOLDEND=24 is about to happen @lgannoaa. Can you bump that up to RMOLDEND=144 for now in config.arch while we discuss retention? Thanks!

lgannoaa commented 2 years ago

@aerorahul @KateFriedman-NOAA are we done discussing retention? I need to modify RMOLDEND and re-enable archive job scrubbing before Friday; I prefer not to make a major change to the parallel on a Friday. As of June 2nd, 6:45 PM EST, the COM is at 41 TB. When it is over 50 TB, I will set RMOLDEND=24 and DELETE_COM_IN_ARCHIVE_JOB=YES to restore the default cleanup.

lgannoaa commented 2 years ago

Made the following changes:

In expdir/config.arch: RMOLDEND=24 -> RMOLDEND=144
In expdir/config.base: DELETE_COM_IN_ARCHIVE_JOB="YES" -> DELETE_COM_IN_ARCHIVE_JOB="NO"

The COM reached 56 TB as of cycle 2022052600, so the following changes have been made to restore the default COM cleanup:

In expdir/config.arch: RMOLDEND=144 -> RMOLDEND=24
In expdir/config.base: DELETE_COM_IN_ARCHIVE_JOB="NO" -> DELETE_COM_IN_ARCHIVE_JOB="YES"

lgannoaa commented 2 years ago

Because this is a forecast-only parallel, the analysis is not run, so the following files do not exist:

${COM}/gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.gsistat
${COM}/gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.pgrb2.1p00.anl

There are two impacts:

  1. The archive job settings need to be modified to avoid failures.
  2. The METplus and gplots verification package jobs cannot do a true evaluation between WCOSS1 and WCOSS2, which may impact the requirements of this parallel. A conversation within the verification group has been started; we are waiting for a response.

KateFriedman-NOAA commented 2 years ago

The following files do not exist: ${COM}/gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.gsistat, ${COM}/gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.pgrb2.1p00.anl

@lgannoaa Please check if either or both of those files exist in the WCOSS1 production runhistory tapes. If so, the getic_tarball.sh script could be modified to also pull those files off tape alongside the RESTART files each cycle. Thanks!

lgannoaa commented 2 years ago

@KateFriedman-NOAA I am waiting for the verification group to let me know what configuration is acceptable. Using production runhistory analysis files may not be acceptable for this evaluation parallel.

aerorahul commented 2 years ago

@lgannoaa We can pull the pgbanl.gfs.* files from tape in the getic_tarball.sh script and place them in ARCDIR. I think that should allow verification and METplus to do their evaluation. @malloryprow will that be sufficient?
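
For illustration, a minimal sketch of the kind of addition to getic_tarball.sh this would involve; the HPSS tarball path and the ARCDIR file naming below are hypothetical placeholders, not confirmed paths:

# Hypothetical snippet for getic_tarball.sh: also pull the 1.00-degree analysis
# off the WCOSS1 production runhistory tape and stage it in ARCDIR.
HPSS_TAR="/NCEPPROD/hpssprod/runhistory/rh${PDY:0:4}/${PDY:0:6}/${PDY}/com_gfs_prod_gfs.${PDY}_${CYC}.gfs_pgrb2.tar"   # placeholder path
htar -xvf "${HPSS_TAR}" "./gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.pgrb2.1p00.anl"
mkdir -p "${ARCDIR}"
cp "./gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.pgrb2.1p00.anl" "${ARCDIR}/pgbanl.gfs.${PDY}${CYC}"   # target name assumed from the pgbanl.gfs.* convention above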

malloryprow commented 2 years ago

@aerorahul I'm going to set the config up to use "gfs_anl" for the grid-to-grid stats instead of "self_anl". This will use the pgbanl.gfs.* files in /lfs/h2/emc/vpppg/noscrub/emc.vpppg/verification/global/archive/model_data/gfs, which are being synced from WCOSS.
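
For illustration only, a sketch of the sort of switch described here; the variable name is an assumption in the style of the EMC_verif-global configuration, not taken from this thread:

# Hypothetical verification config excerpt (variable name is assumed):
# compare against the operational GFS analysis archive instead of the
# parallel's own (non-existent) analysis.
export g2g1_anl_name="gfs_anl"     # was "self_anl"
# analysis source synced from WCOSS (path from the comment above):
# /lfs/h2/emc/vpppg/noscrub/emc.vpppg/verification/global/archive/model_data/gfs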

lgannoaa commented 2 years ago

Thank you for the quick reply, @malloryprow. I will modify the archive job to ignore the missing (not needed) analysis files.
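
A minimal sketch of one way to make the archive step tolerant of the missing analysis files; the file-list name below is hypothetical and this is not necessarily the actual change that was made:

# Illustrative guard: only add the analysis files to the archive file list when
# they exist, so the tar/htar step does not fail on missing members.
for f in "gfs.t${CYC}z.gsistat" "gfs.t${CYC}z.pgrb2.1p00.anl"; do
  if [ -f "${COM}/gfs.${PDY}/${CYC}/atmos/${f}" ]; then
    echo "./gfs.${PDY}/${CYC}/atmos/${f}" >> gfs_archive_filelist.txt   # hypothetical file-list name
  fi
done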

lgannoaa commented 2 years ago

@JessicaMeixner-NOAA @aerorahul I see a warning in file cactus:/lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/logs/2022052200/gfsfcst.log (line 2458):

0.747 + [ ! -f /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/gdas.20220521/18/wave/restart/20220521.210000.restart.gnh_10m ]
0.751 + [ ! -f /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/gdas.20220521/18/wave/restart/20220521.210000.restart.aoc_9km ]
0.754 + [ ! -f /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/gdas.20220521/18/wave/restart/20220521.210000.restart.gsh_15m ]
Message: 'WARNING: NON-FATAL ERROR wave IC is missing, will start from rest'

This finding may change the forecast science output and impact the verification results. Let me know what action to take. @arunchawla-NOAA indicated this will not affect the atmosphere runs; the wave runs will cold start. Therefore, no action is needed.

lgannoaa commented 2 years ago

@arunchawla-NOAA @aerorahul, this parallel is currently set up to run the rocoto workflow in sequential mode, which means each cycle takes around six hours to complete. I can modify the rocoto workflow to run more cycles concurrently, which will speed it up so it can catch up to real time. May I proceed?

aerorahul commented 2 years ago

@arunchawla-NOAA @aerorahul, this parallel is currently set up to run the rocoto workflow in sequential mode, which means each cycle takes around six hours to complete. I can modify the rocoto workflow to run more cycles concurrently, which will speed it up so it can catch up to real time. May I proceed?

Yes, please run multiple concurrent cycles. However, do note that you may hit the file-count quota if too many concurrent cycles are enabled.
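
For reference, concurrency in a rocoto workflow is typically controlled by throttle attributes on the <workflow> tag; an illustrative header with placeholder values, not the settings used in this parallel:

<!-- Illustrative only: cyclethrottle limits how many cycles may be active at
     once; taskthrottle limits how many tasks may be queued or running. -->
<workflow realtime="F" scheduler="pbspro" cyclethrottle="3" taskthrottle="25">
        ...
</workflow>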

lgannoaa commented 2 years ago

@arunchawla-NOAA @aerorahul With concurrent cycles turned on, I estimate the run will reach PDY-3 later today. PDY-3 looks like a safe lag for downloading the production runhistory ICs. That means I will switch the parallel to run once a day (all four cycles for PDY-3 at once).
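
For context on how a parallel like this is driven (the closing comment in this thread notes that cron jobs were removed when it stopped), a hypothetical crontab entry invoking rocotorun; the XML and database file names are assumed from the pslot:

# Hypothetical crontab entry: advance the gfseval_a workflow every 5 minutes.
*/5 * * * * rocotorun -w /lfs/h2/emc/global/noscrub/Lin.Gan/expdir/gfseval_a/gfseval_a.xml -d /lfs/h2/emc/global/noscrub/Lin.Gan/expdir/gfseval_a/gfseval_a.db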

lgannoaa commented 2 years ago

@JessicaMeixner-NOAA @aerorahul The gfswavepostpnt, gfswavepostbndpntbll, and gfswavepostbndpnt jobs regularly hit their wall clock limits. Would it be OK if I increase the job card time?

JessicaMeixner-NOAA commented 2 years ago

@lgannoaa that sounds like a smart idea to me. If they're running at the same time we're likely seeing I/O or filesystem issues. I've seen this on WCOSS1 where they run slower for us than for NCO.

lgannoaa commented 2 years ago

The gfswavepostpnt, gfswavepostbndpntbll, and gfswavepostbndpnt job cards have been modified to increase their wall clock time.
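
In this rocoto-driven setup, the wall clock for each job comes from the <walltime> tag of the corresponding task in the workflow XML; an illustrative edit with a placeholder value, not the actual time used:

<!-- Illustrative only: increase the wall clock for a wave post point task. -->
<task name="gfswavepostpnt" cycledefs="gfs" maxtries="&MAXTRIES;">

        <walltime>02:00:00</walltime>   <!-- placeholder value -->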

lgannoaa commented 2 years ago

@arunchawla-NOAA @aerorahul @GeoffManikin-NOAA This parallel is suspended as of June 6th at 8:00 AM EST (all cycles prior to 2022060400 are completed) due to the following events:

  1. The WCOSS2 dev machine (Cactus) will be offline today for user account maintenance. It is unclear to the developer when the machine will be back online.
  2. HPSS is scheduled for maintenance tomorrow, June 7th.
  3. The parallel needs to be transferred from Cactus to Dogwood due to the planned production switch on June 8th.

This parallel will be resumed after the production switch is finished and Dogwood is returned to the developers. The schedule is unknown at this time.

@malloryprow after EIB transfers this parallel from Cactus to Dogwood, we will need your support to ensure no files are missing for running the verification package on Dogwood.

lgannoaa commented 2 years ago

On June 8th at 4:00 PM EST, WCOSS2 Dogwood was returned to the developers. This parallel is in the process of being recovered: the on-line archive has been rsync-ed, and the parallel is now resumed with concurrent cycling (multiple cycles running at the same time) turned on. The first cycle on Dogwood is 2022060400.

lgannoaa commented 2 years ago

The NCEP VPN internet went down at 9:00 PM EST on June 8th. Impact:

The internet was working again after about 30 minutes.

lgannoaa commented 2 years ago

@malloryprow this parallel is resumed on Dogwood. The first two days, 2022060400-2022060518 (eight cycles), have completed. Please take a look at the files and continue to run the verification jobs. Let me know if you have any questions. This parallel has caught up to near real time, currently PDY-2 on Dogwood.

lgannoaa commented 2 years ago

HPSS will be unavailable on June 14th; I will rerun archive jobs as needed.
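
A hedged sketch of how an archive task can be rerun with rocoto once HPSS is back; the workflow/database file names, the cycle, and the task name gfsarch are assumptions for illustration:

# Force-submit the archive task for an affected cycle after HPSS returns.
cd /lfs/h2/emc/global/noscrub/Lin.Gan/expdir/gfseval_a
rocotoboot -w gfseval_a.xml -d gfseval_a.db -c 202206140000 -t gfsarch   # names/cycle are placeholders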

lgannoaa commented 2 years ago

On June 28th, @arunchawla-NOAA made the decision to stop running this parallel. The parallel is now stopped and the cron jobs are removed. The last completed cycle is CDATE=2022062318, due to missing production runhistory archive files on 20220624.