@lgannoaa To set up this parallel:
1) Use a freshly built copy of feature/ops-wcoss2 for setup, but change HOMEgfs to use the prod install for running (see path below).
2) Make sure config.resources is copied from config.resources.nco.static and config.fv3 is copied from config.fv3.nco.static so you use the same operational resource settings. The emc.dyn versions of those configs aren't useful for this test.
cd into your clone of feature/ops-wcoss2
cd parm/config
cp config.resources.nco.static config.resources
cp config.fv3.nco.static config.fv3
3) Run setup_expt_fcstonly.py with:
./setup_expt_fcstonly.py --pslot EXP_NAME --res 768 --comrot YOUR_COMROT --expdir YOUR_EXPDIR --idate 2022052100 --edate 2022061500 --configdir ../../parm/config --gfs_cyc 4 --start warm
4) Set the following in your newly generated EXPDIR config.base before running setup_workflow_fcstonly.py:
export HOMEgfs=/lfs/h1/ops/prod/packages/gfs.v16.2.0
export DO_BUFRSND="YES" # BUFR sounding products
export DO_GEMPAK="YES" # GEMPAK products
export DO_AWIPS="YES" # AWIPS products
export WAFSF="NO" # WAFS products
export FHMAX_HF_GFS=120
export DO_METP="NO"
5) Open your config.vrfy and update the tracker section to set this on WCOSS2:
export HOMEens_tracker=$BASE_GIT/TC_tracker/v1.1.15.5
6) Then run setup_workflow_fcstonly.py:
./setup_workflow_fcstonly.py --expdir PATH_TO_GENERATED_EXPDIR
7) Update the gfsgetic job to use the specialized job script built by @aerorahul.
Copy new script into your EXPDIR:
cd $EXPDIR
mkdir jobs
cd jobs
cp /lfs/h2/emc/eib/noscrub/Rahul.Mahajan/gfsv16.2/EXPDIR/retro_eval/jobs/getic_tarball.sh .
Open your XML and update the <command> line in the gfsgetic job:
<task name="gfsgetic" cycledefs="gfs" maxtries="&MAXTRIES;">
<command>&EXPDIR;/jobs/getic_tarball.sh</command>
Note: how the ICs are obtained will be changed once the parallel gets closer to real-time and the ICs aren't yet available on HPSS. Speak with @aerorahul at that time.
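For reference, here is a minimal hypothetical sketch of what a getic_tarball.sh-style script does each cycle. The actual script is the one copied above; the tarball name, HPSS path, and environment variables (ROTDIR, PDY, cyc) are illustrative assumptions, not the real runhistory layout:
#!/bin/bash
# Hypothetical sketch only: stage WCOSS1 ICs from HPSS into the parallel's COMROT.
set -eu
cd "${ROTDIR}"                                   # COMROT of this parallel (assumed env var)
# Illustrative tarball name; the real runhistory tar names differ.
hpss_tar="/NCEPPROD/hpssprod/runhistory/rh${PDY:0:4}/${PDY:0:6}/${PDY}/gfs_restart_${PDY}${cyc}.tar"
htar -xvf "${hpss_tar}"                          # extract the RESTART/IC members for this cycle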
Let me know when your parallel is set up and I can give it a quick glance. Thanks!
Modification made to set:
export DO_BUFRSND="NO" # BUFR sounding products
export DO_GEMPAK="NO" # GEMPAK products
export DO_AWIPS="NO" # AWIPS products
export WAFSF="NO" # WAFS products
This forecast-only parallel has been started. The configuration is as follows:
HOMEgfs: /lfs/h1/ops/prod/packages/gfs.v16.2.0
pslot: gfseval_a
expdir: /lfs/h2/emc/global/noscrub/Lin.Gan/expdir/gfseval_a
COM: /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a
log: /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/logs
on-line archive: /lfs/h2/emc/global/noscrub/Lin.Gan/archive/gfseval_a
@aerorahul @KateFriedman-NOAA this parallel has been started. Please check and let me know if you see any issues.
@lgannoaa @aerorahul We likely want to disable scrubbing of the output for a while so MEG can use the output in evaluations. Suggest setting DELETE_COM_IN_ARCHIVE_JOB=NO in config.base. The first round of scrubbing based on RMOLDEND=24 is about to happen @lgannoaa. Can you bump that up to RMOLDEND=144 for now in config.arch while we discuss retention? Thanks!
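For clarity, a minimal sketch of the two settings discussed here (variable names as used elsewhere in this thread):
# in $EXPDIR/config.base: keep the archive job from scrubbing COM
export DELETE_COM_IN_ARCHIVE_JOB="NO"
# in $EXPDIR/config.arch: retain 144 h of cycles online instead of the default 24 h
export RMOLDEND=144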
@aerorahul @KateFriedman-NOAA are we done discussing retention? I need to modify RMOLDEND and re-enable archive job scrubbing before Friday. I prefer not to make major changes to the parallel on a Friday.
As of June 2nd, 6:45 PM EST, the COM is at 41 TB. When it goes over 50 TB, I will set RMOLDEND=24 and DELETE_COM_IN_ARCHIVE_JOB=YES to restore the default cleanup.
Made change in expdir/config.arch: RMOLDEND=24 -> RMOLDEND=144
Made change in expdir/config.base: DELETE_COM_IN_ARCHIVE_JOB="YES" -> DELETE_COM_IN_ARCHIVE_JOB="NO"
With the COM at 56 TB as of cycle 2022052600, the following changes have been made to restore the default COM cleanup:
RMOLDEND=144 -> RMOLDEND=24
DELETE_COM_IN_ARCHIVE_JOB="NO" -> DELETE_COM_IN_ARCHIVE_JOB="YES"
Because this is a forecast-only parallel, the analysis is not run, so the following files do not exist:
${COM}/gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.gsistat
${COM}/gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.pgrb2.1p00.anl
There are two impacts from these missing files: on the archive job and on the verification package (see the replies below).
@lgannoaa Please check if either or both of those files exist in the WCOSS1 production runhistory tapes. If so, the getic_tarball.sh script could be modified to also pull those files off tape alongside the RESTART files each cycle. Thanks!
@KateFriedman-NOAA I am waiting for the verification group to let me know what configuration is acceptable. Using production runhistory analysis files may not be acceptable for this evaluation parallel.
@lgannoaa We can pull the pgbanl.gfs.* files from tape in the getic_tarball.sh script and place them in ARCDIR. I think that should allow verification and METplus to do their evaluation. @malloryprow will that be sufficient?
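A hedged sketch of that approach (the tarball variable and the ARCDIR file naming are assumptions for illustration, not taken from the actual script):
# extract the 1-deg analysis grib member from the runhistory tarball and stage it in ARCDIR
htar -xvf "${hpss_tar}" "./gfs.t${cyc}z.pgrb2.1p00.anl"
cp -p "gfs.t${cyc}z.pgrb2.1p00.anl" "${ARCDIR}/pgbanl.gfs.${CDATE}"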
@aerorahul I'm going to set the config up to use "gfs_anl" for grid2grid stats instead of "self_anl". This will use the pgbanl.gfs.* files in /lfs/h2/emc/vpppg/noscrub/emc.vpppg/verification/global/archive/model_data/gfs, which are being synced from WCOSS.
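For reference, a hedged sketch of what that change could look like, assuming the grid-to-grid settings live in a config such as config.metp with a g2g1_anl_name variable (the variable name is an assumption based on typical global-workflow verification configs, not confirmed in this thread):
# use the synced pgbanl.gfs.* archive instead of the parallel's own (missing) analyses
export g2g1_anl_name="gfs_anl"   # was "self_anl"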
Thank you for the quick reply, @malloryprow. I will modify the archive job to ignore the missing (not needed) analysis files.
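A hypothetical sketch of the kind of guard added to the archive job (the real arch script is not shown in this thread; the paths match the missing files listed above):
# skip the analysis files that a forecast-only parallel never produces
for f in "${COM}/gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.gsistat" \
         "${COM}/gfs.${PDY}/${CYC}/atmos/gfs.t${CYC}z.pgrb2.1p00.anl"; do
  if [[ ! -f "${f}" ]]; then
    echo "NOTE: ${f} missing (forecast-only parallel); skipping"
    continue
  fi
  # ... otherwise add ${f} to the archive file list here ...
done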
@JessicaMeixner-NOAA @aerorahul I see a warning in file cactus:/lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/logs/2022052200/gfsfcst.log (line 2458):
0.747 + [ ! -f /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/gdas.20220521/18/wave/restart/20220521.210000.restart.gnh_10m ]
0.751 + [ ! -f /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/gdas.20220521/18/wave/restart/20220521.210000.restart.aoc_9km ]
0.754 + [ ! -f /lfs/h2/emc/ptmp/Lin.Gan/gfseval_a/gdas.20220521/18/wave/restart/20220521.210000.restart.gsh_15m ]
Message: 'WARNING: NON-FATAL ERROR wave IC is missing, will start from rest'
This finding may change the forecast science output and impact the verification results. Let me know what action to take.
@arunchawla-NOAA indicated this will not affect the atmosphere runs; the wave runs will cold start. Therefore, no action is needed.
@arunchawla-NOAA @aerorahul This parallel is currently set up to run the rocoto workflow in sequential mode, which means each cycle takes around six hours to complete. I can modify the rocoto workflow to run more cycles in parallel (concurrently), which will speed things up and help catch up to realtime. May I proceed?
Yes. Please do multiple concurrent cycles. However, do note that you may hit the file-count quota if too many concurrent cycles are enabled.
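For reference, a hedged sketch of one way to allow concurrent cycles: rocoto's cyclethrottle attribute on the <workflow> tag controls how many cycles may be active at once (the values and scheduler shown are examples only, not the settings actually used in this parallel):
<workflow realtime="F" scheduler="pbspro" cyclethrottle="3" taskthrottle="25">
  ...
</workflow>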
@arunchawla-NOAA @aerorahul With concurrent cycles turned on, I estimate the run will reach PDY-3 later today. PDY-3 looks like a safe offset for downloading the production runhistory ICs. That means I will switch the parallel to run once a day (all four cycles for PDY-3 at once).
@JessicaMeixner-NOAA @aerorahul The gfswavepostpnt, gfswavepostbndpntbll, and gfswavepostbndpnt jobs regularly hit the wall clock. Would it be OK if I increase the job card time?
@lgannoaa that sounds like a smart idea to me. If they're running at the same time we're likely seeing I/O or filesystem issues. I've seen this on WCOSS1 where they run slower for us than for NCO.
The gfswavepostpnt, gfswavepostbndpntbll, and gfswavepostbndpnt job cards have been modified to increase the wall-clock time (see the example below).
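For reference, a hedged example of that kind of edit in the workflow XML; the 4-hour value is an assumption, not the limit actually chosen:
<task name="gfswavepostpnt" cycledefs="gfs" maxtries="&MAXTRIES;">
  ...
  <walltime>04:00:00</walltime>
  ...
</task>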
@arunchawla-NOAA @aerorahul @GeoffManikin-NOAA This parallel is suspended as of June 6th 8:00 AM EST (all cycles prior to 2022060400 are completed) due to the production switch. This parallel will be resumed after the production switch is finished and Dogwood is returned to the developers. The schedule is unknown at this time.
@malloryprow after EIB transfers this parallel from Cactus to Dogwood, we will need your support to ensure no files are missing for running the verification package on Dogwood.
On June 8th at 4:00 PM EST, WCOSS2 Dogwood was returned to the developers. This parallel is in the process of being recovered; the on-line archive has been rsync-ed. This parallel is now resumed with hybrid mode (concurrent cycles running at the same time) turned on. The first cycle on Dogwood is 2022060400.
The NCEP VPN internet went down at 9:00 PM EST on June 8th. The internet was working again after about 30 minutes.
@malloryprow this parallel has resumed on Dogwood. The runs for the first two days, 2022060400-2022060518 (eight cycles), are completed. Please take a look at the files and continue to run the verification jobs. Let me know if you have any questions. This parallel has caught up to near realtime, which is currently PDY-2 on Dogwood.
HPSS will be unavailable on June 14th; archive jobs will be rerun as needed (see the example below).
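A hedged example of rerunning a missed archive task once HPSS is back; the workflow file names, cycle, and task name are illustrative assumptions:
# rewind the failed/missed archive task for one cycle, then boot it again
rocotorewind -w gfseval_a.xml -d gfseval_a.db -c 202206140000 -t gfsarch
rocotoboot   -w gfseval_a.xml -d gfseval_a.db -c 202206140000 -t gfsarch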
On June 28th, @arunchawla-NOAA made the decision to stop running this parallel. The parallel is now stopped and the cron jobs are removed. The last completed cycle is CDATE=2022062318, due to missing production runhistory archive files for 20220624.
This is to document the setup and running of the GFS evaluation parallel on WCOSS2. On May 31st 2022, @arunchawla-NOAA @aerorahul notified:
NCEP management has been concerned with the differences that we are seeing at day 10 on wcoss2 vs wcoss1. We need to start a parallel that can match the wcoss2 parallel but have initial conditions from wcoss1. They would like us to start a parallel where the initial conditions come from wcoss1 and the forecast runs on wcoss2.
@aerorahul Please note the configuration of this parallel: