Closed DavidHuber-NOAA closed 1 week ago
@InnocentSouopgui-NOAA after I have tested this on a couple machines, would you be able to test a forecast-only and cycled case on S4 and Jet? I'll let you know when it is ready to go.
Sure, let me know when it is ready to test on S4.
@InnocentSouopgui-NOAA This PR is now ready to be tested on S4 and Jet. Would you mind running some cycled tests on both? I'm not sure if an S2SW test is possible on Jet, but if so, could you also run a forecast-only case there as well?
I will get on it today, and check if S2SW is available on Jet.
@DavidHuber-NOAA, I have a couple of failing tasks on S4 that I am investigating. enkfgdasediag fails silently (the job submission fails and the script does not detect it), so the dependent task enkfgdaseupd also fails. Below is the error message from the submission of enkfgdasediag; you may be able to pinpoint the problem quickly. gfsfcst is failing as well, but the error message and the point of failure differ from run to run.
+ exglobal_diag.sh[225]: ncmd_max=32
++ exglobal_diag.sh[226]: eval echo srun -l --export=ALL -n '$ncmd' --multi-prog --output=mpmd.%j.%t.out
+++ exglobal_diag.sh[226]: echo srun -l --export=ALL -n 34 --multi-prog --output=mpmd.%j.%t.out
+ exglobal_diag.sh[226]: APRUNCFP_DIAG='srun -l --export=ALL -n 34 --multi-prog --output=mpmd.%j.%t.out'
+ exglobal_diag.sh[227]: srun -l --export=ALL -n 34 --multi-prog --output=mpmd.%j.%t.out /scratch/users/isouopgui/RUNDIRS/test30/ediag.234500/mp_diag.sh
srun: Warning: can't honor --ntasks-per-node set to 32 which doesn't match the requested tasks 34 with the number of requested nodes 2. Ignoring --ntasks-per-node.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: Exceeded job memory limit
srun: error: s4-204-c4: tasks 0,5,11-16: Killed
srun: Terminating job step 25889898.0
srun: error: s4-204-c6: tasks 17-28,30-33: Killed
@InnocentSouopgui-NOAA I increased the memory request for the ediag job on S4. See if that works for you.
What resolution(s) are you running the fcst at when it fails? Is there any indication that it may be an out-of-memory issue? If not, perhaps there is a node issue?
enkfgdasediag succeeds with the update. gfsfcst is still failing.
I can't pinpoint what is going wrong. Looking at the logfile, the ufs executable itself seems to run to the end. The only error message I find toward the end of the logfile is
<gw>/ush/forecast_postdet.sh: line 245: restart_dates[@]: unbound variable
I am not getting that message on Jet, so it looks like a setting on S4, or a missing setting, is causing the problem.
If you want to have a look, the logfiles are at /scratch/users/isouopgui/test/test30/logs/2021110900
On S4, at resolution 192/96, upp tasks are failing because of OpenMP over-allocation of resources. What would be the best place to set OMP_NUM_THREADS to prevent that?
I see a couple of possibilities:
1) Explicitly set OMP_NUM_THREADS for UPP tasks; for instance, there is a variable NTHREADS_UPP computed for UPP tasks that could be used to set OMP_NUM_THREADS right there in the env files;
2) Set OMP_NUM_THREADS to 1 early, and let tasks that rely on OpenMP set it when they run;
3) A solution for S4 only that uses either of the above.
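A minimal sketch of the first option. NTHREADS_UPP is real in the workflow, but the exact env-file wiring below is illustrative, not the actual env/S4.env contents:

```shell
# Illustrative sketch of option 1: pin OMP_NUM_THREADS for the UPP step
# from the computed thread count, defaulting to a single thread if
# NTHREADS_UPP was never set. (Variable wiring is an assumption.)
export NTHREADS_UPP="${NTHREADS_UPP:-1}"
export OMP_NUM_THREADS="${NTHREADS_UPP}"
echo "UPP step will run with OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

The `${NTHREADS_UPP:-1}` default keeps the setting harmless on machines where the scheduler already exports a correct thread count.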
@InnocentSouopgui-NOAA Could you try rocotobooting the job again? The run directory (/scratch/users/isouopgui/RUNDIRS/test30/gfsfcst.2021110900/fcst.115872) no longer exists. Thanks.
@InnocentSouopgui-NOAA That is strange that the UPP is attempting to run threaded. I suspect this is an issue with sbatch on S4 not properly picking up the thread count from rocoto. I think your first solution of adding it to env/S4.env is the right way to go, as I have not seen this on any other system.
Done.
@InnocentSouopgui-NOAA It looks like the job failed because the inputs for the gfsfcst are no longer available:
FATAL ERROR: Cold start ICs are missing from '/scratch/users/isouopgui/test/test30/gfs.20211109/00//model_data/atmos/input'
Perhaps you could try rebooting the last gfsfcst (for cycle 202111110000)?
It's a little trickier. What you saw is the error message from the third attempt, which differs from the error message from the first attempt. Have a look at the first attempt: gfsfcst is failing in every run so far on S4. However, the failure does not seem to affect subsequent tasks and cycles.
The only error message I find in the logfile from the first attempt is the following:
/ush/forecast_postdet.sh: line 245: restart_dates[@]: unbound variable
It looks like S4 is not happy with restart_dates[@] when the array is empty. That is my suspicion for now.
@InnocentSouopgui-NOAA I see now, thanks for explaining that again for me. So this looks like an issue with bash versions. The interpreters on Hera, Hercules, etc. are all newer than S4's and treat expansions of empty arrays as valid references, but S4's version treats them as unbound variables. The way around this is to add a check for an empty array before running the for loop.
Fix
  restart_dates=()
  # Copy restarts in the assimilation window for RUN=gdas|enkfgdas|enkfgfs
  if [[ "${RUN}" =~ "gdas" || "${RUN}" == "enkfgfs" ]]; then
    restart_date="${model_start_date_next_cycle}"
    while (( restart_date <= forecast_end_cycle )); do
      restart_dates+=("${restart_date:0:8}.${restart_date:8:2}0000")
      restart_date=$(date --utc -d "${restart_date:0:8} ${restart_date:8:2} + ${restart_interval} hours" +%Y%m%d%H)
    done
  elif [[ "${RUN}" == "gfs" || "${RUN}" == "gefs" ]]; then  # Copy restarts at the end of the forecast segment for RUN=gfs|gefs
    if [[ "${COPY_FINAL_RESTARTS}" == "YES" ]]; then
      restart_dates+=("${forecast_end_cycle:0:8}.${forecast_end_cycle:8:2}0000")
    fi
  fi

  ### Check that there are restart files to copy
  if [[ ${#restart_dates[@]} -gt 0 ]]; then
    # Get list of FV3 restart files
    local file_list fv3_file
    file_list=$(FV3_restarts)

    # Copy restarts for the dates collected above to COM
    for restart_date in "${restart_dates[@]}"; do
      echo "Copying FV3 restarts for 'RUN=${RUN}' at ${restart_date}"
      for fv3_file in ${file_list}; do
        ${NCP} "${DATArestart}/FV3_RESTART/${restart_date}.${fv3_file}" \
          "${COMOUT_ATMOS_RESTART}/${restart_date}.${fv3_file}"
      done
    done
    echo "SUB ${FUNCNAME[0]}: Output data for FV3 copied"
  fi
}
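For reference, the pitfall and the guard can be reproduced in a standalone snippet (a sketch; only the array name mirrors the script). With `set -u` (nounset), bash versions before 4.4 abort on `"${restart_dates[@]}"` when the array is empty, so checking the element count first sidesteps the expansion entirely:

```shell
#!/usr/bin/env bash
# Demonstrates the empty-array guard: on bash < 4.4 with 'set -u',
# expanding "${restart_dates[@]}" for an empty array raises
# "restart_dates[@]: unbound variable"; the count check avoids
# ever performing that expansion.
set -u
restart_dates=()
if (( ${#restart_dates[@]} > 0 )); then
  for restart_date in "${restart_dates[@]}"; do
    echo "would copy restarts for ${restart_date}"
  done
fi
echo "guard passed: nothing to copy"
```

Note that `${#restart_dates[@]}` (element count) is safe under nounset on all bash versions, which is what makes the guard portable.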
@InnocentSouopgui-NOAA I have put the fix in place. Can you check it again on S4? You'll probably need to restart the experiment completely.
I successfully ran a couple of tests on S4 and Jet.
I submitted a pull request on @DavidHuber-NOAA's global-workflow repo with a couple of modifications I needed to get things running smoothly.
CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2672
CI Update on Wcoss2 at 06/23/24 06:36:08 AM
============================================
Cloning and Building global-workflow PR: 2672
with PID: 139980 on host: dlogin08
Automated global-workflow Testing Results:
Machine: Wcoss2
Start: Sun Jun 23 06:40:00 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/23/24 07:14:33 AM
Case setup: Completed for experiment C48_ATM_cb03a525
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_cb03a525
Case setup: Skipped for experiment C48_S2SWA_gefs_cb03a525
Case setup: Completed for experiment C48_S2SW_cb03a525
Case setup: Completed for experiment C96_atm3DVar_extended_cb03a525
Case setup: Skipped for experiment C96_atm3DVar_cb03a525
Case setup: Skipped for experiment C96_atmaerosnowDA_cb03a525
Case setup: Completed for experiment C96C48_hybatmDA_cb03a525
Case setup: Completed for experiment C96C48_ufs_hybatmDA_cb03a525
Experiment C96C48_ufs_hybatmDA_cb03a525 FAIL on Wcoss2 at 06/23/24 08:12:27 AM
Error logs:
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_cb03a525/logs/2024022400/enkfgdasatmensanlfv3inc.log
Follow link here to view the contents of the above file(s): (link)
CI Passed Hercules at
Built and ran in directory /work2/noaa/stmp/CI/HERCULES/2672
Fixed an issue with the atmensanlfv3inc declaration in config.resources that was causing the job to fail on WCOSS2. Relaunching WCOSS2 CI.
CI Update on Wcoss2 at 06/23/24 11:09:52 AM
============================================
Cloning and Building global-workflow PR: 2672
with PID: 5467 on host: dlogin08
Automated global-workflow Testing Results:
Machine: Wcoss2
Start: Sun Jun 23 11:14:49 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/23/24 11:50:02 AM
Case setup: Completed for experiment C48_ATM_184d2359
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_184d2359
Case setup: Skipped for experiment C48_S2SWA_gefs_184d2359
Case setup: Completed for experiment C48_S2SW_184d2359
Case setup: Completed for experiment C96_atm3DVar_extended_184d2359
Case setup: Skipped for experiment C96_atm3DVar_184d2359
Case setup: Skipped for experiment C96_atmaerosnowDA_184d2359
Case setup: Completed for experiment C96C48_hybatmDA_184d2359
Case setup: Completed for experiment C96C48_ufs_hybatmDA_184d2359
Experiment C48_ATM_184d2359 SUCCESS on Wcoss2 at 06/23/24 01:00:14 PM
Experiment C48_S2SW_184d2359 SUCCESS on Wcoss2 at 06/23/24 01:12:34 PM
Experiment C96C48_hybatmDA_184d2359 SUCCESS on Wcoss2 at 06/23/24 02:04:15 PM
Experiment C96C48_ufs_hybatmDA_184d2359 SUCCESS on Wcoss2 at 06/23/24 02:12:19 PM
Experiment C96_atm3DVar_extended_184d2359 FAIL on Wcoss2 at 06/23/24 08:04:19 PM
Error logs:
Follow link here to view the contents of the above file(s): (link)
None of the gempak jobs ran on WCOSS2 even though the dependencies are satisfied:
WCOSS2 (BACKUPSYS) C96_atm3DVar_extended_184d2359> rocotocheck -d C96_atm3DVar_extended_184d2359.db -w C96_atm3DVar_extended_184d2359.xml -t gdasgempak -c 202112210000
Task: gdasgempak
account: GFS-DEV
command: /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/global-workflow/jobs/rocoto/gempak.sh
cores: 2
cycledefs: gdas
final: false
jobname: C96_atm3DVar_extended_184d2359_gdasgempak_00
join: /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/COMROOT/C96_atm3DVar_extended_184d2359/logs/2021122100/gdasgempak.log
maxtries: 2
memory: 4GB
name: gdasgempak
nodes: 1:ppn=2:tpp=
queue: dev
throttle: 9999999
walltime: 03:00:00
environment
CDATE ==> 2021122100
CDUMP ==> gdas
COMROOT ==> /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/COMROOT
DATAROOT ==> /lfs/h2/emc/stmp/terry.mcguinness/RUNDIRS/C96_atm3DVar_extended_184d2359
EXPDIR ==> /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/EXPDIR/C96_atm3DVar_extended_184d2359
HOMEgfs ==> /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/global-workflow
NET ==> gfs
PDY ==> 20211221
RUN ==> gdas
RUN_ENVIR ==> emc
cyc ==> 00
dependencies
SOME is satisfied
gdasatmos_prod_f000 of cycle 202112210000 is SUCCEEDED
gdasatmos_prod_f003 of cycle 202112210000 is SUCCEEDED
gdasatmos_prod_f006 of cycle 202112210000 is SUCCEEDED
gdasatmos_prod_f009 of cycle 202112210000 is SUCCEEDED
Cycle: 202112210000
Valid for this task: YES
State: active
Activated: 2024-06-23 11:55:17 UTC
Completed: -
Expired: -
Job: This task has not been submitted for this cycle
Task can not be submitted because:
Is the job specification malformed, preventing submission to torque?
@WalterKolczynski-NOAA Looking at the rocoto log file /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/EXPDIR/C96_atm3DVar_extended_184d2359/logs/2021122100.log, I am seeing the message
2220 2024-06-23 19:50:13 +0000 :: dlogin09 :: Submission of gdasgempak failed! qsub: Unknown resource: tpp
2221 Job submit error: 187.
Looking in the xml, I see
<task name="gdasgempak" cycledefs="gdas" maxtries="&MAXTRIES;">
<command>/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/global-workflow/jobs/rocoto/gempak.sh</command>
<jobname><cyclestr>C96_atm3DVar_extended_184d2359_gdasgempak_@H</cyclestr></jobname>
<account>GFS-DEV</account>
<queue>dev</queue>
<walltime>03:00:00</walltime>
<nodes>1:ppn=2:tpp=</nodes>
So it seems tpp is not being set. Looking in config.resources, I see the issue:
"gempak")
    export wtime_gempak="03:00:00"
    export npe_gempak_gdas=2
    export npe_gempak_gfs=28
    export npe_node_gempak_gdas=2
    export npe_node_gempak_gfs=28
    export nth_gempak=1
    export memory_gempak_gdas="4GB"
    export memory_gempak_gfs="2GB"

    var_npe_node="npe_node_gempak_${RUN}"
    var_nth="nth_gempak_${RUN}"
    var_npe="npe_gempak_${RUN}"
    # RUN is set to a single value at setup time, so these won't be found
    # TODO rework setup_xml.py to initialize RUN to the applicable option
    if [[ -n "${!var_npe_node+0}" ]]; then
        declare -x "npe_node_gempak"="${!var_npe_node}" \
            "nth_gempak"="${!var_nth}" \
            "npe_gempak"="${!var_npe}"
    fi
    ;;
So nth_gempak is being set to ${nth_gempak_gdas}, which is undefined. I removed this assignment and am now rerunning the C96_atm3DVar_extended test on WCOSS2.
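The failure mode is easy to reproduce in isolation (a sketch mirroring the config.resources snippet above; the assigned values are illustrative):

```shell
# Sketch of the indirect-expansion gap: the guard only tests the
# npe_node_* name, so when nth_gempak_gdas was never defined, the
# indirect expansion ${!var_nth} silently yields an empty string,
# which is what produced the empty tpp= in the rocoto XML.
RUN="gdas"
npe_node_gempak_gdas=2                  # defined, so the guard passes
var_npe_node="npe_node_gempak_${RUN}"
var_nth="nth_gempak_${RUN}"             # names nth_gempak_gdas, which is unset
if [[ -n "${!var_npe_node+0}" ]]; then
  declare "nth_gempak=${!var_nth}"      # indirect expansion of an unset name
fi
echo "nth_gempak='${nth_gempak}'"       # empty string, not an error
```

A stricter guard would test each indirected name (or run under `set -u`) so an unset member of the family fails loudly instead of emitting an empty value.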
gdasgempak and gdasgempakmetancdc completed successfully on the first full cycle:
CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION
================================================================================================================================
202112210000 gdasgempak 153723934 SUCCEEDED 0 1 329.0
202112210000 gdasgempakmetancdc 153724367 SUCCEEDED 0 1 27.0
@WalterKolczynski-NOAA The manual C96_atm3DVar_extended test passed on WCOSS2.
CI Update on Wcoss2 at 06/24/24 05:16:04 PM
============================================
Cloning and Building global-workflow PR: 2672
with PID: 89079 on host: dlogin08
Automated global-workflow Testing Results:
Machine: Wcoss2
Start: Mon Jun 24 17:20:18 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/24/24 05:55:10 PM
Case setup: Completed for experiment C48_ATM_fe859507
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_fe859507
Case setup: Skipped for experiment C48_S2SWA_gefs_fe859507
Case setup: Completed for experiment C48_S2SW_fe859507
Case setup: Completed for experiment C96_atm3DVar_extended_fe859507
Case setup: Skipped for experiment C96_atm3DVar_fe859507
Case setup: Skipped for experiment C96_atmaerosnowDA_fe859507
Case setup: Completed for experiment C96C48_hybatmDA_fe859507
Case setup: Completed for experiment C96C48_ufs_hybatmDA_fe859507
Experiment C48_ATM_fe859507 SUCCESS on Wcoss2 at 06/24/24 07:48:18 PM
Experiment C48_S2SW_fe859507 SUCCESS on Wcoss2 at 06/24/24 07:48:22 PM
Experiment C96C48_hybatmDA_fe859507 SUCCESS on Wcoss2 at 06/24/24 09:16:19 PM
Experiment C96C48_ufs_hybatmDA_fe859507 SUCCESS on Wcoss2 at 06/24/24 09:16:23 PM
Experiment C96_atm3DVar_extended_fe859507 SUCCESS on Wcoss2 at 06/25/24 05:36:31 AM
All CI Test Cases Passed on Wcoss2:
Experiment C48_ATM_fe859507 *** SUCCESS *** at 06/24/24 07:48:18 PM
Experiment C48_S2SW_fe859507 *** SUCCESS *** at 06/24/24 07:48:22 PM
Experiment C96C48_hybatmDA_fe859507 *** SUCCESS *** at 06/24/24 09:16:19 PM
Experiment C96C48_ufs_hybatmDA_fe859507 *** SUCCESS *** at 06/24/24 09:16:23 PM
Experiment C96_atm3DVar_extended_fe859507 *** SUCCESS *** at 06/25/24 05:36:31 AM
Description
This redefines resource variables so they capture the RUN or CDUMP that they are valid for. Additionally, machine-specific resources are moved out of config.resources and placed in respective config.resources.{machine} files.
Resolves #177 #2672
Also helps address #2092 for WCOSS2 and Gaea
Type of change
Change characteristics
How has this been tested?
Created multiple experiments on Hercules. Additional tests to follow.
Checklist