NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Assign machine- and RUN-specific resources #2672

Closed: DavidHuber-NOAA closed this 1 week ago

DavidHuber-NOAA commented 3 weeks ago

Description

This redefines resource variables so that they capture the RUN or CDUMP for which they are valid. Additionally, machine-specific resources are moved out of config.resources and into their respective config.resources.{machine} files.

Resolves #177. Also helps address #2092 for WCOSS2 and Gaea.
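
For illustration only, a minimal sketch of that layering (the step name, task counts, and paths below are hypothetical placeholders, not values from this PR): config.resources keeps RUN-aware defaults, and a config.resources.{machine} file overrides only what a given platform needs.

#!/usr/bin/env bash
# Hypothetical excerpt from config.resources: defaults that depend on RUN
case "${step}" in
  "fcst")
    walltime="06:00:00"
    case "${RUN}" in
      "gfs")  export ntasks_fcst_gfs=400 ;;
      "gdas") export ntasks_fcst_gdas=200 ;;
    esac
    ;;
esac

# Machine-specific tweaks live in their own file and override the defaults above
if [[ -f "${EXPDIR}/config.resources.${machine}" ]]; then
  source "${EXPDIR}/config.resources.${machine}"
fi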

Type of change

Change characteristics

How has this been tested?

Created multiple experiments on Hercules. Additional tests to follow.

Checklist

DavidHuber-NOAA commented 3 weeks ago

@InnocentSouopgui-NOAA after I have tested this on a couple machines, would you be able to test a forecast-only and cycled case on S4 and Jet? I'll let you know when it is ready to go.

InnocentSouopgui-NOAA commented 3 weeks ago

> @InnocentSouopgui-NOAA after I have tested this on a couple machines, would you be able to test a forecast-only and cycled case on S4 and Jet? I'll let you know when it is ready to go.

Sure, let me know when it is ready to test on S4.

DavidHuber-NOAA commented 2 weeks ago

@InnocentSouopgui-NOAA This PR is now ready to be tested on S4 and Jet. Would you mind running some cycled tests on both? I'm not sure if an S2SW test is possible on Jet, but if so, could you run a forecast-only case there as well?

InnocentSouopgui-NOAA commented 2 weeks ago

> @InnocentSouopgui-NOAA This PR is now ready to be tested on S4 and Jet. Would you mind running some cycled tests on both? I'm not sure if an S2SW test is possible on Jet, but if so, could you run a forecast-only case there as well?

I will get on it today, and check if S2SW is available on Jet.

InnocentSouopgui-NOAA commented 2 weeks ago

@DavidHuber-NOAA, I have a couple of failing tasks on S4 that I am investigating.

enkfgdasediag fails silently (the job submission fails but the script does not detect it), and the dependent task enkfgdaseupd also fails. Below is the error message from the submission of enkfgdasediag; you may be able to pinpoint the problem quickly.

gfsfcst is failing as well, but the error message and the point of failure differ from run to run.

+ exglobal_diag.sh[225]: ncmd_max=32
++ exglobal_diag.sh[226]: eval echo srun -l --export=ALL -n '$ncmd' --multi-prog --output=mpmd.%j.%t.out
+++ exglobal_diag.sh[226]: echo srun -l --export=ALL -n 34 --multi-prog --output=mpmd.%j.%t.out
+ exglobal_diag.sh[226]: APRUNCFP_DIAG='srun -l --export=ALL -n 34 --multi-prog --output=mpmd.%j.%t.out'
+ exglobal_diag.sh[227]: srun -l --export=ALL -n 34 --multi-prog --output=mpmd.%j.%t.out /scratch/users/isouopgui/RUNDIRS/test30/ediag.234500/mp_diag.sh
srun: Warning: can't honor --ntasks-per-node set to 32 which doesn't match the requested tasks 34 with the number of requested nodes 2. Ignoring --ntasks-per-node.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: Exceeded job memory limit
srun: error: s4-204-c4: tasks 0,5,11-16: Killed
srun: Terminating job step 25889898.0
srun: error: s4-204-c6: tasks 17-28,30-33: Killed

DavidHuber-NOAA commented 2 weeks ago

@InnocentSouopgui-NOAA I increased the memory request for the ediag job on S4. See if that works for you.

What resolution(s) are you running the fcst at when it fails? Is there any indication that it may be an out-of-memory issue? If not, perhaps there is a node issue?

InnocentSouopgui-NOAA commented 2 weeks ago

> @InnocentSouopgui-NOAA I increased the memory request for the ediag job on S4. See if that works for you.
>
> What resolution(s) are you running the fcst at when it fails? Is there any indication that it may be an out-of-memory issue? If not, perhaps there is a node issue?

enkfgdasediag succeeds with the update. gfsfcst is still failing.

I can't pinpoint what is going wrong. Looking at the logfile, the UFS executable itself seems to run to the end.

The only error message I find towards the end of the logfile is

<gw>/ush/forecast_postdet.sh: line 245: restart_dates[@]: unbound variable

I am not seeing that message on Jet, so it looks like a setting on S4 (or a missing setting) is causing the problem. If you want to have a look, the logfiles are at /scratch/users/isouopgui/test/test30/logs/2021110900

InnocentSouopgui-NOAA commented 1 week ago

On S4, at resolution 192/96, UPP tasks are failing because of OpenMP over-allocation of resources. What would be the best place to set OMP_NUM_THREADS to prevent that? I see a couple of possibilities:

1) Explicitly set OMP_NUM_THREADS for UPP tasks. For instance, there is a variable NTHREADS_UPP computed in /env/.env, but I don't find it used anywhere. It looks like the best candidate for OMP_NUM_THREADS in UPP tasks, and it could be used to set OMP_NUM_THREADS right there in /env/.env.
2) Set OMP_NUM_THREADS to 1 early, and let tasks that rely on OpenMP set it when they run.
3) A solution for S4 only that uses either of the above.

DavidHuber-NOAA commented 1 week ago

@InnocentSouopgui-NOAA Could you try rocotobooting the job again? The run directory (/scratch/users/isouopgui/RUNDIRS/test30/gfsfcst.2021110900/fcst.115872) no longer exists. Thanks.
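
For reference, rebooting the task would look something like the following; the database and XML file paths here are assumed from the experiment name, not copied from the actual setup.

# Hypothetical paths for the test30 experiment on S4
rocotoboot -d /scratch/users/isouopgui/test/test30/test30.db \
           -w /scratch/users/isouopgui/test/test30/test30.xml \
           -c 202111090000 -t gfsfcst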

DavidHuber-NOAA commented 1 week ago

> On S4, at resolution 192/96, UPP tasks are failing because of OpenMP over-allocation of resources. What would be the best place to set OMP_NUM_THREADS to prevent that?

@InnocentSouopgui-NOAA It is strange that the UPP is attempting to run threaded. I suspect this is an issue with the sbatch submission on S4 not properly picking up the thread count from rocoto. I think your first solution of adding it to env/S4.env is the right way to go, as I have not seen this on any other system.
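
A minimal sketch of that first option, assuming env/S4.env follows the same per-step pattern as the other env files (the variable names nth_upp and NTHREADS_UPP are assumptions here, not verified against the file):

# Hypothetical snippet for env/S4.env, guarded on the UPP step
if [[ "${step}" = "upp" ]]; then
    export NTHREADS_UPP="${nth_upp:-1}"
    # Pin the OpenMP thread count so sbatch on S4 cannot over-allocate threads
    export OMP_NUM_THREADS="${NTHREADS_UPP}"
fi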

InnocentSouopgui-NOAA commented 1 week ago

> @InnocentSouopgui-NOAA Could you try rocotobooting the job again? The run directory (/scratch/users/isouopgui/RUNDIRS/test30/gfsfcst.2021110900/fcst.115872) no longer exists. Thanks.

Done.

DavidHuber-NOAA commented 1 week ago

> @InnocentSouopgui-NOAA Could you try rocotobooting the job again? The run directory (/scratch/users/isouopgui/RUNDIRS/test30/gfsfcst.2021110900/fcst.115872) no longer exists. Thanks.
>
> Done.

@InnocentSouopgui-NOAA It looks like the job failed because the inputs for the gfsfcst are no longer available:

 FATAL ERROR: Cold start ICs are missing from '/scratch/users/isouopgui/test/test30/gfs.20211109/00//model_data/atmos/input'

Perhaps you could try rebooting the last gfsfcst (for cycle 202111110000)?

InnocentSouopgui-NOAA commented 1 week ago

> @InnocentSouopgui-NOAA Could you try rocotobooting the job again? The run directory (/scratch/users/isouopgui/RUNDIRS/test30/gfsfcst.2021110900/fcst.115872) no longer exists. Thanks.
>
> Done.
>
> @InnocentSouopgui-NOAA It looks like the job failed because the inputs for the gfsfcst are no longer available:
>
>  FATAL ERROR: Cold start ICs are missing from '/scratch/users/isouopgui/test/test30/gfs.20211109/00//model_data/atmos/input'
>
> Perhaps you could try rebooting the last gfsfcst (for cycle 202111110000)?

It's a little trickier. What you saw is the error message from the third attempt, which is different from the one from the first attempt. Have a look at the error message from the first attempt: gfsfcst is failing in all runs so far on S4. However, the failure does not seem to affect subsequent tasks and cycles.

The only error message that I find in the logfile from the first attempt is the following:

/ush/forecast_postdet.sh: line 245: restart_dates[@]: unbound variable

It looks like S4 is not happy with restart_dates[@] when the array is empty. That is my suspicion for now.

DavidHuber-NOAA commented 1 week ago

@InnocentSouopgui-NOAA I see now, thanks for explaining that again for me. So this looks like an issue with bash versions: the interpreters on Hera, Hercules, etc. are newer than the one on S4 and treat expansions of empty arrays as valid references, but S4's older bash treats them as unbound variables. The way around this is to add a check for an empty array before running the for loop.
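
As a standalone illustration of the difference (not workflow code): under set -u, expanding an empty array errors on bash releases older than 4.4 but iterates zero times on newer ones, so guarding on the element count keeps the loop portable.

#!/usr/bin/env bash
set -u

restart_dates=()

# bash < 4.4 (apparently the case on S4) aborts on the unguarded expansion with
# "restart_dates[@]: unbound variable"; bash >= 4.4 simply skips the loop.
# Checking the element count first works on both:
if (( ${#restart_dates[@]} > 0 )); then
  for restart_date in "${restart_dates[@]}"; do
    echo "would copy restarts for ${restart_date}"
  done
fi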

Problem code https://github.com/NOAA-EMC/global-workflow/blob/8993b42cb91144c0ab0501dc7841ea8d675c4701/ush/forecast_postdet.sh#L257-L284

Fix

  restart_dates=()
  # Copy restarts in the assimilation window for RUN=gdas|enkfgdas|enkfgfs
  if [[ "${RUN}" =~ "gdas" || "${RUN}" == "enkfgfs" ]]; then
    restart_date="${model_start_date_next_cycle}"
    while (( restart_date <= forecast_end_cycle )); do
      restart_dates+=("${restart_date:0:8}.${restart_date:8:2}0000")
      restart_date=$(date --utc -d "${restart_date:0:8} ${restart_date:8:2} + ${restart_interval} hours" +%Y%m%d%H)
    done
  elif [[ "${RUN}" == "gfs" || "${RUN}" == "gefs" ]]; then # Copy restarts at the end of the forecast segment for RUN=gfs|gefs
    if [[ "${COPY_FINAL_RESTARTS}" == "YES" ]]; then
      restart_dates+=("${forecast_end_cycle:0:8}.${forecast_end_cycle:8:2}0000")
    fi
  fi

  ### Check that there are restart files to copy
  if [[ ${#restart_dates[@]} -gt 0 ]]; then
    # Get list of FV3 restart files
    local file_list fv3_file
    file_list=$(FV3_restarts)

    # Copy restarts for the dates collected above to COM
    for restart_date in "${restart_dates[@]}"; do
      echo "Copying FV3 restarts for 'RUN=${RUN}' at ${restart_date}"
      for fv3_file in ${file_list}; do
        ${NCP} "${DATArestart}/FV3_RESTART/${restart_date}.${fv3_file}" \
               "${COMOUT_ATMOS_RESTART}/${restart_date}.${fv3_file}"
      done
    done

    echo "SUB ${FUNCNAME[0]}: Output data for FV3 copied"
  fi
}

DavidHuber-NOAA commented 1 week ago

@InnocentSouopgui-NOAA I have put the fix in place. Can you check it again on S4? You'll probably need to restart the experiment completely.

InnocentSouopgui-NOAA commented 1 week ago

I successfully ran a couple of tests on S4 and Jet.

I submitted a pull request to @DavidHuber-NOAA's global-workflow repo with a couple of modifications I needed to get things running smoothly.

emcbot commented 1 week ago

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2672

emcbot commented 1 week ago

CI Update on Wcoss2 at 06/23/24 06:36:08 AM
============================================
Cloning and Building global-workflow PR: 2672
with PID: 139980 on host: dlogin08

emcbot commented 1 week ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Sun Jun 23 06:40:00 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/23/24 07:14:33 AM
Case setup: Completed for experiment C48_ATM_cb03a525
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_cb03a525
Case setup: Skipped for experiment C48_S2SWA_gefs_cb03a525
Case setup: Completed for experiment C48_S2SW_cb03a525
Case setup: Completed for experiment C96_atm3DVar_extended_cb03a525
Case setup: Skipped for experiment C96_atm3DVar_cb03a525
Case setup: Skipped for experiment C96_atmaerosnowDA_cb03a525
Case setup: Completed for experiment C96C48_hybatmDA_cb03a525
Case setup: Completed for experiment C96C48_ufs_hybatmDA_cb03a525

emcbot commented 1 week ago

Experiment C96C48_ufs_hybatmDA_cb03a525 FAIL on Wcoss2 at 06/23/24 08:12:27 AM

Error logs:

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_cb03a525/logs/2024022400/enkfgdasatmensanlfv3inc.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 1 week ago

CI Passed Hercules at
Built and ran in directory /work2/noaa/stmp/CI/HERCULES/2672

DavidHuber-NOAA commented 1 week ago

Fixed an issue with the atmensanlfv3inc declaration in config.resources that was causing the job to fail on WCOSS2. Relaunching WCOSS2 CI.

emcbot commented 1 week ago

CI Update on Wcoss2 at 06/23/24 11:09:52 AM
============================================
Cloning and Building global-workflow PR: 2672
with PID: 5467 on host: dlogin08

emcbot commented 1 week ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Sun Jun 23 11:14:49 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/23/24 11:50:02 AM
Case setup: Completed for experiment C48_ATM_184d2359
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_184d2359
Case setup: Skipped for experiment C48_S2SWA_gefs_184d2359
Case setup: Completed for experiment C48_S2SW_184d2359
Case setup: Completed for experiment C96_atm3DVar_extended_184d2359
Case setup: Skipped for experiment C96_atm3DVar_184d2359
Case setup: Skipped for experiment C96_atmaerosnowDA_184d2359
Case setup: Completed for experiment C96C48_hybatmDA_184d2359
Case setup: Completed for experiment C96C48_ufs_hybatmDA_184d2359

emcbot commented 1 week ago

Experiment C48_ATM_184d2359 SUCCESS on Wcoss2 at 06/23/24 01:00:14 PM

emcbot commented 1 week ago

Experiment C48_S2SW_184d2359 SUCCESS on Wcoss2 at 06/23/24 01:12:34 PM

emcbot commented 1 week ago

Experiment C96C48_hybatmDA_184d2359 SUCCESS on Wcoss2 at 06/23/24 02:04:15 PM

emcbot commented 1 week ago

Experiment C96C48_ufs_hybatmDA_184d2359 SUCCESS on Wcoss2 at 06/23/24 02:12:19 PM

emcbot commented 1 week ago

Experiment C96_atm3DVar_extended_184d2359 FAIL on Wcoss2 at 06/23/24 08:04:19 PM

Error logs:

Follow link here to view the contents of the above file(s): (link)

WalterKolczynski-NOAA commented 1 week ago

None of the gempak jobs ran on WCOSS even though the dependencies are satisfied:

WCOSS2 (BACKUPSYS) C96_atm3DVar_extended_184d2359> rocotocheck -d C96_atm3DVar_extended_184d2359.db -w C96_atm3DVar_extended_184d2359.xml -t gdasgempak -c 202112210000

Task: gdasgempak
  account: GFS-DEV
  command: /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/global-workflow/jobs/rocoto/gempak.sh
  cores: 2
  cycledefs: gdas
  final: false
  jobname: C96_atm3DVar_extended_184d2359_gdasgempak_00
  join: /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/COMROOT/C96_atm3DVar_extended_184d2359/logs/2021122100/gdasgempak.log
  maxtries: 2
  memory: 4GB
  name: gdasgempak
  nodes: 1:ppn=2:tpp=
  queue: dev
  throttle: 9999999
  walltime: 03:00:00
  environment
    CDATE ==> 2021122100
    CDUMP ==> gdas
    COMROOT ==> /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/COMROOT
    DATAROOT ==> /lfs/h2/emc/stmp/terry.mcguinness/RUNDIRS/C96_atm3DVar_extended_184d2359
    EXPDIR ==> /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/EXPDIR/C96_atm3DVar_extended_184d2359
    HOMEgfs ==> /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/global-workflow
    NET ==> gfs
    PDY ==> 20211221
    RUN ==> gdas
    RUN_ENVIR ==> emc
    cyc ==> 00
  dependencies
    SOME is satisfied
      gdasatmos_prod_f000 of cycle 202112210000 is SUCCEEDED
      gdasatmos_prod_f003 of cycle 202112210000 is SUCCEEDED
      gdasatmos_prod_f006 of cycle 202112210000 is SUCCEEDED
      gdasatmos_prod_f009 of cycle 202112210000 is SUCCEEDED

Cycle: 202112210000
  Valid for this task: YES
  State: active
  Activated: 2024-06-23 11:55:17 UTC
  Completed: -
  Expired: -

Job: This task has not been submitted for this cycle

Task can not be submitted because:

Is the job specification malformed, preventing submission to torque?

DavidHuber-NOAA commented 1 week ago

@WalterKolczynski-NOAA Looking at the rocoto log file /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/RUNTESTS/EXPDIR/C96_atm3DVar_extended_184d2359/logs/2021122100.log, I am seeing the message

2220 2024-06-23 19:50:13 +0000 :: dlogin09 :: Submission of gdasgempak failed!  qsub: Unknown resource: tpp
2221 Job submit error: 187.

Looking in the xml, I see

<task name="gdasgempak" cycledefs="gdas" maxtries="&MAXTRIES;">

   <command>/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2672/global-workflow/jobs/rocoto/gempak.sh</command>

   <jobname><cyclestr>C96_atm3DVar_extended_184d2359_gdasgempak_@H</cyclestr></jobname>
   <account>GFS-DEV</account>
   <queue>dev</queue>
   <walltime>03:00:00</walltime>
   <nodes>1:ppn=2:tpp=</nodes>

So it seems tpp is not being set. Looking in config.resources, I see the issue:

  "gempak")
    export wtime_gempak="03:00:00"
    export npe_gempak_gdas=2
    export npe_gempak_gfs=28
    export npe_node_gempak_gdas=2
    export npe_node_gempak_gfs=28
    export nth_gempak=1
    export memory_gempak_gdas="4GB"
    export memory_gempak_gfs="2GB"

    var_npe_node="npe_node_gempak_${RUN}"
    var_nth="nth_gempak_${RUN}"
    var_npe="npe_gempak_${RUN}"
    # RUN is set to a single value at setup time, so these won't be found
    # TODO rework setup_xml.py to initialize RUN to the applicable option
    if [[ -n "${!var_npe_node+0}" ]]; then
      declare -x "npe_node_gempak"="${!var_npe_node}" \
                 "nth_gempak"="${!var_nth}" \                  
                 "npe_gempak"="${!var_npe}"
    fi
    ;;

So nth_gempak is being set to ${nth_gempak_gdas}, which is undefined. I removed this assignment and am now rerunning the C96_atm3DVar_extended test on WCOSS2.
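
For reference, a sketch of what the block looks like with that assignment dropped (mirroring the snippet above; this may not match the committed change exactly):

  "gempak")
    # (walltime, npe_*, nth_gempak, and memory_* exports unchanged from above)

    var_npe_node="npe_node_gempak_${RUN}"
    var_npe="npe_gempak_${RUN}"
    # RUN is set to a single value at setup time, so these won't be found
    # TODO rework setup_xml.py to initialize RUN to the applicable option
    if [[ -n "${!var_npe_node+0}" ]]; then
      # nth_gempak already holds a RUN-independent value, so it is not remapped here
      declare -x "npe_node_gempak"="${!var_npe_node}" \
                 "npe_gempak"="${!var_npe}"
    fi
    ;;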

DavidHuber-NOAA commented 1 week ago

gdasgempak and gdasgempakmetancdc completed successfully on the first full cycle:


       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202112210000              gdasgempak                   153723934           SUCCEEDED                   0         1         329.0
202112210000      gdasgempakmetancdc                   153724367           SUCCEEDED                   0         1          27.0

DavidHuber-NOAA commented 1 week ago

@WalterKolczynski-NOAA The manual C96_atm3DVar_extended test passed on WCOSS2.

emcbot commented 1 week ago

CI Update on Wcoss2 at 06/24/24 05:16:04 PM
============================================
Cloning and Building global-workflow PR: 2672
with PID: 89079 on host: dlogin08

emcbot commented 1 week ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Mon Jun 24 17:20:18 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/24/24 05:55:10 PM
Case setup: Completed for experiment C48_ATM_fe859507
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_fe859507
Case setup: Skipped for experiment C48_S2SWA_gefs_fe859507
Case setup: Completed for experiment C48_S2SW_fe859507
Case setup: Completed for experiment C96_atm3DVar_extended_fe859507
Case setup: Skipped for experiment C96_atm3DVar_fe859507
Case setup: Skipped for experiment C96_atmaerosnowDA_fe859507
Case setup: Completed for experiment C96C48_hybatmDA_fe859507
Case setup: Completed for experiment C96C48_ufs_hybatmDA_fe859507

emcbot commented 1 week ago

Experiment C48_ATM_fe859507 SUCCESS on Wcoss2 at 06/24/24 07:48:18 PM

emcbot commented 1 week ago

Experiment C48_S2SW_fe859507 SUCCESS on Wcoss2 at 06/24/24 07:48:22 PM

emcbot commented 1 week ago

Experiment C96C48_hybatmDA_fe859507 SUCCESS on Wcoss2 at 06/24/24 09:16:19 PM

emcbot commented 1 week ago

Experiment C96C48_ufs_hybatmDA_fe859507 SUCCESS on Wcoss2 at 06/24/24 09:16:23 PM

emcbot commented 1 week ago

Experiment C96_atm3DVar_extended_fe859507 SUCCESS on Wcoss2 at 06/25/24 05:36:31 AM

emcbot commented 1 week ago

All CI Test Cases Passed on Wcoss2:


Experiment C48_ATM_fe859507 *** SUCCESS *** at 06/24/24 07:48:18 PM
Experiment C48_S2SW_fe859507 *** SUCCESS *** at 06/24/24 07:48:22 PM
Experiment C96C48_hybatmDA_fe859507 *** SUCCESS *** at 06/24/24 09:16:19 PM
Experiment C96C48_ufs_hybatmDA_fe859507 *** SUCCESS *** at 06/24/24 09:16:23 PM
Experiment C96_atm3DVar_extended_fe859507 *** SUCCESS *** at 06/25/24 05:36:31 AM