NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Eliminate post groups #2667

Closed WalterKolczynski-NOAA closed 1 week ago

WalterKolczynski-NOAA commented 3 weeks ago

Description

Eliminates the post groups used for upp and products jobs so that each task only processes one forecast hour. This is more efficient and greatly simplifies downstream dependencies that depend on a specific forecast hour.
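As a rough illustration of the change described above, the sketch below contrasts grouping forecast hours into a few tasks with generating one task per forecast hour. The function names, the group-splitting rule, and the `atmos_prod_fHHH` task names are assumptions made for this example; they are not taken from the workflow code.

```python
from typing import List


def forecast_hours(fhmin: int, fhmax: int, fhout: int) -> List[int]:
    """Return every output forecast hour from fhmin to fhmax inclusive."""
    return list(range(fhmin, fhmax + 1, fhout))


def grouped_tasks(fhrs: List[int], ngroups: int) -> List[List[int]]:
    """Grouped style: split the hours into at most ngroups contiguous groups."""
    size = -(-len(fhrs) // ngroups)  # ceiling division
    return [fhrs[i:i + size] for i in range(0, len(fhrs), size)]


def per_hour_tasks(fhrs: List[int]) -> List[str]:
    """Per-hour style: one task per forecast hour, named after that hour."""
    return [f"atmos_prod_f{fhr:03d}" for fhr in fhrs]  # hypothetical names


fhrs = forecast_hours(0, 24, 3)
print(grouped_tasks(fhrs, 3))   # three tasks, each covering several hours
print(per_hour_tasks(fhrs))     # nine tasks, each covering exactly one hour
```

With one task per hour, a downstream job that needs hour 12 can name the single task that produces it instead of waiting for whichever group happens to contain it.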

Resolves #2666 Refs #2642

Type of change

Change characteristics

How has this been tested?

Checklist

JessicaMeixner-NOAA commented 3 weeks ago

@WalterKolczynski-NOAA I'm just curious how this affects things like really long runs, say SFS-length runs. Grouping the post jobs means far fewer post jobs, which might be nice for some systems. I do appreciate the simplification, though. I just wanted to put this in the discussion since I recently ran a long run and, well, it was a lot of post jobs!

aerorahul commented 3 weeks ago

More or fewer (post) jobs translate into shorter or longer walltime requests. Theoretically, breaking a group into smaller requests (walltime, etc.) gets them through the batch scheduler faster. It should also help in debugging and restarting failed jobs without rerunning the successful parts of a failed group. Collecting jobs in groups is the developer's way of gaming the scheduler. It's appealing, but someone needs to show objectively that having fewer jobs with longer walltimes is faster than having more jobs with shorter turnaround. One could put all the jobs in a single job request and churn through as many cycles and tasks as possible, but that's an extreme (it has been seen!). That would eliminate the need for a workflow or workflow manager, and it would also reduce the total number of jobs.
The number of jobs in the workflow is perhaps not something we should worry about; what matters is how they are executed and managed/driven.

Considering that this workflow will work on big and small machines, as well as on the cloud, this change manages resources better.

That's my 2c.
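The trade-off discussed in this comment can be put in numbers with a small sketch. Everything here is hypothetical: the 384-hour, 3-hourly forecast, the 5 minutes of processing per output time, and the `job_shape` helper are illustrative assumptions, not measurements from this PR. The sketch only counts jobs and sizes walltime requests; it says nothing about how long each request waits in the queue, which is the part that would need to be shown objectively.

```python
import math
from typing import Tuple


def job_shape(n_outputs: int, minutes_per_output: float,
              group_size: int) -> Tuple[int, float]:
    """Return (number of jobs, walltime request in minutes for a full group)."""
    n_jobs = math.ceil(n_outputs / group_size)
    walltime = group_size * minutes_per_output
    return n_jobs, walltime


# Hypothetical run: output every 3 hours from f000 through f384.
n_outputs = 384 // 3 + 1

for group_size in (1, 8, n_outputs):
    n_jobs, walltime = job_shape(n_outputs, 5.0, group_size)
    print(f"group_size={group_size}: {n_jobs} jobs, {walltime:.0f} min each")
```

A group size of 1 corresponds to the per-forecast-hour tasks in this PR, and a group size equal to the number of outputs corresponds to the single-job extreme.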

GwenChen-NOAA commented 3 weeks ago

@WalterKolczynski-NOAA, I'm working on breaking up the gempak job so that, instead of running all forecast hours at once (more than one hour of runtime), it processes one forecast hour at a time, the same as the awips job. The new scripts run successfully on WCOSS2 using my driver script, and they reduce the runtime to less than 4 min per task. They require some workflow changes to accommodate them. Can those changes be made in this PR?

WalterKolczynski-NOAA commented 3 weeks ago

They should be their own PR, but this PR should facilitate that work by simplifying the prerequisites (the gempak job will only have to wait for the products job of the same forecast hour, instead of all of them). I'm helping Bo do the same thing for the sounding job. I'm on leave most of today, but send me an email reminder and we can coordinate early next week.
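The prerequisite simplification mentioned here can be sketched as a dependency map. The task names (`gempak`, `gempak_fHHH`, `atmos_prod_fHHH`) and both helper functions are hypothetical and only illustrate the idea; the real dependencies are defined in the workflow's Rocoto XML generation.

```python
from typing import Dict, List


def prod_task(fhr: int) -> str:
    """Hypothetical name of the products task for one forecast hour."""
    return f"atmos_prod_f{fhr:03d}"


def deps_all_hours(fhrs: List[int]) -> Dict[str, List[str]]:
    """One gempak job for the whole forecast: it waits on every products task."""
    return {"gempak": [prod_task(fhr) for fhr in fhrs]}


def deps_same_hour(fhrs: List[int]) -> Dict[str, List[str]]:
    """One gempak task per hour: each waits only on the matching products task."""
    return {f"gempak_f{fhr:03d}": [prod_task(fhr)] for fhr in fhrs}


fhrs = [0, 3, 6]
print(deps_all_hours(fhrs))
print(deps_same_hour(fhrs))
```

In the per-hour form, the gempak task for hour 0 can start as soon as hour 0 products exist, rather than after the last forecast hour has been processed.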

GwenChen-NOAA commented 3 weeks ago

Thanks, @WalterKolczynski-NOAA! I will open a new PR to upload my new gempak scripts, so you can take a look. We can coordinate next week.

emcbot commented 3 weeks ago

CI Update on Wcoss2 at 06/08/24 03:16:18 AM
============================================
Cloning and Building global-workflow PR: 2667
with PID: 7183 on host: clogin05

emcbot commented 3 weeks ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Sat Jun  8 03:24:50 UTC 2024 on clogin05
---------------------------------------------------
Build: Completed at 06/08/24 03:58:53 AM
Case setup: Completed for experiment C48_ATM_11a65931
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_11a65931
Case setup: Skipped for experiment C48_S2SWA_gefs_11a65931
Case setup: Completed for experiment C48_S2SW_11a65931
Case setup: Completed for experiment C96_atm3DVar_extended_11a65931
Case setup: Skipped for experiment C96_atm3DVar_11a65931
Case setup: Skipped for experiment C96_atmaerosnowDA_11a65931
Case setup: Completed for experiment C96C48_hybatmDA_11a65931
Case setup: Completed for experiment C96C48_ufs_hybatmDA_11a65931

emcbot commented 3 weeks ago

Experiment C48_ATM_11a65931 SUCCESS on Wcoss2 at 06/08/24 05:06:21 AM

emcbot commented 3 weeks ago

Experiment C48_S2SW_11a65931 FAILED on Wcoss2 at 06/08/24 05:18:23 AM

Error logs:

Follow link here to view the contents of the above file(s): (link)

emcbot commented 3 weeks ago

Experiment C48_S2SWA_gefs FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C48_S2SWA_gefs_11a65931

emcbot commented 3 weeks ago

Experiment C48_ATM FAILED on Hercules with error logs:

/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C48_ATM_11a65931/logs/2021032312/gfsatmos_prod_f000.log
/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C48_ATM_11a65931/logs/2021032312/gfsatmos_prod_f003.log
/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C48_ATM_11a65931/logs/2021032312/gfsatmos_prod_f006.log
/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C48_ATM_11a65931/logs/2021032312/gfsatmos_prod_f009.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 3 weeks ago

Experiment C48_S2SW FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C48_S2SW_11a65931

emcbot commented 3 weeks ago

Experiment C48_ATM FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C48_ATM_11a65931

emcbot commented 3 weeks ago

Experiment C96C48_hybatmDA FAILED on Hercules with error logs:

/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C96C48_hybatmDA_11a65931/logs/2021122100/gfsatmos_prod_f012.log
/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C96C48_hybatmDA_11a65931/logs/2021122100/gfsatmos_prod_f015.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 3 weeks ago

Experiment C96C48_hybatmDA FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C96C48_hybatmDA_11a65931

emcbot commented 3 weeks ago

Experiment C96_atm3DVar FAILED on Hercules with error logs:

/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C96_atm3DVar_11a65931/logs/2021122100/gfsatmos_prod_f012.log
/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C96_atm3DVar_11a65931/logs/2021122100/gfsatmos_prod_f015.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 3 weeks ago

Experiment C96_atm3DVar FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C96_atm3DVar_11a65931

emcbot commented 3 weeks ago

Experiment C48_S2SW FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C48_S2SW_11a65931

TerrenceMcGuinness-NOAA commented 3 weeks ago

Timed out on Orion: https://github.com/emcbot/ci-global-workflows/tree/error_logs/ci/error_logs/timed_out

WalterKolczynski-NOAA commented 3 weeks ago

The wrong forecast hour lists are being generated for marine. Will look into it.

TerrenceMcGuinness-NOAA commented 3 weeks ago

Creating the experiment C48_S2SWA_gefs failed on Hera with KeyError: 'FHOUT_OCNICE'

Terry.McGuinness (hfe11) CI $ /scratch1/NCEPDEV/global/CI/2667/gefs/ci/scripts/utils/ci_utils_wrapper.sh create_experiment /scratch1/NCEPDEV/global/CI/2667/gefs/ci/cases/pr/C48_S2SWA_gefs.yaml

Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None
2024-06-11 15:40:28,849 - INFO     - root        : BEGIN: __main__.input_args
2024-06-11 15:40:28,849 - DEBUG    - root        : (  )
2024-06-11 15:40:28,850 - INFO     - root        :   END: __main__.input_args
2024-06-11 15:40:28,850 - DEBUG    - root        :  returning: Namespace(yaml=PosixPath('/scratch1/NCEPDEV/global/CI/2667/gefs/ci/cases/pr/C48_S2SWA_gefs.yaml'), overwrite=True)
2024-06-11 15:40:28,863 - INFO     - root        : Call: setup_expt.main()
2024-06-11 15:40:28,863 - DEBUG    - root        : setup_expt.py gefs forecast-only --pslot C48_S2SWA_gefs_6483983d --app S2SWA --resdetatmos 48 --resdetocean 5.0 --resensatmos 48 --nens 2 --gfs_cyc 1 --start cold --comroot UNDEFINED/COMROOT --expdir UNDEFINED/EXPDIR --idate 2021032312 --edate 2021032312 --yaml /scratch1/NCEPDEV/global/CI/2667/gefs/ci/cases/yamls/gefs_ci_defaults.yaml --overwrite
forecast-only mode treats ICs differently and cannot be staged here
EDITED:  UNDEFINED/EXPDIR/C48_S2SWA_gefs_6483983d/config.base as per user input.
****************************************************************************************************
EXPDIR: UNDEFINED/EXPDIR/C48_S2SWA_gefs_6483983d
ROTDIR: UNDEFINED/COMROOT/C48_S2SWA_gefs_6483983d
****************************************************************************************************
2024-06-11 15:40:29,019 - INFO     - root        : Call: setup_xml.main()
2024-06-11 15:40:29,019 - DEBUG    - root        : setup_xml.py /scratch1/NCEPDEV/global/CI/2667/gefs/UNDEFINED/EXPDIR/C48_S2SWA_gefs_6483983d
Finalizing initialize
sourcing config.stage_ic
sourcing config.fcst
sourcing config.atmos_products
sourcing config.efcs
sourcing config.atmos_ensstat
sourcing config.waveinit
sourcing config.wavepostsbs
sourcing config.wavepostpnt
sourcing config.oceanice_products
sourcing config.prep_emissions
component='atmos' fhmax=120 fhout=6
component='atmos' fhmax=120 fhout=6
Traceback (most recent call last):
  File "/scratch1/NCEPDEV/global/CI/2667/gefs//workflow/create_experiment.py", line 101, in <module>
    setup_xml.main(setup_xml_args)
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/setup_xml.py", line 73, in main
    xml = rocoto_xml_factory.create(f'{net}_{mode}', app_config, rocoto_param_dict)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/wxflow/factory.py", line 71, in create
    return self._builders[key](*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/rocoto/gefs_xml.py", line 14, in __init__
    super().__init__(app_config, rocoto_config)
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/rocoto/workflow_xml.py", line 27, in __init__
    task_list = get_wf_tasks(app_config)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/rocoto/workflow_tasks.py", line 21, in get_wf_tasks
    tasks.append(task_obj.get_task(task_name))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/rocoto/tasks.py", line 240, in get_task
    return getattr(self, task_name, *args, **kwargs)()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/rocoto/gefs_tasks.py", line 194, in ocean_prod
    return self._atmosoceaniceprod('ocean')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/rocoto/gefs_tasks.py", line 250, in _atmosoceaniceprod
    fhrs = self._get_forecast_hours('gefs', self._configs[config], component)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/rocoto/tasks.py", line 136, in _get_forecast_hours
    local_config['FHOUT'] = config['FHOUT_OCNICE']
                            ~~~~~~^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/global/CI/2667/gefs/workflow/wxflow/attrdict.py", line 84, in __missing__
    raise KeyError(name)
KeyError: 'FHOUT_OCNICE'
Terry.McGuinness (hfe11) CI $ 
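The traceback above ends in `config['FHOUT_OCNICE']` raising `KeyError` because the GEFS configuration passed in did not define that key. The sketch below reproduces that failure mode with a plain dictionary and shows one way a lookup could report it more descriptively. The `get_fhout` helper and its component-to-key rule are assumptions made for illustration; this is not the fix that was applied in the PR.

```python
from typing import Any, Dict


def get_fhout(config: Dict[str, Any], component: str) -> int:
    """Pick the output interval for a component, failing with a clear message."""
    # Assumed rule: marine components use FHOUT_OCNICE, everything else FHOUT.
    key = "FHOUT_OCNICE" if component in ("ocean", "ice") else "FHOUT"
    try:
        return config[key]
    except KeyError:
        raise KeyError(
            f"{key} is required for component '{component}' "
            f"but the config only defines {sorted(config)}"
        ) from None


atmos_only = {"FHOUT": 6}  # hypothetical config missing the marine key
print(get_fhout(atmos_only, "atmos"))

try:
    get_fhout(atmos_only, "ocean")
except KeyError as err:
    print(f"KeyError: {err}")
```

Naming the missing key, the component, and the keys that are present makes it quicker to tell whether the config file or the task definition needs to change.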
emcbot commented 3 weeks ago

Experiment C48_S2SWA_gefs FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2667/RUNTESTS/COMROOT/C48_S2SWA_gefs_b10c4867/logs/2021032312/prep_emissions.log
/scratch1/NCEPDEV/global/CI/2667/RUNTESTS/COMROOT/C48_S2SWA_gefs_b10c4867/logs/2021032312/stage_ic.log
/scratch1/NCEPDEV/global/CI/2667/RUNTESTS/COMROOT/C48_S2SWA_gefs_b10c4867/logs/2021032312/wave_init.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 3 weeks ago

Experiment C48_S2SWA_gefs FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C48_S2SWA_gefs_b10c4867

emcbot commented 3 weeks ago

Experiment C48_ATM FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C48_ATM_b10c4867

emcbot commented 3 weeks ago

Experiment C96C48_hybatmDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C96C48_hybatmDA_b10c4867

emcbot commented 3 weeks ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C48mx500_3DVarAOWCDA_b10c4867

emcbot commented 3 weeks ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C96_atmaerosnowDA_b10c4867

emcbot commented 3 weeks ago

Experiment C48_S2SW FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C48_S2SW_b10c4867

emcbot commented 3 weeks ago

Experiment C96_atm3DVar FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C96_atm3DVar_b10c4867

emcbot commented 3 weeks ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2667/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_22f814ad/logs/2021032418/gdasocnanalbmat.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 3 weeks ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C48mx500_3DVarAOWCDA_22f814ad

emcbot commented 3 weeks ago

Experiment C96_atmaerosnowDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2667/RUNTESTS/COMROOT/C96_atmaerosnowDA_22f814ad/logs/2021122018/gdasprepsnowobs.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 3 weeks ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C96_atmaerosnowDA_22f814ad

emcbot commented 2 weeks ago

CI Update on Wcoss2 at 06/17/24 08:06:15 PM
============================================
Cloning and Building global-workflow PR: 2667
with PID: 10303 on host: dlogin08

emcbot commented 2 weeks ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Mon Jun 17 20:10:53 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/17/24 08:46:25 PM
*** Failed *** to create experiment: C48_ATM_d63f0b62 on Wcoss2

Traceback (most recent call last):
  File "/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2667/global-workflow/workflow/create_experiment.py", line 29, in <module>
    import setup_xml
  File "/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2667/global-workflow/workflow/setup_xml.py", line 10, in <module>
    from rocoto.rocoto_xml_factory import rocoto_xml_factory
  File "/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2667/global-workflow/workflow/rocoto/rocoto_xml_factory.py", line 2, in <module>
    from rocoto.gfs_cycled_xml import GFSCycledRocotoXML
  File "/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2667/global-workflow/workflow/rocoto/gfs_cycled_xml.py", line 3, in <module>
    from rocoto.workflow_xml import RocotoXML
  File "/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2667/global-workflow/workflow/rocoto/workflow_xml.py", line 9, in <module>
    from rocoto.workflow_tasks import get_wf_tasks
  File "/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2667/global-workflow/workflow/rocoto/workflow_tasks.py", line 5, in <module>
    from rocoto.tasks_factory import tasks_factory
  File "/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2667/global-workflow/workflow/rocoto/tasks_factory.py", line 2, in <module>
    from rocoto.gfs_tasks import GFSTasks
  File "/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2667/global-workflow/workflow/rocoto/gfs_tasks.py", line 2, in <module>
    from rocoto.tasks import Tasks
  File "<fstring>", line 1
    (component=)
              ^
SyntaxError: invalid syntax
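The `(component=)` fragment in this traceback is what an f-string using the `=` debugging specifier looks like to an interpreter that does not support it. That specifier was added in Python 3.8, so a likely reading (an inference from the log, not something stated in the thread) is that the Python used by this WCOSS2 job predates 3.8 and hit a leftover debug print like the one that produced `component='atmos' fhmax=120 fhout=6` in the earlier Hera log. The sketch below shows the construct and a portable equivalent; the variable values are copied from that log line.

```python
component, fhmax, fhout = "atmos", 120, 6

# Python 3.8+ only: "=" echoes each expression followed by its repr.
# Left commented out here because on an older interpreter the whole module
# would fail to import with "SyntaxError: invalid syntax".
# print(f"{component=} {fhmax=} {fhout=}")

# Portable equivalent that produces the same text on Python 3.6 and later.
message = f"component={component!r} fhmax={fhmax!r} fhout={fhout!r}"
print(message)
```

Because the error is raised when the module is compiled, it appears at import time even if the print statement would never have been executed.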
emcbot commented 2 weeks ago

CI Update on Wcoss2 at 06/17/24 09:00:45 PM
============================================
Cloning and Building global-workflow PR: 2667
with PID: 151379 on host: dlogin08

emcbot commented 2 weeks ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Mon Jun 17 21:04:53 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/17/24 09:40:35 PM
Case setup: Completed for experiment C48_ATM_a29372b9
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_a29372b9
Case setup: Skipped for experiment C48_S2SWA_gefs_a29372b9
Case setup: Completed for experiment C48_S2SW_a29372b9
Case setup: Completed for experiment C96_atm3DVar_extended_a29372b9
Case setup: Skipped for experiment C96_atm3DVar_a29372b9
Case setup: Skipped for experiment C96_atmaerosnowDA_a29372b9
Case setup: Completed for experiment C96C48_hybatmDA_a29372b9
Case setup: Completed for experiment C96C48_ufs_hybatmDA_a29372b9

emcbot commented 2 weeks ago

Experiment C48_ATM_a29372b9 SUCCESS on Wcoss2 at 06/17/24 10:52:13 PM

emcbot commented 2 weeks ago

Experiment C48_S2SW_a29372b9 SUCCESS on Wcoss2 at 06/17/24 11:04:17 PM

emcbot commented 2 weeks ago

Experiment C96C48_ufs_hybatmDA_a29372b9 SUCCESS on Wcoss2 at 06/18/24 12:12:17 AM

emcbot commented 2 weeks ago

Experiment C96C48_hybatmDA_a29372b9 SUCCESS on Wcoss2 at 06/18/24 12:24:16 AM

emcbot commented 2 weeks ago

Experiment C96_atm3DVar_extended_a29372b9 SUCCESS on Wcoss2 at 06/18/24 05:56:30 AM

emcbot commented 2 weeks ago

All CI Test Cases Passed on Wcoss2:


Experiment C48_ATM_a29372b9 *** SUCCESS *** at 06/17/24 10:52:13 PM
Experiment C48_S2SW_a29372b9 *** SUCCESS *** at 06/17/24 11:04:17 PM
Experiment C96C48_ufs_hybatmDA_a29372b9 *** SUCCESS *** at 06/18/24 12:12:17 AM
Experiment C96C48_hybatmDA_a29372b9 *** SUCCESS *** at 06/18/24 12:24:16 AM
Experiment C96_atm3DVar_extended_a29372b9 *** SUCCESS *** at 06/18/24 05:56:30 AM

emcbot commented 2 weeks ago

Experiment C48_S2SWA_gefs FAILED on Hercules with error logs:

/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C48_S2SWA_gefs_10bcdcba/logs/2021032312/prep_emissions.log
/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C48_S2SWA_gefs_10bcdcba/logs/2021032312/stage_ic.log
/work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/COMROOT/C48_S2SWA_gefs_10bcdcba/logs/2021032312/wave_init.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 2 weeks ago

Experiment C48_S2SWA_gefs FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C48_S2SWA_gefs_10bcdcba

emcbot commented 2 weeks ago

Experiment C96C48_hybatmDA FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C96C48_hybatmDA_10bcdcba

emcbot commented 2 weeks ago

Experiment C48_ATM FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C48_ATM_10bcdcba

emcbot commented 2 weeks ago

Experiment C96_atm3DVar FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C96_atm3DVar_10bcdcba

emcbot commented 2 weeks ago

Experiment C48_S2SW FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2667/RUNTESTS/C48_S2SW_10bcdcba

emcbot commented 2 weeks ago

Experiment C48_S2SW FAILED on Hera in /scratch1/NCEPDEV/global/CI/2667/RUNTESTS/C48_S2SW_10bcdcba