NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Port to phase 3.5 #21

Closed by KateFriedman-NOAA 4 years ago

KateFriedman-NOAA commented 4 years ago

Port global-workflow to the new WCOSS phase 3.5.

KateFriedman-NOAA commented 4 years ago

Pulled in checklist from Hera port but may not need many items. We may just need to use a different queue. Fanglin is going to test and confirm with NCO before we go too far.

yangfanglin commented 4 years ago

Kate,

I have started to run a cycled experiment on Phase 3.5 on Venus. Porting is not needed. Only a few minor changes to the scripts are required to make the best use of the 40 tasks per node.

Fanglin

-- Fanglin Yang, Ph.D. Physical Scientist Environmental Modeling Center National Centers for Environmental Prediction 301-6833722; fanglin.yang@noaa.gov http://www.emc.ncep.noaa.gov/gmb/wx24fy/fyang/ http://www.emc.ncep.noaa.gov/gmb/STATS_vsdb/

KateFriedman-NOAA commented 4 years ago

@yangfanglin That's great news, thanks for reporting back! Can you point us to your global-workflow clone on 3.5? I'd like to take a peek at what changes were needed. Are you planning to bring the changes back to the feature/gfsv16b branch or would you like us to pull them into develop?

yangfanglin commented 4 years ago

Kate,

I installed a fresh copy of the workflow at /gpfs/dell6/emc/modeling/noscrub/emc.glopara/git/global-workflow/gfsv16b. My EXPDIR is /gpfs/dell6/emc/modeling/noscrub/emc.glopara/para_gfs/fv3test, which is set up using this workflow. I made a few minor local changes to env*, config.fv3, config.resources, etc. I am still running gdas cycles to make sure fv3test can reproduce v16rt2. My changes are temporary. We need to think about how to support jobs running on the same "machine" but with different "queues" which have different task count per node.

Fanglin

KateFriedman-NOAA commented 4 years ago

> We need to think about how to support jobs running on the same "machine" but with different "queues" which have different task count per node.

@yangfanglin Gotcha, thanks for the path to the clone on 3.5.

@Hang-Lei-NOAA , I removed the checklist since we won't need it.

The $machine is defined/detected in ush/rocoto/workflow_utils.py by checking the paths:

import os

def detectMachine():
    # Supported platforms, listed in the error message below
    machines = ['HERA', 'WCOSS_C', 'WCOSS_DELL_P3']

    # Identify the platform from filesystem paths present on each system
    if os.path.exists('/scratch1/NCEPDEV'):
        return 'HERA'
    elif os.path.exists('/gpfs') and os.path.exists('/etc/SuSE-release'):
        return 'WCOSS_C'
    elif os.path.exists('/gpfs/dell2'):
        return 'WCOSS_DELL_P3'
    else:
        print('workflow is currently only supported on: %s' % ' '.join(machines))
        raise NotImplementedError('Cannot auto-detect platform, ABORT!')

We would need something in that section to define "WCOSS_DELL_P3p5" (or other chosen $machine name). Checking paths won't work since you can see all /gpfs/dell# directories from both phase 3 and phase 3.5. Once we come up with a way to set "WCOSS_DELL_P3p5" in workflow_utils.py the rest (like setting queues) should be easy. We will need to adjust the phase 3 detection in that if-block too most likely.

Is there something on phase 3.5 that differentiates it from phase 3? An environment variable or path that is only on phase 3.5? @GeorgeVandenberghe-NOAA
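
To illustrate why path checks alone cannot separate the two phases, here is a minimal sketch that mirrors the checks above but takes the filesystem probe as an argument. The function name `detect_machine` and the two sets of visible paths are hypothetical; they only assume, as stated above, that the same /gpfs/dell# directories are visible from both phase 3 and phase 3.5.

```python
def detect_machine(path_exists):
    """Mirror the path checks above, with the filesystem probe passed in."""
    if path_exists('/scratch1/NCEPDEV'):
        return 'HERA'
    elif path_exists('/gpfs') and path_exists('/etc/SuSE-release'):
        return 'WCOSS_C'
    elif path_exists('/gpfs/dell2'):
        return 'WCOSS_DELL_P3'
    raise NotImplementedError('Cannot auto-detect platform, ABORT!')


# Hypothetical views of the filesystem from each phase. They are identical
# because all /gpfs/dell# directories are visible from both.
phase3_visible = {'/gpfs', '/gpfs/dell2', '/gpfs/dell3', '/gpfs/dell6'}
phase3p5_visible = {'/gpfs', '/gpfs/dell2', '/gpfs/dell3', '/gpfs/dell6'}

on_phase3 = detect_machine(phase3_visible.__contains__)
on_phase3p5 = detect_machine(phase3p5_visible.__contains__)
```

Both calls return the same machine name, so any distinction between the phases has to come from something other than these paths.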

KateFriedman-NOAA commented 4 years ago

FYI George V just submitted a WCOSS helpdesk ticket to ask them how to differentiate between the phases.

yangfanglin commented 4 years ago

We probably do not need to define a new "machine" for phase 3.5 in the workflow. Rather, the different queues offered by the same machine can be used to identify the resources. Phase 3.5 uses the dev2, devmax2, devonprod2, and dev2_transfer queues. To achieve this goal, the python setup scripts need to be updated. The parameter npe_node_max can be set in config.base based on the queue definition.

Fanglin
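
A minimal sketch of the queue-based approach described above, assuming the phase 3.5 queue names listed in this comment, 40 tasks per node on phase 3.5, and 28 on phase 3 (the figure quoted later in this thread). The names `npe_node_max_for_queue` and `nodes_needed` are hypothetical and not part of the workflow, where the value would instead be set in config.base.

```python
# Phase 3.5 queue names as listed in the comment above.
PHASE_3P5_QUEUES = {"dev2", "devmax2", "devonprod2", "dev2_transfer"}

# Tasks per node, as quoted in this thread.
PPN_PHASE_3P5 = 40
PPN_PHASE_3 = 28


def npe_node_max_for_queue(queue):
    """Return the tasks-per-node limit implied by a WCOSS-Dell queue name.

    Any queue that is not a known phase 3.5 queue is treated as phase 3
    (an assumption made for this sketch).
    """
    return PPN_PHASE_3P5 if queue in PHASE_3P5_QUEUES else PPN_PHASE_3


def nodes_needed(ntasks, queue):
    """Whole nodes needed to place ntasks on the given queue (ceiling division)."""
    ppn = npe_node_max_for_queue(queue)
    return -(-ntasks // ppn)
```

For example, a 280-task job would need 7 nodes on dev2 but 10 nodes on a phase 3 queue.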

KateFriedman-NOAA commented 4 years ago

Created branch "port2wcoss3p5" off of develop branch. Will make changes to submit to phase 3.5 queues (dev2/devonprod2).

KateFriedman-NOAA commented 4 years ago

Added a "--partition" flag to the setup_expt*.py scripts. The flag is optional and has no effect on WCOSS-Cray or Hera; it only takes effect on WCOSS-Dell when given the value "3p5". If the flag is omitted on WCOSS-Dell, the setup defaults to phase 3.

To use on WCOSS-Dell: --partition 3p5

When the partition flag is used with a value of "3p5", it sets QUEUE to "dev2" and QUEUE_ARCH to "dev2_transfer". When the setup_workflow*.py scripts are then run, they see that QUEUE is set to a phase 3.5 queue and use the phase 3.5 value ppn=40 for resource calculations instead of phase 3's ppn=28. QUEUE can also be set to the other phase 3.5 queues, "devonprod2" or "devmax2", before running setup_workflow*.py.
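
The behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the setup_expt*.py scripts: the helper `queue_overrides` and the reduced argument list are hypothetical, and only the `--partition` flag, the "3p5" value, and the QUEUE/QUEUE_ARCH settings come from this comment.

```python
import argparse


def build_parser():
    """Reduced, hypothetical version of the setup script's argument parser."""
    parser = argparse.ArgumentParser(description="Sketch of an experiment setup CLI")
    parser.add_argument("--pslot", required=True, help="experiment name")
    # Optional flag: omitting it leaves the phase 3 defaults in place.
    parser.add_argument("--partition", default=None,
                        help='set to "3p5" to target WCOSS-Dell phase 3.5')
    return parser


def queue_overrides(machine, partition):
    """Return the queue settings to override, or an empty dict for no change.

    Only WCOSS-Dell with partition "3p5" changes anything; on other
    machines the flag is ignored.
    """
    if machine == "WCOSS_DELL_P3" and partition == "3p5":
        return {"QUEUE": "dev2", "QUEUE_ARCH": "dev2_transfer"}
    return {}


args = build_parser().parse_args(["--pslot", "testcyc", "--partition", "3p5"])
overrides = queue_overrides("WCOSS_DELL_P3", args.partition)
```

Returning an empty dict when nothing applies keeps the existing defaults untouched, which matches the statement that the flag does nothing on WCOSS-Cray or Hera.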

Example for cycled mode:

./setup_expt.py --pslot testcyc --configdir /gpfs/dell2/emc/modeling/save/Kate.Friedman/git/global-workflow/port2wcoss3p5/parm/config --idate 2020010212 --edate 2020010218 --expdir /gpfs/dell2/emc/modeling/save/Kate.Friedman/expdir --comrot /gpfs/dell3/ptmp/Kate.Friedman/comrot --resdet 768 --resens 384 --nens 80 --gfs_cyc 4 --partition 3p5

Example for free-forecast mode:

./setup_expt_fcstonly.py --pslot testff --configdir /gpfs/dell2/emc/modeling/save/Kate.Friedman/git/global-workflow/port2wcoss3p5/parm/config --idate 2020010212 --edate 2020010218 --expdir /gpfs/dell2/emc/modeling/save/Kate.Friedman/expdir --comrot /gpfs/dell3/ptmp/Kate.Friedman/comrot --res 192 --gfs_cyc 4 --partition 3p5

KateFriedman-NOAA commented 4 years ago

Tested setup scripts on WCOSS-Dell, WCOSS-Cray, and Hera, with and without --partition flag. No issues.

KateFriedman-NOAA commented 4 years ago

PR complete and port2wcoss3p5 branch deleted. Closing issue. Will reopen for additional updates as needed.