NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

workflow/hosts.py returns hercules on orion following rocky 9 upgrade #2695

Open RussTreadon-NOAA opened 1 week ago

RussTreadon-NOAA commented 1 week ago

What is wrong?

Following the Orion Rocky 9 upgrade, workflow/hosts.py returns machine=HERCULES when executed on Orion.

What should have happened?

workflow/hosts.py should return machine=ORION when executed on Orion.

What machines are impacted?

Orion

Steps to reproduce

  1. clone g-w develop on Orion
  2. write short script to execute Host()

    #!/usr/bin/env python3
    import os
    import socket
    import platform
    from hosts import Host

    def main():
        host = Host()
        print(f" ")
        print(f"Host() is {host}")
        print(f" ")

    if __name__ == '__main__':
        main()

  3. execute script on Orion and get

    orion-login-4:/work2/noaa/da/rtreadon/git/global-workflow/develop/workflow$ ./test.py
    machine is HERCULES
    info is {'BASE_GIT': '/work/noaa/global/glopara/git_rocky9', 'DMPDIR': '/work/noaa/rstprod/dump', 'BASE_CPLIC': '/work/noaa/global/glopara/data/ICSDIR/prototype_ICs', 'PACKAGEROOT': '/work/noaa/global/glopara/nwpara', 'COMINsyn': '/work/noaa/global/glopara/com/gfs/prod/syndat', 'HOMEDIR': '/work/noaa/global/${USER}', 'STMP': '/work/noaa/stmp/${USER}/HERCULES', 'PTMP': '/work/noaa/stmp/${USER}/HERCULES', 'NOSCRUB': '$HOMEDIR', 'SCHEDULER': 'slurm', 'ACCOUNT': 'fv3-cpu', 'ACCOUNT_SERVICE': 'fv3-cpu', 'QUEUE': 'batch', 'QUEUE_SERVICE': 'batch', 'PARTITION_BATCH': 'hercules', 'PARTITION_SERVICE': 'service', 'RESERVATION': '', 'CHGRP_RSTPROD': 'YES', 'CHGRP_CMD': 'chgrp rstprod', 'HPSSARCH': 'NO', 'HPSS_PROJECT': 'emc-global', 'LOCALARCH': 'NO', 'ATARDIR': '${NOSCRUB}/archive_rotdir/${PSLOT}', 'MAKE_NSSTBUFR': 'NO', 'MAKE_ACFTBUFR': 'NO', 'SUPPORTED_RESOLUTIONS': ['C1152', 'C768', 'C384', 'C192', 'C96', 'C48'], 'COMINecmwf': '/work/noaa/global/glopara/data/external_gempak/ecmwf', 'COMINnam': '/work/noaa/global/glopara/data/external_gempak/nam', 'COMINukmet': '/work/noaa/global/glopara/data/external_gempak/ukmet'}
    scheduler is slurm


### Additional information

The `detect` method in `hosts.py` contains

    elif os.path.exists('/work/noaa'):
        if os.path.exists('/apps/other'):
            machine = 'HERCULES'
        else:
            machine = 'ORION'

This logic no longer works on Orion following the Rocky 9 upgrade. The directory `/apps/other` now exists on Orion, so `machine` is set to `HERCULES`.

### Do you have a proposed solution?

g-w `ush/detect_machine.sh` has similar faulty logic

    elif [[ -d /work ]]; then
      # We are on MSU Orion or Hercules
      if [[ -d /apps/other ]]; then
        # We are on Hercules
        MACHINE_ID=hercules
      else
        MACHINE_ID=orion
      fi

Execution of the above on Orion now returns `MACHINE_ID=hercules`.   However, before this section of `detect_machine.sh`, the script uses `hostname -f` to also set `MACHINE_ID`.  For Orion and Hercules, the script has the lines

    Orion-login-[1-4].HPC.MsState.Edu) MACHINE_ID=orion ;; ### orion1-4
    [Hh]ercules-login-[1-4].[Hh][Pp][Cc].[Mm]s[Ss]tate.[Ee]du) MACHINE_ID=hercules ;; ### hercules1-4

This section of the script correctly sets `MACHINE_ID=orion`.

Python's `socket.gethostname()` and `platform.node()` both return the hostname. Add these to the test Python script

    #!/usr/bin/env python3

    import os
    import socket
    import platform
    from hosts import Host

    def main():
        host = Host()
        print(f" ")
        print(f"Host() is {host}")
        print(f" ")

        host = socket.gethostname()
        print(f" ")
        print(f"socket.gethostname() is {host}")
        print(f" ")

        host = platform.node()
        print(f" ")
        print(f"platform.node() is {host}")
        print(f" ")

    if __name__ == '__main__':
        main()

Execute on Orion and get

    machine is HERCULES
    info is {'BASE_GIT': '/work/noaa/global/glopara/git_rocky9', 'DMPDIR': '/work/noaa/rstprod/dump', 'BASE_CPLIC': '/work/noaa/global/glopara/data/ICSDIR/prototype_ICs', 'PACKAGEROOT': '/work/noaa/global/glopara/nwpara', 'COMINsyn': '/work/noaa/global/glopara/com/gfs/prod/syndat', 'HOMEDIR': '/work/noaa/global/${USER}', 'STMP': '/work/noaa/stmp/${USER}/HERCULES', 'PTMP': '/work/noaa/stmp/${USER}/HERCULES', 'NOSCRUB': '$HOMEDIR', 'SCHEDULER': 'slurm', 'ACCOUNT': 'fv3-cpu', 'ACCOUNT_SERVICE': 'fv3-cpu', 'QUEUE': 'batch', 'QUEUE_SERVICE': 'batch', 'PARTITION_BATCH': 'hercules', 'PARTITION_SERVICE': 'service', 'RESERVATION': '', 'CHGRP_RSTPROD': 'YES', 'CHGRP_CMD': 'chgrp rstprod', 'HPSSARCH': 'NO', 'HPSS_PROJECT': 'emc-global', 'LOCALARCH': 'NO', 'ATARDIR': '${NOSCRUB}/archive_rotdir/${PSLOT}', 'MAKE_NSSTBUFR': 'NO', 'MAKE_ACFTBUFR': 'NO', 'SUPPORTED_RESOLUTIONS': ['C1152', 'C768', 'C384', 'C192', 'C96', 'C48'], 'COMINecmwf': '/work/noaa/global/glopara/data/external_gempak/ecmwf', 'COMINnam': '/work/noaa/global/glopara/data/external_gempak/nam', 'COMINukmet': '/work/noaa/global/glopara/data/external_gempak/ukmet'}
    scheduler is slurm

    Host() is <hosts.Host object at 0x7f4107a45fd0>

    socket.gethostname() is orion-login-4.hpc.msstate.edu

    platform.node() is orion-login-4.hpc.msstate.edu



Can we use `socket.gethostname()` or `platform.node()` to return the login hostname and from this set `machine` accordingly?
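
A minimal sketch of what hostname-based detection could look like, assuming login-node names of the form `orion-login-N.hpc.msstate.edu` and `hercules-login-N.hpc.msstate.edu` as seen in the output and the `detect_machine.sh` patterns in this issue. The function name `machine_from_hostname` and the regular expressions are hypothetical and are not taken from `hosts.py`. The function returns `None` for any name it does not recognize, so a caller could fall back to another check.

```python
import re
from typing import Optional

# Hypothetical patterns modeled on the login-node names shown in this issue.
# Matching is case-insensitive because the shell patterns mix upper and lower case.
_LOGIN_PATTERNS = {
    "ORION": re.compile(r"^orion-login-\d+\.hpc\.msstate\.edu$", re.IGNORECASE),
    "HERCULES": re.compile(r"^hercules-login-\d+\.hpc\.msstate\.edu$", re.IGNORECASE),
}


def machine_from_hostname(hostname: str) -> Optional[str]:
    """Return 'ORION' or 'HERCULES' for a recognized login hostname, else None."""
    for machine, pattern in _LOGIN_PATTERNS.items():
        if pattern.match(hostname):
            return machine
    # Unrecognized name: leave the decision to some other detection method.
    return None
```

It could be called as `machine_from_hostname(socket.gethostname())` after `import socket`.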

aerorahul commented 1 week ago

Thanks @RussTreadon-NOAA. The filesystem-based machine detection in `detect_machine.sh` is there to aid detection on compute nodes, where `hostname -f` does not always return the same string as on a login node.
Using `socket` is fine in `hosts.py`. Would you have a solution for the part of `detect_machine.sh` that runs on a compute node?

RussTreadon-NOAA commented 1 week ago

Thank you @aerorahul for the information. I was unaware of this fact. I do not have a solution for the compute node section of detect_machine.sh

DavidHuber-NOAA commented 1 week ago

We may be able to grep /etc/fstab (or the output of df) for orion-nfs or hercules-nfs to discern between the two machines.

DavidHuber-NOAA commented 1 week ago

Better yet:

    [[ $(findmnt -n -o SOURCE /home) =~ "hercules" ]] && echo "Hercules"
    [[ $(findmnt -n -o SOURCE /home) =~ "orion" ]] && echo "Orion"
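
The same idea could be mirrored in Python. This is only a sketch, assuming (as the comments above suggest) that the source of the `/home` mount contains `orion` or `hercules`. The function names are hypothetical, and unlike the shell test, this version matches case-insensitively.

```python
import subprocess
from typing import Optional


def classify_mount_source(source: str) -> Optional[str]:
    """Map a mount source string to a machine name, or None if unrecognized."""
    lowered = source.lower()
    if "hercules" in lowered:
        return "HERCULES"
    if "orion" in lowered:
        return "ORION"
    return None


def machine_from_home_mount() -> Optional[str]:
    """Ask findmnt for the source of /home and classify it."""
    try:
        result = subprocess.run(
            ["findmnt", "-n", "-o", "SOURCE", "/home"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # findmnt is not available on this system.
        return None
    # Empty output (for example, /home is not a mount point) classifies as None.
    return classify_mount_source(result.stdout.strip())
```

Keeping the string classification separate from the `findmnt` call makes it possible to check the matching logic on a machine that is neither Orion nor Hercules.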