NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Migrate Jet to /lfs5 #2841

Open KateFriedman-NOAA opened 4 weeks ago

KateFriedman-NOAA commented 4 weeks ago

What new functionality do you need?

The /lfs4 filesystem has become unusable and users need to migrate to /lfs5. The global-workflow, libraries, and components (both internal and external) will need to be updated to use /lfs5.

What are the requirements for the new functionality?

The following need to be updated to use /lfs5:

The following need to be updated to use the migrated spack-stack install before global-workflow can be fully migrated:

Acceptance Criteria

All components build and run on /lfs5

Suggest a solution (optional)

No response

DavidHuber-NOAA commented 4 weeks ago

@InnocentSouopgui-NOAA Would you be able to help with this after spack-stack has been installed on /lfs5?

DavidHuber-NOAA commented 4 weeks ago

The spack-stack installation is being tracked here: https://github.com/JCSDA/spack-stack/issues/1250

InnocentSouopgui-NOAA commented 4 weeks ago

Sure, I will do that. I am already tracking the migration of spack-stack on Jet.

KateFriedman-NOAA commented 4 weeks ago

Thanks @InnocentSouopgui-NOAA ! I can take care of Fit2Obs, obsproc, and prepobs. I can probably help with other components too after those are done. I am already working on unrelated updates to obsproc and prepobs so I'll fold the Jet updates into those efforts.

KateFriedman-NOAA commented 3 weeks ago

A new spack-stack/1.6.0 install is now available under /contrib on Jet (equivalent to the gsi-addon-dev env we had before): /contrib/spack-stack/spack-stack-1.6.0/envs/gsi-addon-intel/install/modulefiles/Core

InnocentSouopgui-NOAA commented 3 weeks ago

@KateFriedman-NOAA, where are you with the external dependencies? I built all the other components of Global Workflow (those that get built with the build_all.sh script inside sorc), and want to start testing the cycling.

KateFriedman-NOAA commented 2 weeks ago

@InnocentSouopgui-NOAA Fit2Obs is done and installed on Jet here (note the new v1.1.3 version): /lfs5/HFIP/hfv3gfs/glopara/git/Fit2Obs/v1.1.3

Obsproc is in review (see https://github.com/NOAA-EMC/obsproc/pull/92). We'll be going to v1.2 with this. I will let you know when it is installed on Jet.

I am planning to work on prepobs today and combine the work with our move to the new v1.1.0 version that went into ops. Will also install this on Jet when ready and inform you.

KateFriedman-NOAA commented 2 weeks ago

@InnocentSouopgui-NOAA Updated obsproc/v1.2 is now installed on Jet: /lfs5/HFIP/hfv3gfs/glopara/git/obsproc/v1.2.0

I will have a PR shortly that will update develop to v1.2 everywhere, so you can either ingest that from develop into a branch you're using or update obsproc_run_ver=1.2.0 in versions/run.spack.ver in advance.

KateFriedman-NOAA commented 2 weeks ago

From Jet admins:

Jet /lfs4 migration to /lfs5 
Due to ongoing issues with /lfs4 we request that all users migrate their active data from /lfs4 to /lfs5 
before next Wednesday 9/4. All projects that have quota on /lfs4 have been given quota on /lfs5.  
Weather permitting, on Wednesday 9/4 we plan to have an /lfs4 test outage from 1000  to ~1600 MT
to verify all /lfs4 dependences have been removed. If this test outage is successful we plan to make 
/lfs4 read only for 2 more weeks, then unmounting it ~9/17.

KateFriedman-NOAA commented 2 weeks ago

@InnocentSouopgui-NOAA Fit2Obs, obsproc, and prepobs are now ready and installed on Jet. See the checklist in the main issue comment for paths. You'll need to update fit2obs_ver=1.1.3, obsproc_run_ver=1.2.0, and prepobs_run_ver=1.1.0 in versions/run.spack.ver to use them. Also update BASE_GIT in workflow/hosts/jet.yaml to be /lfs5/HFIP/hfv3gfs/glopara/git.
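
The version bumps listed above could be scripted rather than edited by hand. The following is a minimal sketch, assuming versions/run.spack.ver holds lines of the form `export name=value` (an assumption about the file layout that this thread does not confirm). `set_ver` is a hypothetical helper, not part of global-workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite an existing "export NAME=..." line to NAME=VALUE.
# Assumes the version file uses lines of the form `export name=value`.
set_ver() {
  local file="$1" name="$2" value="$3"

  # Fail loudly if the variable is absent instead of silently changing nothing.
  if ! grep -q "^export ${name}=" "${file}"; then
    echo "set_ver: ${name} not found in ${file}" >&2
    return 1
  fi

  # Write to a temporary file and move it into place, which avoids relying
  # on the non-portable in-place option of sed.
  sed "s|^export ${name}=.*|export ${name}=${value}|" "${file}" > "${file}.tmp" \
    && mv "${file}.tmp" "${file}"
}
```

Under that assumption, `set_ver versions/run.spack.ver fit2obs_ver 1.1.3` would apply the first of the three updates; the other two follow the same pattern.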

InnocentSouopgui-NOAA commented 2 weeks ago

Thanks Kate. With this I will start testing the whole Global Workflow system.

InnocentSouopgui-NOAA commented 2 weeks ago

@KateFriedman-NOAA, there is an environment variable in the GSI module file that references a space on /lfs4, see below. Are you in charge of that one as well?

pushenv("GSI_BINARY_SOURCE_DIR", "/mnt/lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix/gsi/20240208")
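
Leftover references like this one can be found with a recursive search. A minimal sketch, assuming GNU grep is available; `find_lfs4_refs` is a hypothetical helper name, and the pattern also matches the /mnt/lfs4 form of the path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every line under DIR that still mentions /lfs4.
# Returns non-zero when nothing matches (standard grep behaviour).
find_lfs4_refs() {
  local dir="${1:-.}"
  # -r recurse, -n show line numbers, -I skip binary files;
  # --exclude-dir=.git keeps git metadata out of the results.
  grep -rnI --exclude-dir=.git '/lfs4' "${dir}"
}
```

Running `find_lfs4_refs modulefiles` from a checkout would print each match as file:line:text.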

KateFriedman-NOAA commented 2 weeks ago

If you're talking about the fix/gsi/20240208 folder then yes. That should now be /lfs5/HFIP/hfv3gfs/glopara/FIX/fix/gsi/20240208. Ideally though that shouldn't be a hardcoded path. @RussTreadon-NOAA is there a way to make this not hardcoded?

RussTreadon-NOAA commented 2 weeks ago

@InnocentSouopgui-NOAA , please follow the change control procedure described under GSI: How to Make Changes to update the GSI_BINARY_SOURCE_DIR path in modulefiles/gsi_jet.intel.lua. (FYI, I am not the GSI code manager. We do not have a GSI code manager.)

@KateFriedman-NOAA , GSI_BINARY_SOURCE_DIR was added when EIB split the gerrit GSI fix into ASCII & Binary files. ASCII files are managed via a GSI-fix submodule hash included as GSI fix. Binary fix files are managed in EIB space. Fortunately GSI binary fix files do not change frequently. I don't have a good suggestion as to how to remove a fixed path for GSI_BINARY_SOURCE_DIR.

A softening of the hardcoded path approach would be to have GSI_BINARY_SOURCE_DIR point at an EIB-maintained link. The link points at the most recent set of GSI binary fix files. For example, GSI modulefiles/gsi_hera.intel.lua could be updated to read

pushenv("GSI_BINARY_SOURCE_DIR", "/scratch1/NCEPDEV/global/glopara/fix/gsi/latest")

where latest is a soft link currently pointing at /scratch1/NCEPDEV/global/glopara/fix/gsi/20240208.

This way EIB can change the directory that the soft link latest points at without any changes to GSI_BINARY_SOURCE_DIR.

The disadvantage of this approach is that it is not readily apparent to GSI developers which snapshot of the GSI binary fix files they are using. Also, we still have a hardcoded path for GSI_BINARY_SOURCE_DIR ... it's just that we don't need to change the date string when GSI binary fix files are updated.
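
The soft-link idea described above can be sketched in shell. This is only an illustration, assuming GNU coreutils (`ln -n`, `readlink -f`); `point_latest` and `resolve_latest` are hypothetical helper names, not existing EIB tooling. The second helper speaks to the stated disadvantage by printing the snapshot the link currently resolves to.

```shell
#!/usr/bin/env bash
# Hypothetical helper: repoint FIX_ROOT/latest at FIX_ROOT/SNAPSHOT.
point_latest() {
  local fix_root="$1" snapshot="$2"
  if [[ ! -d "${fix_root}/${snapshot}" ]]; then
    echo "point_latest: no such snapshot: ${snapshot}" >&2
    return 1
  fi
  # -s symbolic link, -f replace an existing link,
  # -n treat an existing link to a directory as a file instead of descending into it.
  ln -sfn "${snapshot}" "${fix_root}/latest"
}

# Hypothetical helper: print the directory that FIX_ROOT/latest resolves to.
resolve_latest() {
  readlink -f "$1/latest"
}
```

With these, `point_latest /scratch1/NCEPDEV/global/glopara/fix/gsi 20240208` would set the link, and `resolve_latest` on the same directory would report which dated snapshot is in use.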

KateFriedman-NOAA commented 2 weeks ago

@RussTreadon-NOAA Thanks for the refresher on how GSI_BINARY_SOURCE_DIR came to be what it is. Sounds like updating the hardcoded path to the new hardcoded path is the easiest thing right now. I'm not a fan of the latest symlink for the disadvantage you outlined.

InnocentSouopgui-NOAA commented 2 weeks ago

@KateFriedman-NOAA, a couple of things that require your attention

  • /parm/config/gfs/config.aero points to /lfs4/HFIP/hfv3gfs/glopara/data, which has not yet moved: AERO_INPUTS_DIR="/lfs4/HFIP/hfv3gfs/glopara/data/gocart_emissions"
  • You must be busy, but whenever you can, please provide TC_tracker too. I need this for a full test of global workflow.

InnocentSouopgui-NOAA commented 2 weeks ago

@KateFriedman-NOAA, just a thought here. Would it be better to have the external dependencies of Global Workflow on /contrib so that they are storage independent? That just crossed my mind.

InnocentSouopgui-NOAA commented 2 weeks ago

@KateFriedman-NOAA, other data need to move. In summary, I think the whole /lfs4/HFIP/hfv3gfs/glopara should be copied.

All the following set in workflow/hosts/jet.yaml reference a subdirectory of /lfs4/HFIP/hfv3gfs/glopara

KateFriedman-NOAA commented 2 weeks ago

@InnocentSouopgui-NOAA Here is the status of the various glopara folders moving from /lfs4 to /lfs5:

You don't need to worry about the other COMINs for gempak, we don't run/support gempak outside of WCOSS2. Same goes for PACKAGEROOT/nwpara folder, you can ignore that.

will it be better to have the external dependencies of Global Workflow on /contrib so that they are storage independent?

We don't have access to install on /contrib. We're also considering making the external packages into submodules of g-w develop so that would moot the need to install them.

please provide TC_tracker too

Will do, stay tuned...

InnocentSouopgui-NOAA commented 2 weeks ago

@KateFriedman-NOAA , can you add /lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs to the list of data to move?

KateFriedman-NOAA commented 2 weeks ago

@InnocentSouopgui-NOAA What do you need from that location? The fix files were under there; I moved them to here on /lfs5: /lfs5/HFIP/hfv3gfs/glopara/FIX/fix

Last night I copied /lfs4/HFIP/hfv3gfs/glopara/git to here /lfs5/HFIP/hfv3gfs/glopara/git_lfs4. Just for safe keeping while we sort out the new space, I plan to remove this folder when done.

InnocentSouopgui-NOAA commented 2 weeks ago

Thank you @KateFriedman-NOAA . I missed the fact that you already relocated fix to /lfs5/HFIP/hfv3gfs/glopara/FIX/fix It is just what I needed. Thanks again.

InnocentSouopgui-NOAA commented 1 week ago

I am having a problem with cleanup jobs after a few cycles. After 24 hours (4 cycles of ENKF), all cleanup jobs start failing with the following message:

+ exglobal_cleanup.sh[46]: find_exclude_string+=' -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *atmanl.nc -or'
+ exglobal_cleanup.sh[49]: find_exclude_string=' -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *atmanl.nc '
+ exglobal_cleanup.sh[52]: find /lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/com/test46/gdas.20211222/18 -type f -not '(' -name '*prepbufr*' -or -name '*prepbufr*' -or -name '*cnvstat*' -or -name '*prepbufr*' -or -name '*prepbufr*' -or -name '*cnvstat*' -or -name '*atmanl.nc' ')' -delete
find: Failed to save initial working directory: No such file or directory
+ exglobal_cleanup.sh[1]: postamble exglobal_cleanup.sh 1725475572 1
+ preamble.sh[70]: set +x
End exglobal_cleanup.sh at 18:46:15 with error code 1 (time elapsed: 00:00:03)
+ JGLOBAL_CLEANUP[1]: postamble JGLOBAL_CLEANUP 1725475561 1
+ preamble.sh[70]: set +x
End JGLOBAL_CLEANUP at 18:46:15 with error code 1 (time elapsed: 00:00:14)
+ cleanup.sh[1]: postamble cleanup.sh 1725475554 1
+ preamble.sh[70]: set +x
End cleanup.sh at 18:46:15 with error code 1 (time elapsed: 00:00:21)

DavidHuber-NOAA commented 1 week ago

I'm working on a fix for this. PR coming shortly.

InnocentSouopgui-NOAA commented 1 week ago

So can we ignore the problem for now, and move on with other testing in the migration?

DavidHuber-NOAA commented 1 week ago

Yes, I think so.

DavidHuber-NOAA commented 1 week ago

PR open: https://github.com/NOAA-EMC/global-workflow/pull/2893
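
For context on the cleanup log earlier in the thread: GNU find can report "Failed to save initial working directory" when it is unable to record the directory it was started from, for example because that directory has since been removed. One defensive pattern is to change into the target directory in a subshell before calling find. This is a minimal sketch, not necessarily what the PR does; `purge_except` is a hypothetical helper and the name patterns are copied from the log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete every regular file under TARGET except the
# prepbufr, cnvstat and atmanl.nc files named in the cleanup log.
purge_except() {
  local target="$1"
  (
    # Subshell: the cd gives find a valid starting directory and
    # does not change the caller's working directory.
    cd "${target}" || exit 1
    find . -type f -not \( -name '*prepbufr*' -or -name '*cnvstat*' \
      -or -name '*atmanl.nc' \) -delete
  )
}
```

A call such as `purge_except "${COMIN}"` (with `COMIN` standing in for the cycle directory, an assumed variable name) returns the exit status of find, or 1 if the directory cannot be entered.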

InnocentSouopgui-NOAA commented 1 week ago

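What should we do about verif-global? It still depends on hpc-stack.

@malloryprow Also, it looks for data in spaces on /lfs4. Can you move those data to /lfs5? Especially:

  • /lfs4/HFIP/hfv3gfs/Mallory.Row/archive
  • /lfs4/HFIP/hfv3gfs/Mallory.Row/prepbufr
  • /lfs4/HFIP/hfv3gfs/Mallory.Row/obdata/ccpa_accum24hr

I am opening an issue on verif-global.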

DavidHuber-NOAA commented 1 week ago

The statistics generated by verif-global during the execution of the global-workflow should run without loading hpc-stack modules. If that's not the case for Jet, then something is wrong.

However, the standalone mode still references those modules. The plan is to update verif-global after the installation of spack-stack v1.8.0. Until then, standalone mode requires some sort of manual intervention. This is true on almost all platforms (except S4, IIRC).

InnocentSouopgui-NOAA commented 1 week ago

@DavidHuber-NOAA it was very suspicious to me as well. The whole global workflow ran without problems for more than 24 hours at resolutions C96/48 and C192/96. At resolution C384/192, it ran the first 00Z cycle and failed on the second 00Z cycle. The failing task is gfsmetpg2o1. That is what prompted me to look around, and I found the warnings for missing files.

I can't figure out why the task gfsmetpg2o1 failed in the first place. When you have a minute, you can check it out at /lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/expe/test46 and /lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/com/test46

malloryprow commented 1 week ago

What should we do of verif-global? It still depends on hpc-stack.

@malloryprow Also it looks for data from spaces. Can you move those data to /lfs5 ? especially:

  • /lfs4/HFIP/hfv3gfs/Mallory.Row/archive
  • /lfs4/HFIP/hfv3gfs/Mallory.Row/prepbufr
  • /lfs4/HFIP/hfv3gfs/Mallory.Row/obdata/ccpa_accum24hr

I am opening an issue on verif-global.

I copied over the data.

InnocentSouopgui-NOAA commented 3 days ago

@KateFriedman-NOAA , don't forget about TC_Tracker, we don't have it yet. I am using a personal version for all the tests.

KateFriedman-NOAA commented 2 days ago

@InnocentSouopgui-NOAA Please see the email thread with the tracker folks. I installed a copy of @HananehJafary-NOAA 's branch here on Jet for testing: /lfs5/HFIP/hfv3gfs/glopara/git/TC_tracker/test_tracker