Open KateFriedman-NOAA opened 4 weeks ago
@InnocentSouopgui-NOAA Would you be able to help with this after spack-stack has been installed on /lfs5?
The spack-stack installation is being tracked here: https://github.com/JCSDA/spack-stack/issues/1250
Sure, I will do that. I am already tracking the migration of spack-stack on Jet.
Thanks @InnocentSouopgui-NOAA ! I can take care of Fit2Obs, obsproc, and prepobs. I can probably help with other components too after those are done. I am already working on unrelated updates to obsproc and prepobs so I'll fold the Jet updates into those efforts.
A new spack-stack/1.6.0 install is now available under /contrib on Jet (equivalent to the gsi-addon-dev env we had before): /contrib/spack-stack/spack-stack-1.6.0/envs/gsi-addon-intel/install/modulefiles/Core
@KateFriedman-NOAA, where are you with the external dependencies? I have built all the other components of Global Workflow (those that get built with the build_all.sh script inside sorc) and want to start testing the cycling.
@InnocentSouopgui-NOAA Fit2Obs is done and installed on Jet here (note the new v1.1.3 version): /lfs5/HFIP/hfv3gfs/glopara/git/Fit2Obs/v1.1.3
Obsproc is in review (see https://github.com/NOAA-EMC/obsproc/pull/92). We'll be going to v1.2 with this. I will let you know when it is installed on Jet.
I am planning to work on prepobs today and combine the work with our move to the new v1.1.0 version that went into ops. Will also install this on Jet when ready and inform you.
@InnocentSouopgui-NOAA Updated obsproc/v1.2 is now installed on Jet: /lfs5/HFIP/hfv3gfs/glopara/git/obsproc/v1.2.0
I will have a PR shortly that will update develop to v1.2 everywhere, so you can either ingest that from develop into a branch you're using or update obsproc_run_ver=1.2.0 in run.spack.ver in advance.
From Jet admins:
Jet /lfs4 migration to /lfs5
Due to ongoing issues with /lfs4 we request that all users migrate their active data from /lfs4 to /lfs5
before next Wednesday 9/4. All projects that have quota on /lfs4 have been given quota on /lfs5.
Weather permitting, on Wednesday 9/4 we plan to have an /lfs4 test outage from 1000 to ~1600 MT
to verify all /lfs4 dependences have been removed. If this test outage is successful we plan to make
/lfs4 read only for 2 more weeks, then unmounting it ~9/17.
@InnocentSouopgui-NOAA Fit2Obs, obsproc, and prepobs are now ready and installed on Jet. See the checklist in the main issue comment for paths. You'll need to update fit2obs_ver=1.1.3, obsproc_run_ver=1.2.0, and prepobs_run_ver=1.1.0 in versions/run.spack.ver to use them. Also update BASE_GIT in workflow/hosts/jet.yaml to be /lfs5/HFIP/hfv3gfs/glopara/git.
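For anyone scripting the version bumps above, here is a minimal sketch in Python. It assumes the version file holds simple `name=value` lines, optionally prefixed with `export`; the real layout of versions/run.spack.ver may differ. The helper name `set_versions` and the old version numbers in the sample text are hypothetical and only there for illustration.

```python
import re


def set_versions(text: str, updates: dict) -> str:
    """Return `text` with each `name=value` assignment in `updates` rewritten.

    Assumes one assignment per line, optionally prefixed with `export`.
    Raises KeyError if a requested name is not present in the text.
    """
    for name, value in updates.items():
        # Capture everything up to and including "name=" and replace the rest of the line.
        pattern = re.compile(
            rf"^(\s*(?:export\s+)?{re.escape(name)}=).*$", re.MULTILINE
        )
        text, count = pattern.subn(lambda m: m.group(1) + value, text)
        if count == 0:
            raise KeyError(f"{name} not found")
    return text


# Hypothetical starting content; the old version numbers are placeholders.
sample = (
    "export fit2obs_ver=0.0.0\n"
    "export obsproc_run_ver=0.0.0\n"
    "export prepobs_run_ver=0.0.0\n"
)

updated = set_versions(
    sample,
    {
        "fit2obs_ver": "1.1.3",
        "obsproc_run_ver": "1.2.0",
        "prepobs_run_ver": "1.1.0",
    },
)
print(updated)
```

Raising on a missing name is a deliberate choice here: a silent no-op would leave an old version pinned without anyone noticing.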
Thanks Kate. With this I will start testing the whole Global Workflow system.
@KateFriedman-NOAA, there is an environment variable in the GSI module file that references a space on /lfs4; see below. Are you in charge of that one as well?
pushenv("GSI_BINARY_SOURCE_DIR", "/mnt/lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix/gsi/20240208")
If you're talking about the fix/gsi/20240208 folder then yes. That should now be /lfs5/HFIP/hfv3gfs/glopara/FIX/fix/gsi/20240208. Ideally though that shouldn't be a hardcoded path. @RussTreadon-NOAA is there a way to make this not hardcoded?
@InnocentSouopgui-NOAA, please follow the change control procedure described under GSI: How to Make Changes to update the GSI_BINARY_SOURCE_DIR path in modulefiles/gsi_jet.intel.lua. (FYI, I am not the GSI code manager. We do not have a GSI code manager.)
@KateFriedman-NOAA, GSI_BINARY_SOURCE_DIR was added when EIB split the gerrit GSI fix into ASCII & Binary files. ASCII files are managed via a GSI-fix submodule hash included as GSI fix. Binary fix files are managed in EIB space. Fortunately GSI binary fix files do not change frequently. I don't have a good suggestion as to how to remove a fixed path for GSI_BINARY_SOURCE_DIR.
A softening of the hardcoded path approach would be to have GSI_BINARY_SOURCE_DIR point at an EIB-maintained link. The link points at the most recent set of GSI binary fix files. For example, GSI modulefiles/gsi_hera.intel.lua could be updated to read
pushenv("GSI_BINARY_SOURCE_DIR", "/scratch1/NCEPDEV/global/glopara/fix/gsi/latest")
where latest is a soft link currently pointing at /scratch1/NCEPDEV/global/glopara/fix/gsi/20240208.
This way EIB can change the directory that the soft link latest points at without any changes to GSI_BINARY_SOURCE_DIR.
The disadvantage of this approach is that it is not readily apparent to GSI developers which snapshot of the GSI binary fix files they are using. Also, we still have a hardcoded path for GSI_BINARY_SOURCE_DIR ... it's just that we don't need to change the date string when GSI binary fix files are updated.
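If a symlink like this were adopted, the snapshot in use could still be recovered by resolving the link. The sketch below is only an illustration under that assumption: it builds a throwaway directory with a dated folder and a `latest` link, then reports which dated folder the link resolves to. The helper name `resolve_snapshot` is hypothetical and not part of GSI or global-workflow.

```python
import os
import tempfile


def resolve_snapshot(link_path: str) -> str:
    """Return the name of the directory that a `latest`-style symlink resolves to."""
    return os.path.basename(os.path.realpath(link_path))


# Build a temporary stand-in for the fix/gsi layout described above.
with tempfile.TemporaryDirectory() as root:
    snapshot_dir = os.path.join(root, "20240208")
    os.mkdir(snapshot_dir)

    latest = os.path.join(root, "latest")
    os.symlink(snapshot_dir, latest)

    resolved_name = resolve_snapshot(latest)
    print(resolved_name)
```

Logging the resolved name at the start of a run would record which snapshot was used, although it does not remove the hardcoded path itself.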
@RussTreadon-NOAA Thanks for the refresher on how GSI_BINARY_SOURCE_DIR came to be what it is. Sounds like updating the hardcoded path to the new hardcoded path is the easiest thing right now. I'm not a fan of the latest symlink because of the disadvantage you outlined.
@KateFriedman-NOAA, a couple of things that require your attention
@KateFriedman-NOAA, just a thought here: would it be better to have the external dependencies of Global Workflow on /contrib so that they are storage independent? That just crossed my mind.
@KateFriedman-NOAA, other data need to move. In summary, I think the whole /lfs4/HFIP/hfv3gfs/glopara should be copied.
All of the following settings in workflow/hosts/jet.yaml reference a subdirectory of /lfs4/HFIP/hfv3gfs/glopara
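A quick way to spot settings that still point at the old filesystem is to scan the host file for /lfs4. The following is a rough sketch that treats the file as plain `key: value` lines rather than parsing full YAML. The function name `find_stale_paths` and the key `EXAMPLE_COM_DIR` are hypothetical; only BASE_GIT and the two paths come from this thread.

```python
def find_stale_paths(text: str, old_root: str = "/lfs4"):
    """Return (line number, key, value) for each `key: value` line mentioning old_root.

    This is a plain-text scan, not a YAML parser: it ignores nesting and
    treats anything after a `#` as a comment.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        content = line.split("#", 1)[0]
        if ":" not in content:
            continue
        key, value = content.split(":", 1)
        if old_root in value:
            hits.append((lineno, key.strip(), value.strip().strip("'\"")))
    return hits


# Hypothetical excerpt of a host file; EXAMPLE_COM_DIR is a made-up key.
sample_host_file = (
    "BASE_GIT: '/lfs5/HFIP/hfv3gfs/glopara/git'\n"
    "EXAMPLE_COM_DIR: '/lfs4/HFIP/hfv3gfs/glopara/com'\n"
)

stale = find_stale_paths(sample_host_file)
for lineno, key, value in stale:
    print(f"line {lineno}: {key} -> {value}")
```

Because the match is a substring test, paths written as /mnt/lfs4/... are reported as well.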
@InnocentSouopgui-NOAA Here is the status of the various glopara folders moving from /lfs4 to /lfs5:
/lfs5/HFIP/hfv3gfs/glopara/dump, it's quite large so it will take a few days if not more
/lfs5/HFIP/hfv3gfs/glopara/data/ICSDIR
(/lfs4/HFIP/hfv3gfs/glopara/com) - now in place /lfs5/HFIP/hfv3gfs/glopara/com
You don't need to worry about the other COMINs for gempak; we don't run/support gempak outside of WCOSS2. The same goes for the PACKAGEROOT/nwpara folder; you can ignore that.
will it be better to have the external dependencies of Global Workflow on /contrib so that they are storage independent?
We don't have access to install on /contrib. We're also considering making the external packages into submodules of g-w develop, so that would moot the need to install them.
please provide TC_tracker too
Will do, stay tuned...
@KateFriedman-NOAA, can you add /lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs to the list of data to move?
@InnocentSouopgui-NOAA What do you need from that location? The fix files were under there; I moved them to here on /lfs5: /lfs5/HFIP/hfv3gfs/glopara/FIX/fix
Last night I copied /lfs4/HFIP/hfv3gfs/glopara/git to here: /lfs5/HFIP/hfv3gfs/glopara/git_lfs4. This is just for safekeeping while we sort out the new space; I plan to remove this folder when done.
Thank you @KateFriedman-NOAA. I missed the fact that you already relocated fix to /lfs5/HFIP/hfv3gfs/glopara/FIX/fix. It is just what I needed.
Thanks again.
I am having a problem with cleanup jobs after a few cycles. After 24 hours (4 cycles of ENKF), all cleanup jobs start failing with the following message:
+ exglobal_cleanup.sh[46]: find_exclude_string+=' -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *atmanl.nc -or'
+ exglobal_cleanup.sh[49]: find_exclude_string=' -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *atmanl.nc '
+ exglobal_cleanup.sh[52]: find /lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/com/test46/gdas.20211222/18 -type f -not '(' -name '*prepbufr*' -or -name '*prepbufr*' -or -name '*cnvstat*' -or -name '*prepbufr*' -or -name '*prepbufr*' -or -name '*cnvstat*' -or -name '*atmanl.nc' ')' -delete
find: Failed to save initial working directory: No such file or directory
+ exglobal_cleanup.sh[1]: postamble exglobal_cleanup.sh 1725475572 1
+ preamble.sh[70]: set +x
End exglobal_cleanup.sh at 18:46:15 with error code 1 (time elapsed: 00:00:03)
+ JGLOBAL_CLEANUP[1]: postamble JGLOBAL_CLEANUP 1725475561 1
+ preamble.sh[70]: set +x
End JGLOBAL_CLEANUP at 18:46:15 with error code 1 (time elapsed: 00:00:14)
+ cleanup.sh[1]: postamble cleanup.sh 1725475554 1
+ preamble.sh[70]: set +x
End cleanup.sh at 18:46:15 with error code 1 (time elapsed: 00:00:21)
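Two things stand out in the trace above: the exclusion patterns are repeated several times, and find stops with "Failed to save initial working directory", a message GNU find typically prints when the directory it was started from no longer exists. The Python sketch below illustrates one way to guard against both, by de-duplicating the patterns and launching find from a directory known to exist. It is not the fix that went into the PR mentioned below; the function names and the placeholder path are hypothetical.

```python
import subprocess


def build_find_args(root: str, keep_patterns: list) -> list:
    """Build a `find` argv that deletes files under root except those matching keep_patterns.

    Patterns are de-duplicated while preserving their first-seen order.
    """
    unique = list(dict.fromkeys(keep_patterns))
    args = ["find", root, "-type", "f"]
    if unique:
        args += ["-not", "("]
        for index, pattern in enumerate(unique):
            if index > 0:
                args.append("-or")
            args += ["-name", pattern]
        args.append(")")
    args.append("-delete")
    return args


def run_cleanup(root: str, keep_patterns: list) -> int:
    """Run the cleanup with `root` as the working directory and return find's exit code."""
    # cwd=root starts find in a directory that exists (assuming root itself does),
    # rather than in whatever directory the calling job happened to be in.
    result = subprocess.run(build_find_args(root, keep_patterns), cwd=root, check=False)
    return result.returncode


# Placeholder path; the patterns mirror the repeated ones in the trace.
argv = build_find_args(
    "/path/to/com/gdas.20211222/18",
    ["*prepbufr*", "*prepbufr*", "*cnvstat*", "*prepbufr*", "*atmanl.nc"],
)
print(argv)
```

Passing the arguments as a list means no shell is involved, so the glob patterns reach find unexpanded.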
I'm working on a fix for this. PR coming shortly.
So can we ignore the problem for now, and move on with other testing in the migration?
Yes, I think so.
What should we do about verif-global? It still depends on hpc-stack.
@malloryprow Also, it looks for data in spaces on /lfs4. Can you move those data to /lfs5? Especially:
/lfs4/HFIP/hfv3gfs/Mallory.Row/archive
/lfs4/HFIP/hfv3gfs/Mallory.Row/prepbufr
/lfs4/HFIP/hfv3gfs/Mallory.Row/obdata/ccpa_accum24hr
I am opening an issue on verif-global.
The statistics generated by verif-global during the execution of the global-workflow should run without loading hpc-stack modules. If that's not the case for Jet, then something is wrong.
However, the standalone mode still references those modules. The plan is to update verif-global after the installation of spack-stack v1.8.0. Until then, standalone mode requires some sort of manual intervention. This is true on almost all platforms (except S4, IIRC).
@DavidHuber-NOAA It was very suspicious to me as well. The whole global workflow ran without problems for more than 24 hours at resolutions C96/48 and C192/96.
At resolution C384/192, it ran the first 00Z cycle and failed on the second 00Z cycle. The failing task is gfsmetpg2o1.
That is what prompted me to look around, and I found the warnings for missing files.
I can't figure out why the task gfsmetpg2o1 failed in the first place. When you have a minute, you can check it out at
/lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/expe/test46
/lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/com/test46
I copied over the data.
@KateFriedman-NOAA , don't forget about TC_Tracker, we don't have it yet. I am using a personal version for all the tests.
@InnocentSouopgui-NOAA Please see the email thread with the tracker folks. I installed a copy of @HananehJafary-NOAA 's branch here on Jet for testing: /lfs5/HFIP/hfv3gfs/glopara/git/TC_tracker/test_tracker
What new functionality do you need?
The /lfs4 filesystem has become unusable and users need to migrate to /lfs5. The global-workflow, libraries, and components (both internal and external) will need to be updated to use /lfs5.
What are the requirements for the new functionality?
The following need to be updated to use /lfs5:
/contrib/spack-stack/spack-stack-1.6.0/envs/gsi-addon-intel/install/modulefiles/Core
FIX_DIR - relocated to /lfs5/HFIP/hfv3gfs/glopara/FIX/fix
The following need to be updated to use the migrated spack-stack install before global-workflow can be fully migrated:
obsproc.v1.2.0-rd-gfsv17 tag cut and installed everywhere (Jet: /lfs5/HFIP/hfv3gfs/glopara/git/obsproc/v1.2.0)
prepobs.v1.1.0-rd-gfsv17 tag cut and installed everywhere (Jet: /lfs5/HFIP/hfv3gfs/glopara/git/prepobs/v1.1.0)
v1.1.3 tag cut and installed everywhere (Jet: /lfs5/HFIP/hfv3gfs/glopara/git/Fit2Obs/v1.1.3)
Acceptance Criteria
All components build and run on /lfs5
Suggest a solution (optional)
No response