NOAA-PSL / land-offline_workflow

Creative Commons Zero v1.0 Universal

Release v1.beta2 #25

Closed mark-a-potts closed 1 year ago

mark-a-potts commented 1 year ago

This PR includes updates to DA_update and is mostly designed to make the system more portable.

Currently, the submit_cycle.sh script is set to use settings_cycle_test and is expecting input data to be in the "inputs" directory that is in the parent directory of land-offline_workflow. Ultimately, I would like to separate the two, but this is an attempt to get everyone working from a common point and get testing underway.
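One way to separate the input location from the workflow checkout is a variable that defaults to the current layout but can be overridden. A minimal sketch, assuming LANDDAROOT is the parent directory described above; `LANDDA_INPUTS` and `resolve_inputs` are hypothetical names, not something the workflow defines:

```shell
#!/bin/sh
# Sketch only: LANDDA_INPUTS and resolve_inputs are hypothetical names.
# Prints the input directory: the override if it is set, otherwise the
# current convention of $LANDDAROOT/inputs. Requires LANDDAROOT to be set.
resolve_inputs() {
    if [ -z "${LANDDAROOT:-}" ]; then
        echo "LANDDAROOT is not set" >&2
        return 1
    fi
    echo "${LANDDA_INPUTS:-$LANDDAROOT/inputs}"
}
```

A script could then use `INPUTS=$(resolve_inputs) || exit 1` and keep working unchanged for anyone who still has the data next to the checkout.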

This has been tested with 2016 data and can be run as follows--

create a test directory that will be LANDDAROOT

cd land-test

clone the feature branch recursively

git clone -b feature/release-v1.beta2 --recursive https://github.com/NOAA-EPIC/land-offline_workflow.git

download the data

wget https://epic-sandbox-srw.s3.amazonaws.com/landda-data-2016.tar.gz

untar the data

tar xvfz landda-data-2016.tar.gz

cd into the workflow directory

cd land-offline_workflow/

load the module files

module use $PWD/modulefiles
module load landda_orion.intel

(change orion to hera for Hera)

create a build directory and cd into it

mkdir build
cd build

run the ecbuild command to configure

ecbuild ..

build everything

make -j 8

back up to the main directory

cd ..

submit a sample job that will run two days

sbatch submit_sample_DA_cycle_test.sh

check on the status

squeue -u $USER

watch the output go by

tail -f log err

need to hit control-c to exit tail, but then you can check for the background and analysis files in the cycle_land directory

ls -l ../cycle_land/DA_GHCN_test/mem000/restarts/vector/

TO DO:

Set up a test to verify that the current version is working correctly and that future changes don't break anything.

Separate out Inputs from LANDDAROOT variable.

Test on 2020/2021 data.

Organize environment variables so that they are all set with reasonable defaults in either DA sections or land driver sections and can be overridden.

Add a "check_environment.sh" script that will check to see if everything needed to run is in place.
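A rough idea of what the proposed check_environment.sh could contain is sketched below. The helper names and the lists of things to check are assumptions for illustration, not the project's actual requirements:

```shell
#!/bin/sh
# Hypothetical check_environment.sh helpers; what to check is an assumption.

# Return 0 if every named command is on PATH; report each missing one.
check_commands() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Return 0 if every named environment variable is set and non-empty.
check_variables() {
    rc=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset variable: $name" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

It might be called before submission as, for example, `check_commands ecbuild make sbatch && check_variables LANDDAROOT`, so that a missing piece is reported on the login node rather than inside a queued job.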

ulmononian commented 1 year ago

following this verbatim, failure on orion after this call:

/work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/DA_update//do_landDA_release.sh /work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/settings_DA_test. point of failure is here:

+ mkdir /work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi
mkdir: cannot create directory ‘/work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi’: No such file or directory

err log: /work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/log_noahmp.8925530.log

To get past this issue, line 80 of https://github.com/NOAA-EPIC/land-DA_update/blob/feature/release-v1.beta2/do_landDA_release.sh can be changed to mkdir -p $JEDIWORKDIR
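The failure is what plain mkdir does when a parent directory is missing, and -p is the standard remedy. A small demonstration in a scratch directory (the paths are illustrative only and merely mimic the layout in the error message):

```shell
#!/bin/sh
# Demonstration in a temporary directory; paths are illustrative only.
scratch=$(mktemp -d)
JEDIWORKDIR="$scratch/outputs/workdir/jedi"

# Plain mkdir fails because $scratch/outputs/workdir does not exist yet.
if mkdir "$JEDIWORKDIR" 2>/dev/null; then
    plain_result="created"
else
    plain_result="failed"
fi

# mkdir -p creates the missing parents first.
mkdir -p "$JEDIWORKDIR"
# A second call is not an error when the directory already exists.
mkdir -p "$JEDIWORKDIR"

echo "plain mkdir: $plain_result"
```

Because -p also tolerates an existing directory, the same line is safe on later cycles of the same experiment.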

ulmononian commented 1 year ago

suggested alteration above gets past that issue, but now hitting:

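+ cp /work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi/20160101.180000.coupler.res /work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi/mem_neg/20160101.180000.coupler.res
+ echo 'do_landDA: calling create ensemble'
+ python /work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/DA_update//letkf_create_ens.py 20160101.180000 snwdph 30
Traceback (most recent call last):
  File "/work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/DA_update//letkf_create_ens.py", line 1, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
+ [[ 1 != 0 ]]
+ echo 'letkf create failed'
+ exit 10
+ [[ 10 != 0 ]]
+ echo 'land DA script failed'
+ exit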

modules loaded are from orion.modules, namely:

[Screenshot of the loaded modules, 2023-02-09]

stack-python/4.12.0 actually points to /apps/miniconda-4.12.0/bin/python, which does not have numpy. since we can't modify a core python install, why are we not pointing to the epic miniconda with

module use /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
module load miniconda3/4.12.0

(which actually also does not have numpy, but we can install it there)? if it is problematic for us to use the spack stack env built using /apps/miniconda-4.12.0/bin/python but run with EPIC's miniconda, we can re-install the stack, but we might be fine without doing so.

ulmononian commented 1 year ago

i also had to install netcdf4 in the role.epic miniconda3 (i see you have this as a to-do) and adjusted the orion.module file. with the fix to line 80 of do_landDA_release.sh and these python updates, the restarts were successfully generated in /work2/noaa/epic-ps/cbook/land-test/cycle_land/DA_GHCN_test/mem000/restarts/vector.

i still think it is confusing that there exists both workdir and outputs in $LANDDAROOT, primarily because outputs then contains another folder called workdir which is empty (at least from my run).

the whole test i did is here: /work2/noaa/epic-ps/cbook/land-test

ClaraDraper-NOAA commented 1 year ago

@mark-a-potts I am dealing with an urgent issue for our reanalysis replay, and am unlikely to get to this before Monday.

barlage commented 1 year ago

I'm attempting concurrent tests on orion and hera.

hera: there is no hera.modules

orion: it seems there is a permission issue for me when loading modules

[orion:/work/noaa/stmp/mbarlage/epic/land-test/land-offline_workflow]$ source orion.modules 
Lmod has detected the following error:  Unable to load module because of error when
evaluating modulefile:

/work/noaa/epic-ps/role-epic-ps/spack-stack/envs/landda-release-1.0-intel/install/modulefiles/intel/2022.0.2/stack-python/3.9.12.lua:
Empty or non-existant file
[orion:/work/noaa/stmp/mbarlage/epic/land-test/land-offline_workflow]$ more /work/noaa/epic-ps/role-epic-ps/spack-stack/envs/landda-release-1.0-intel/install/modulefiles/intel/2022.0.2/stack-python/3.9.12.lua
/work/noaa/epic-ps/role-epic-ps/spack-stack/envs/landda-release-1.0-intel/install/modulefiles/intel/2022.0.2/stack-python/3.9.12.lua: Permission denied

mark-a-potts commented 1 year ago

I added in the hera.modules (forgot to add it to my commit earlier) and I chmod'ed all the modules under the role account on Orion, so you should be able to use the orion modules now. Note that they are not the final version for the modules and you need to "source orion.modules" to load them for now. We'll convert them to lua soon.

barlage commented 1 year ago

I'm getting netcdf4 errors on both hera and orion, I assume similar to @ulmononian , e.g.:

++ python -c 'import netCDF4'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'netCDF4'
+ [[ 1 != 0 ]]
+ echo 'no netcdf4, trying to install'

Re-ran both again from fresh clones, so logs have changed (same results).

hera: /scratch2/NCEPDEV/land/Michael.Barlage/epic/land-test/land-offline_workflow/err_noahmp.41870402.err

orion: /work/noaa/stmp/mbarlage/epic/land-test/land-offline_workflow/err_noahmp.8942940.err

mark-a-potts commented 1 year ago

Have you pulled the latest pushes to the repo? There should be a stanza at the beginning of submit_cycle.sh that at least tries to install netCDF4, though maybe it doesn't work from a compute node...

The stanza looks like this--

python -c 'import netCDF4'
if [[ $? != 0 ]]; then
  echo 'no netcdf4, trying to install'
  python -m pip install netCDF4
  python -c 'import netCDF4'
  if [[ $? != 0 ]]; then
    echo "could not install netCDF4 automatically. Please add netCDF4 module manually and re-run"
    exit
  fi
fi

barlage commented 1 year ago

I didn't add the full error log but you can see that it tries but fails (the hera run eventually times out I think).

+ python -m pip install netCDF4
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630d2130>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630d2d30>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630d2ee0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630f60d0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630f6280>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
ERROR: Could not find a version that satisfies the requirement netCDF4 (from versions: none)
ERROR: No matching distribution found for netCDF4

mark-a-potts commented 1 year ago

Okay. I'll have to re-think that. It should be part of the python package that gets loaded to begin with, really.
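Since the compute nodes have no network access, one alternative to installing on the fly is to check for the module and stop early with a clear message. A minimal sketch; `require_python_module` is a hypothetical helper, and the `PYTHON` variable naming the interpreter is an assumption here:

```shell
#!/bin/sh
# Sketch: verify that a Python module is importable and stop with a clear
# message if it is not, without attempting a network install.
# PYTHON is assumed to name the interpreter to use; it defaults to "python".
require_python_module() {
    module_name=$1
    if "${PYTHON:-python}" -c "import $module_name" >/dev/null 2>&1; then
        return 0
    fi
    echo "Python module '$module_name' not found for ${PYTHON:-python}." >&2
    echo "Install it from a login node (or load a module that provides it) and re-run." >&2
    return 1
}
```

Placed near the top of a job script as `require_python_module netCDF4 || exit 1`, this fails in seconds instead of waiting for pip's connection retries to time out.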

ulmononian commented 1 year ago

Okay. I'll have to re-think that. It should be part of the python package that gets loaded to begin with, really.

@mark-a-potts i will say this again here (i mentioned this in a previous comment): why not just load the miniconda from the role.epic and ensure that netcdf4 is installed there? (i already did this on orion and resolved the issue there; it only remains to be done on hera.) since your module file loads "stack-python", it is loading /apps/miniconda-4.12.0/bin/python on orion but already loads /scratch1/NCEPDEV/nems/role.epic/miniconda3/4.12.0/bin/python on hera, since that stack was built referencing EPIC's miniconda.

not sure if a good idea to be installing netcdf into the python that a user is trying to use, assuming they in theory have modified the modules loaded. further, i don't believe we can modify the stack-python currently loaded on orion (/apps/miniconda-4.12.0/bin/python) so a pip install would not work there either.

ulmononian commented 1 year ago

hera python updated to include netcdf4. @barlage please give it a shot now :) you can always check before you run the job w/ python -c "import netCDF4" but it did import for me.

barlage commented 1 year ago

@mark-a-potts @ulmononian I believe that I just successfully ran on hera. I needed to make one change in do_landDA_release.sh, adding -p to line 75:

mkdir -p $JEDIWORKDIR

otherwise I think everything was out-of-the-box. When this mkdir change gets in, I will have someone else locally test on hera.

ulmononian commented 1 year ago

@mark-a-potts @ulmononian I believe that I just successfully ran on hera. I needed to make one change in do_landDA_release.sh, adding -p to line 75:

mkdir -p $JEDIWORKDIR

otherwise I think everything was out-of-the-box. When this mkdir change gets in, I will have someone else locally test on hera.

@barlage i had the same issue in my testing yesterday on orion (see my first comment here) but could not suggest the change in my review since the script is part of DA_update. had the same issue when i built/ran with the newest commits today. had to add the -p when creating the $JEDIWORKDIR.

ulmononian commented 1 year ago

linking issues are back (see log here /work2/noaa/epic-ps/cbook/land-test-2/land-offline_workflow/log_noahmp.8944368.log). i noticed find_package( fv3-bundle REQUIRED) has been commented out of the land-offline_workflow CMakeLists.txt, so guessing that is why it can't find ${fv3-bundle_BASE_DIR} in line 52 (https://github.com/NOAA-EPIC/land-offline_workflow/blob/f0141f35469d523d6f86b61b7623affb0c63f9dc/CMakeLists.txt#L53).

@mark-a-potts i know you mentioned we may have to cp the files rather than softlink at this point, but i will just throw my solution out there one more time: https://github.com/NOAA-EPIC/land-DA_update/pull/1. it just creates the links on the first cycle run and does not link again if they already exist.
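The create-once behaviour described here can be sketched as follows. `link_once` is a hypothetical helper written for illustration; it is not taken from the linked pull request:

```shell
#!/bin/sh
# Sketch of create-once linking; link_once is a hypothetical helper.
link_once() {
    target=$1
    link=$2
    # -e follows symlinks and -L also catches a dangling link;
    # in either case something is already there, so leave it alone.
    if [ -e "$link" ] || [ -L "$link" ]; then
        return 0
    fi
    ln -s "$target" "$link"
}
```

Calling it on every cycle is then harmless: the first call creates the link and later calls return without touching it.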

mark-a-potts commented 1 year ago

Ah yes, I forgot I changed that. You need to run ecbuild with this command (after setting EPICHOME correctly for your platform) now--

ecbuild -Dfv3-bundle_BASE_DIR=$EPICHOME/contrib/fv3-bundle ..

We should probably switch this around to use an environment variable that is set in the modulefile for the platform and gets picked up by CMake.

-M
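As an illustration of driving this from the environment, the ecbuild argument could be assembled from a variable. This sketch does the lookup on the shell side rather than inside CMakeLists.txt, and `FV3_BUNDLE_BASE_DIR` is a hypothetical variable that a platform modulefile might set; the fallback mirrors the $EPICHOME/contrib/fv3-bundle path above:

```shell
#!/bin/sh
# Sketch: build the ecbuild argument from the environment.
# FV3_BUNDLE_BASE_DIR is a hypothetical variable; if it is unset, fall back
# to $EPICHOME/contrib/fv3-bundle when EPICHOME is set.
fv3_bundle_arg() {
    base=${FV3_BUNDLE_BASE_DIR:-${EPICHOME:+$EPICHOME/contrib/fv3-bundle}}
    if [ -z "$base" ]; then
        echo "set FV3_BUNDLE_BASE_DIR or EPICHOME first" >&2
        return 1
    fi
    echo "-Dfv3-bundle_BASE_DIR=$base"
}
```

The configure step would then read `ecbuild "$(fv3_bundle_arg)" ..` from inside the build directory.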

On 2/10/23 12:40 PM, Cameron Book wrote:

linking issues are back (see log here /work2/noaa/epic-ps/cbook/land-test-2/land-offline_workflow/log_noahmp.8944368.log). i noticed find_package( fv3-bundle REQUIRED) has been commented out of the land-offline_workflow CMakeLists.txt, so guessing that is why it can't find ${fv3-bundle_BASE_DIR} in line 52 (https://github.com/NOAA-EPIC/land-offline_workflow/blob/f0141f35469d523d6f86b61b7623affb0c63f9dc/CMakeLists.txt#L53).

@mark-a-potts i know you mentioned we may have to cp the files rather than softlink at this point, but i will just throw my solution out there one more time: https://github.com/NOAA-EPIC/land-DA_update/pull/1. just run the links upon the first cycle run and does not link again if they exist.


mark-a-potts commented 1 year ago

I just pushed a change to DA_update that changes that "python" to "${PYTHON}" which should work (hopefully). The change also added in the "-p" to the mkdir calls.

-M

On 2/9/23 1:01 PM, Cameron Book wrote:

suggested alteration above gets past that issue, but now hitting:

+ cp /work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi/20160101.180000.coupler.res /work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi/mem_neg/20160101.180000.coupler.res
+ echo 'do_landDA: calling create ensemble'
+ python /work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/DA_update//letkf_create_ens.py 20160101.180000 snwdph 30
Traceback (most recent call last):
  File "/work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/DA_update//letkf_create_ens.py", line 1, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
+ [[ 1 != 0 ]]
+ echo 'letkf create failed'
+ exit 10
+ [[ 10 != 0 ]]
+ echo 'land DA script failed'
+ exit

modules loaded are from orion.modules, namely:

Screen Shot 2023-02-09 at 10 00 53 AM https://user-images.githubusercontent.com/43379611/217898834-bbc41841-1e5f-4d36-8791-fbca01d60189.png


barlage commented 1 year ago

@ulmononian I did see your earlier comment and assumed it was the same issue, but had lost my orion session and was too lazy to log back in to confirm that your line 80 was my line 75.

ulmononian commented 1 year ago

@ulmononian I did see your earlier comment and assumed it was the same issue, but had lost my orion session and was too lazy to log back in to confirm that your line 80 was my line 75.

lol. just glad you were able to reproduce the error to confirm not the usual user error on my side. @mark-a-potts thanks for pushing those changes!!

mark-a-potts commented 1 year ago

This morning, I took the monolithic submit_sample_DA_cycle_test.sh and separated it into a submit_cycle.sh and do_submit_cycle.sh that should track largely with what you were doing before. I removed the setup of the directories from do_submit_cycle.sh since that is now being done in submit_cycle.sh and there are a couple of other changes with regards to restart files which may need to be changed. If these changes look reasonable, it would be really helpful to approve them so that we can get the release branch cut and all work from that. There are several other PRs that are all waiting to be done and we can make changes to the release branch according to what you would like to see there.

-M

On 2/15/23 11:30 AM, ClaraDraper-NOAA wrote:

ClaraDraper-NOAA commented on this pull request.

In submit_cycle.sh https://github.com/NOAA-PSL/land-offline_workflow/pull/25#discussion_r1107361521:

THISDATE=$STARTDATE
date_count=0

+vec2tileexec=${BUILDDIR}/bin/vector2tile_converter.exe

The intention is that jobs are submitted using do_submit_cycle.sh, which stages the necessary files and sets any variables needed (basically, it does everything that needs to be done once per experiment). It then calls submit_cycle.sh, which cycles through the forecasts and DA, and will re-submit itself if need be (i.e., if we want to submit a number of smaller jobs, rather than one big job). At the moment, the PR has changes to do_submit_cycle.sh and submit_cycle.sh, as well as the addition of new submission scripts. Are you still using do_submit_cycle.sh and submit_cycle.sh, or is your intention that the new scripts replace those? We shouldn't really need separate submit scripts for different jobs - I only created the test one for convenience.

The changes to the scripts are much more extensive than I was expecting (in a separate project, we have altered my scripts to run on the cloud using containers, by creating new settings files, changing the executable calls, and making minimal other changes, as needed). I am almost certain that the initial functionality of do_submit_cycle.sh and submit_cycle.sh on hera has not been retained. It is going to be very difficult / a lot of work for me to merge these changes into my develop branch. Either we need to revert this PR to follow the original design of the scripts, or I suggest I just merge it without really looking at it, with the understanding that this work will not make it back into develop (unless it's fixed up later).
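The cycling and re-submission design described above can be pictured with a toy loop. This is not the real submit_cycle.sh: `run_cycles` is a hypothetical function, cycle indices stand in for dates, and the messages stand in for the forecast, DA and job-submission steps:

```shell
#!/bin/sh
# Toy model of the described design, not the real submit_cycle.sh:
# process up to cycles_per_job cycles, then report where a follow-on job
# would have to resume.
run_cycles() {
    next=$1            # index of the first cycle to run in this job
    total=$2           # total number of cycles in the experiment
    cycles_per_job=$3  # maximum number of cycles handled by one job
    done_here=0
    while [ "$next" -lt "$total" ] && [ "$done_here" -lt "$cycles_per_job" ]; do
        echo "forecast+DA for cycle $next"
        next=$((next + 1))
        done_here=$((done_here + 1))
    done
    if [ "$next" -lt "$total" ]; then
        # A real script might re-submit itself to the scheduler at this point.
        echo "resubmit from cycle $next"
    else
        echo "experiment complete"
    fi
}
```

In this picture the once-per-experiment setup belongs in the caller, and the loop only needs to know where to start and how many cycles one job may run, for example `run_cycles 0 5 2`.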


ClaraDraper-NOAA commented 1 year ago


Thanks Mark. What's the reason for moving the directory set-up into submit_cycle.sh? Let me know when you've pushed something that you want me to look at.

mark-a-potts commented 1 year ago

@ClaraDraper-NOAA The changes are in place now. The do_submit_cycle.sh script defaults to using settings_DA_cycle_gdas, but will also work with settings_DA_cycle_era5 (using 2020/21 data). Any other settings files may need to be tweaked, but haven't been tested.

ClaraDraper-NOAA commented 1 year ago

What's the reason for moving the directory set-up into submit_cycle.sh?

mark-a-potts commented 1 year ago

Subdirectories were being created in submit_cycle.sh as well as top level in do_submit_cycle.sh, so the creation was all consolidated.

ClaraDraper-NOAA commented 1 year ago

Subdirectories were being created in submit_cycle.sh as well as top level in do_submit_cycle.sh, so the creation was all consolidated.

In my original submit_cycle.sh there are no directories created. I will do a formal review later today, but unless there's a reason it needs to be moved, I'm going to ask you to move all of the directory creation back to do_submit_cycle.sh. I'm trying to minimize the code changes, to make it easier to merge.