Closed: mark-a-potts closed this 1 year ago
following this verbatim, failure on orion after this call:
/work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/DA_update//do_landDA_release.sh /work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/settings_DA_test
Point of failure is here:
+ mkdir /work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi
mkdir: cannot create directory ‘/work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi’: No such file or directory
err log: /work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/log_noahmp.8925530.log
To get past this issue, line 80 of https://github.com/NOAA-EPIC/land-DA_update/blob/feature/release-v1.beta2/do_landDA_release.sh can be changed to mkdir -p $JEDIWORKDIR
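For context, plain mkdir fails when parent directories are missing or when the target already exists, while mkdir -p handles both. A minimal sketch of the suggested fix (the demo path below is hypothetical, standing in for the JEDIWORKDIR set in do_landDA_release.sh):

```shell
# Hypothetical stand-in for the JEDIWORKDIR defined in do_landDA_release.sh.
JEDIWORKDIR=${JEDIWORKDIR:-$(mktemp -d)/outputs/workdir/jedi}

# mkdir -p creates any missing parent directories and succeeds even if
# the target already exists, so repeated cycle runs are safe.
mkdir -p "$JEDIWORKDIR"
mkdir -p "$JEDIWORKDIR"   # second call is a no-op, not an error
```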
suggested alteration above gets past that issue, but now hitting:
+ cp /work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi/20160101.180000.coupler.res /work2/noaa/epic-ps/cbook/land-test/outputs/workdir/jedi/mem_neg/20160101.180000.coupler.res
+ echo 'do_landDA: calling create ensemble'
+ python /work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/DA_update//letkf_create_ens.py 20160101.180000 snwdph 30
Traceback (most recent call last):
  File "/work2/noaa/epic-ps/cbook/land-test/land-offline_workflow/DA_update//letkf_create_ens.py", line 1, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
+ [[ 1 != 0 ]]
+ echo 'letkf create failed'
+ exit 10
+ [[ 10 != 0 ]]
+ echo 'land DA script failed'
+ exit
modules loaded are from orion.modules, namely:
stack-python/4.12.0 actually points to /apps/miniconda-4.12.0/bin/python, which does not have numpy. since we can't modify a core python install, why are we not pointing to the EPIC miniconda with
module use /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
module load miniconda3/4.12.0
(which also does not have numpy yet, but we can install it there)? if it is problematic to use the spack-stack env built against /apps/miniconda-4.12.0/bin/python while running with EPIC's miniconda, we can re-install the stack, but we might be fine without doing so.
i also had to install netcdf4 in the role.epic miniconda3 (i see you have this as a to-do) and adjusted the orion.module file. with the fix to line 80 of do_landDA_release.sh and these python updates, the restarts were successfully generated in /work2/noaa/epic-ps/cbook/land-test/cycle_land/DA_GHCN_test/mem000/restarts/vector.
i still think it is confusing that there exists both workdir and outputs in $LANDDAROOT, primarily because outputs then contains another folder called workdir, which is empty (at least from my run).
the whole test i did is here: /work2/noaa/epic-ps/cbook/land-test
@mark-a-potts I am dealing with an urgent issue for our reanalysis replay, and am unlikely to get to this before Monday.
I'm attempting concurrent tests on orion and hera.
hera: there is no hera.modules
orion: it seems there is a permission issue for me when loading modules
[orion:/work/noaa/stmp/mbarlage/epic/land-test/land-offline_workflow]$ source orion.modules
Lmod has detected the following error: Unable to load module because of error when
evaluating modulefile:
/work/noaa/epic-ps/role-epic-ps/spack-stack/envs/landda-release-1.0-intel/install/modulefiles/intel/2022.0.2/stack-python/3.9.12.lua:
Empty or non-existant file
[orion:/work/noaa/stmp/mbarlage/epic/land-test/land-offline_workflow]$ more /work/noaa/epic-ps/role-epic-ps/spack-stack/envs/landda-release-1.0-intel/install/modulefiles/intel/2022.0.2/stack-python/3.9.12.lua
/work/noaa/epic-ps/role-epic-ps/spack-stack/envs/landda-release-1.0-intel/install/modulefiles/intel/2022.0.2/stack-python/3.9.12.lua: Permission denied
I added the hera.modules file (I forgot to include it in my earlier commit) and chmod'ed all the modules under the role account on Orion, so you should be able to use the orion modules now. Note that these are not the final versions of the modules, and for now you need to "source orion.modules" to load them. We'll convert them to lua soon.
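The permission fix can be sketched generically; the function name below is hypothetical, and in practice the argument would be the modulefiles tree from the Lmod error above:

```shell
# Make a role account's modulefiles readable (and directories traversable)
# by all users, without touching write or unrelated execute bits.
# share_modulefiles is a hypothetical helper name.
share_modulefiles() {
    chmod -R a+rX "$1"
}

# Demo against a scratch directory standing in for the real tree:
demo=$(mktemp -d)/modulefiles
mkdir -p "$demo"
echo 'help("stack-python")' > "$demo/3.9.12.lua"
chmod 600 "$demo/3.9.12.lua"      # simulate the unreadable modulefile
share_modulefiles "$demo"
```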
I'm getting netcdf4 errors on both hera and orion, I assume similar to @ulmononian , e.g.:
++ python -c 'import netCDF4'
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'netCDF4'
+ [[ 1 != 0 ]]
+ echo 'no netcdf4, trying to install'
Re-ran both again from fresh clones, so logs have changed (same results).
hera: /scratch2/NCEPDEV/land/Michael.Barlage/epic/land-test/land-offline_workflow/err_noahmp.41870402.err
orion: /work/noaa/stmp/mbarlage/epic/land-test/land-offline_workflow/err_noahmp.8942940.err
Have you pulled the latest pushes to the repo? There should be a stanza at the beginning of submit_cycle.sh that at least tries to install netCDF4, though maybe it doesn't work from a compute node...
The stanza looks like this--
python -c 'import netCDF4'
if [[ $? != 0 ]]; then
    echo 'no netcdf4, trying to install'
    python -m pip install netCDF4
    python -c 'import netCDF4'
    if [[ $? != 0 ]]; then
        echo "could not install netCDF4 automatically. Please add netCDF4 module manually and re-run"
        exit
    fi
fi
I didn't add the full error log but you can see that it tries but fails (the hera run eventually times out I think).
+ python -m pip install netCDF4
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630d2130>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630d2d30>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630d2ee0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630f60d0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x2b60630f6280>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/netcdf4/
ERROR: Could not find a version that satisfies the requirement netCDF4 (from versions: none)
ERROR: No matching distribution found for netCDF4
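Since compute nodes may have no outbound network access, the check could be factored to fail fast rather than letting pip retry and the job time out. This is a sketch with hypothetical function names, not the repo's actual stanza:

```shell
PYTHON=${PYTHON:-python3}

# Return 0 if $PYTHON can import the named module, non-zero otherwise.
check_py_module() {
    "$PYTHON" -c "import $1" >/dev/null 2>&1
}

# Verify a dependency, attempting a pip install only as a fallback.
# On air-gapped compute nodes the right fix is a module load instead.
ensure_py_module() {
    check_py_module "$1" && return 0
    echo "no $1, trying to install"
    "$PYTHON" -m pip install "$1" >/dev/null 2>&1 || true
    check_py_module "$1" || {
        echo "could not install $1 automatically; load a python that provides it and re-run" >&2
        return 1
    }
}
```

Calling something like `ensure_py_module netCDF4 || exit 1` at the top of submit_cycle.sh would then mark the batch job failed immediately instead of burning wallclock on pip retries.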
Okay. I'll have to re-think that. It should be part of the python package that gets loaded to begin with, really.
@mark-a-potts i will say this again here (i mentioned this in a previous comment): why not just load the miniconda from role.epic and ensure that netcdf4 is installed there? (i already did this on orion and resolved the issue there; it only remains to be done on hera). since your module file loads "stack-python", it is loading /apps/miniconda-4.12.0/bin/python on orion, but already loads /scratch1/NCEPDEV/nems/role.epic/miniconda3/4.12.0/bin/python on hera, since that stack was built referencing EPIC's miniconda.
i'm not sure it is a good idea to install netcdf into whatever python a user happens to be using, given that they may in theory have modified the loaded modules. further, i don't believe we can modify the stack-python currently loaded on orion (/apps/miniconda-4.12.0/bin/python), so a pip install would not work there either.
hera python updated to include netcdf4. @barlage please give it a shot now :) you can always check before you run the job with python -c "import netCDF4"; it did import for me.
@mark-a-potts @ulmononian I believe that I just successfully ran on hera. I needed to make one change in do_landDA_release.sh, adding -p to line 75:
mkdir -p $JEDIWORKDIR
Otherwise I think everything was out-of-the-box. When this mkdir change gets in, I will have someone else locally test on hera.
@barlage i had the same issue in my testing yesterday on orion (see my first comment here) but could not suggest the change in my review since the script is part of DA_update. had the same issue when i built/ran with the newest commits today, and had to add the -p when creating the $JEDIWORKDIR.
linking issues are back (see log here: /work2/noaa/epic-ps/cbook/land-test-2/land-offline_workflow/log_noahmp.8944368.log). i noticed find_package( fv3-bundle REQUIRED) has been commented out of the land-offline_workflow CMakeLists.txt, so i'm guessing that is why it can't find ${fv3-bundle_BASE_DIR} in line 52 (https://github.com/NOAA-EPIC/land-offline_workflow/blob/f0141f35469d523d6f86b61b7623affb0c63f9dc/CMakeLists.txt#L53).
@mark-a-potts i know you mentioned we may have to cp the files rather than softlink at this point, but i will just throw my solution out there one more time: https://github.com/NOAA-EPIC/land-DA_update/pull/1. just run the links upon the first cycle run and does not link again if they exist.
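That PR's approach (link on the first cycle, skip if already present) amounts to an idempotent symlink helper, roughly like this sketch (function name and demo file names are hypothetical):

```shell
# Create the symlink only if the destination does not already exist,
# so repeated cycle runs neither fail nor re-link.
link_once() {
    local src=$1 dst=$2
    [ -e "$dst" ] || ln -s "$src" "$dst"
}

# Demo with scratch files standing in for the fv3-bundle artifacts:
work=$(mktemp -d)
touch "$work/source.res"
link_once "$work/source.res" "$work/linked.res"
link_once "$work/source.res" "$work/linked.res"   # no-op on the second cycle
```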
Ah yes, I forgot I changed that. You need to run ecbuild with this command (after setting EPICHOME correctly for your platform) now--
ecbuild -Dfv3-bundle_BASE_DIR=$EPICHOME/contrib/fv3-bundle ..
We should probably switch this around to use an environment variable that is set in the modulefile for the platform and gets picked up by CMake.
-M
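One way that could look: the platform modulefile exports a variable (FV3BUNDLE_DIR here is a hypothetical name), and a small wrapper passes it to ecbuild, falling back to the documented default:

```shell
# Configure the build against fv3-bundle, preferring an environment
# variable that a platform modulefile could export.  FV3BUNDLE_DIR is
# a hypothetical name; the fallback matches the path used above.
configure_land_da() {
    local bundle_dir=${FV3BUNDLE_DIR:-$EPICHOME/contrib/fv3-bundle}
    ecbuild -Dfv3-bundle_BASE_DIR="$bundle_dir" ..
}
```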
I just pushed a change to DA_update that changes that "python" to "${PYTHON}" which should work (hopefully). The change also added in the "-p" to the mkdir calls.
-M
@ulmononian I did see your earlier comment and assumed it was the same issue, but had lost my orion session and was too lazy to log back in to confirm that your line 80 was my line 75.
lol. just glad you were able to reproduce the error to confirm not the usual user error on my side. @mark-a-potts thanks for pushing those changes!!
This morning, I took the monolithic submit_sample_DA_cycle_test.sh and separated it into a submit_cycle.sh and do_submit_cycle.sh that should track largely with what you were doing before. I removed the setup of the directories from do_submit_cycle.sh since that is now being done in submit_cycle.sh and there are a couple of other changes with regards to restart files which may need to be changed. If these changes look reasonable, it would be really helpful to approve them so that we can get the release branch cut and all work from that. There are several other PRs that are all waiting to be done and we can make changes to the release branch according to what you would like to see there.
-M
On 2/15/23 11:30 AM, ClaraDraper-NOAA wrote:
In submit_cycle.sh https://github.com/NOAA-PSL/land-offline_workflow/pull/25#discussion_r1107361521:
 THISDATE=$STARTDATE
 date_count=0
+vec2tileexec=${BUILDDIR}/bin/vector2tile_converter.exe
The intention is that jobs are submitted using do_submit_cycle.sh, which stages the necessary files, sets any variables needed (basically does everything that needs to be done once per experiment). It then calls submit_cycle.sh, which cycles through the forecasts and DA, and will re-submit itself if need be (i.e., if we want to submit a number of smaller jobs, rather than one big job). At the moment, the PR has changes to do_submit_cycle.sh and submit_cycle.sh, as well as the addition of new submission scripts. Are you still using do_submit_cycle.sh and submit_cycle.sh, or is your intention that the new scripts replace those? we shouldn't really need separate submit scripts for different jobs - I only created the test one for convenience.
The changes to the scripts are much more extensive than I was expecting ( in a separate project, we have altered my scripts to run on the cloud using containers, by creating new settings files, changing the executable calls, and making minimal other changes, as needed). I am almost certain that the initial functionality of do_submit_cycle.sh and submit_cycle.sh on hera has not been retained. It is going to be very difficult / a lot of work for me to merge these changes into my develop branch. Either we need to revert this PR to follow the original design of the scripts, or I suggest I just merge it without really looking at it with the understanding that this work will not make it back into develop (unless it's fixed up later).
Thanks Mark. What's the reason for moving the directory set-up into submit_cycle.sh? Let me know when you've pushed something that you want me to look at.
@ClaraDraper-NOAA The changes are in place now. The do_submit_cycle.sh script defaults to using settings_DA_cycle_gdas, but will also work with settings_DA_cycle_era5 (using 2020/21 data). Any other settings files may need to be tweaked, but haven't been tested.
What's the reason for moving the directory set-up into submit_cycle.sh?
Subdirectories were being created in submit_cycle.sh as well as top level in do_submit_cycle.sh, so the creation was all consolidated.
Subdirectories were being created in submit_cycle.sh as well as top level in do_submit_cycle.sh, so the creation was all consolidated.
In my original submit_cycle.sh there are no directories created. I will do a formal review later today, but unless there's a reason it needs to be moved, I'm going to ask you to move all of the directory creation back to do_submit_cycle.sh. I'm trying to minimize the code changes to make it easier to merge.
This PR includes updates to DA_update and is mostly designed to make the system more portable.
Currently, the submit_cycle.sh script is set to use settings_cycle_test and is expecting input data to be in the "inputs" directory that is in the parent directory of land-offline_workflow. Ultimately, I would like to separate the two, but this is an attempt to get everyone working from a common point and get testing underway.
This has been tested with 2016 data and can be run as follows--
create a test directory that will be LANDDAROOT
cd land-test
clone the feature branch recursively
git clone -b feature/release-v1.beta2 --recursive https://github.com/NOAA-EPIC/land-offline_workflow.git
download the data
wget https://epic-sandbox-srw.s3.amazonaws.com/landda-data-2016.tar.gz
untar the data
tar xvfz landda-data-2016.tar.gz
cd into the workflow directory
cd land-offline_workflow/
load the module files
module use $PWD/modulefiles
module load landda_orion.intel (change to hera for Hera)
create a build directory and cd into it
mkdir build
cd build
run the ecbuild command to configure
ecbuild ..
build everything
make -j 8
back up to the main directory
cd ..
submit a sample job that will run two days
sbatch submit_sample_DA_cycle_test.sh
check on the status
squeue -u $USER
watch the output go by
tail -f log err
need to hit control-c to exit tail, but then you can check for the background and analysis files in the cycle_land directory
ls -l ../cycle_land/DA_GHCN_test/mem000/restarts/vector/
TO DO:
- Set up a test to verify that the current version is working correctly and that future changes don't break anything.
- Separate out Inputs from the LANDDAROOT variable.
- Test on 2020/2021 data.
- Organize environment variables so that they are all set with reasonable defaults in either the DA sections or the land driver sections and can be overridden.
- Add a "check_environment.sh" script that will check to see if everything needed to run is in place.