Can you make a `pip list` of the runtime python env?
@hellkite500, sure: `pip list` for ngen worker image
Can you try with pyarrow 11? Still not sure that underlying issue has been completely addressed upstream.
Yeah I suspect it is either pyarrow or tables. How are you installing tables?
I've tweaked the image to ensure pyarrow 11.0.0 is installed. This was the command to install tables:
env HDF5_DIR=/usr pip3 install --no-cache-dir --no-build-isolation tables
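As a quick sanity check inside the rebuilt image (illustrative only), the installed versions can be confirmed with:

```python
# Illustrative check inside the ngen worker image: confirm the versions of the
# packages suspected above actually ended up installed.
import pyarrow
import tables

print("pyarrow:", pyarrow.__version__)  # expecting 11.0.0 after the tweak
print("tables:", tables.__version__)
```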
I may be installing t-route incorrectly somehow, as I'm getting this error now. I'll continue looking into it.
FAIL: Unable to import a supported routing module.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ModuleNotFoundError: No module named 'troute.config'
At:
/usr/local/lib/python3.9/site-packages/nwm_routing/input.py(10): <module>
<frozen importlib._bootstrap>(228): _call_with_frames_removed
<frozen importlib._bootstrap_external>(850): exec_module
<frozen importlib._bootstrap>(695): _load_unlocked
<frozen importlib._bootstrap>(986): _find_and_load_unlocked
<frozen importlib._bootstrap>(1007): _find_and_load
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(17): <module>
<frozen importlib._bootstrap>(228): _call_with_frames_removed
<frozen importlib._bootstrap_external>(850): exec_module
<frozen importlib._bootstrap>(695): _load_unlocked
<frozen importlib._bootstrap>(986): _find_and_load_unlocked
<frozen importlib._bootstrap>(1007): _find_and_load
There is a new package/step needed with recent versions of t-route.
As an aside, I still have trouble installing the `netCDF4` Python package. I can make the image work with v1.6.3 if I use the binary package, but if I ever try to build it (even going the route of cloning the source tree), the build dependencies won't properly bring in `mpi4py`. I don't think at this point that's contributing to the primary error, but it could be an issue later.
Yeah it looks like `troute.config` is not being installed by t-route's install script. You can install it with:
pip install "git+https://github.com/noaa-owp/t-route@master#egg=troute_config&subdirectory=src/troute-config"
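A quick way to confirm the package actually landed (a minimal check, not part of the install script):

```python
# Minimal post-install check: confirm troute.config is importable and show
# where it was installed from.
import importlib

mod = importlib.import_module("troute.config")
print(mod.__file__)
```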
Sorry, was AFK. Just looked at the install script and it looks like it should be installing `troute.config`.
@robertbartel, are you checking out a specific commit or branch?
I may have the issues fixed in the image to get t-route working, though now I am running into some peculiar configuration validation errors:
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ValidationError: 5 validation errors for Config
compute_parameters -> data_assimilation_parameters -> streamflow_da -> lastobs_output_folder
extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> streamflow_da -> wrf_hydro_lastobs_file
extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> gage_lakeID_crosswalk_file
extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> reservoir_persistence_usace
extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> reservoir_persistence_usgs
extra fields not permitted (type=value_error.extra)
@yuqiong77 provided the original config I was using for testing. I don't have enough experience with t-route to sanity check things beyond ~not seeing these "extra fields" in the t-route config documentation~ (correction, they are in the example file ... I'll need to dig some more on that), but they are specific enough for me to remain a bit uncertain.
Regardless, I am at least going to tweak the configuration and run tests until I get a successful job completion.
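For context, the "extra fields not permitted (type=value_error.extra)" wording is what pydantic v1 emits when a model forbids unknown keys; a minimal sketch of that behavior (hypothetical model below, not t-route's actual schema):

```python
# Hypothetical model, just to show why unrecognized config keys fail validation
# when a schema forbids extras (as the t-route v4 config appears to).
from pydantic import BaseModel, Extra  # pydantic v1 API


class StreamflowDA(BaseModel):
    class Config:
        extra = Extra.forbid  # unknown keys become validation errors

    streamflow_nudging: bool = False


# A v3-era key that this (hypothetical) v4-style schema no longer declares:
StreamflowDA(streamflow_nudging=False, lastobs_output_folder="lastobs/")
# pydantic.ValidationError: ... extra fields not permitted (type=value_error.extra)
```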
Happy New Year! I'm pressed for time to complete some multi-year streamflow simulation runs (either within the ngen image Bobby has helped build or as a post-processing step) for my AMS presentation. My sincerest thanks to you all for looking into the t-route issue.
I'm going to put together at least a draft PR for this to build images for @yuqiong77, but I'm still running into an error. It does appear to be a more t-route-specific problem - perhaps still related to the configuration - and not one with the image.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): AttributeError: 'NoneType' object has no attribute 'get'
At:
/usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(32): read_geopkg
/usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(154): read_geo_file
/usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(253): __init__
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(80): main_v04
Doing some limited checking, it looks like this is implying `data_assimilation_parameters` is None from the start, but there is at least something configured (and uncommented) in that section of the t-route config file I'm using. Again, we've gone outside my expertise, and perhaps the configuration simply needs some adjustment.
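For what it's worth, this kind of AttributeError usually shows up when a YAML section parses as None (e.g., a heading whose children are all commented out) and the code then calls .get() on it; a minimal sketch of that assumed failure mode (not the actual t-route code):

```python
# Minimal sketch (assumed failure mode, not t-route's actual code): a YAML
# section whose children are all commented out parses as None, and a later
# .get() on it raises the AttributeError seen above.
import yaml

cfg = yaml.safe_load("""
compute_parameters:
  data_assimilation_parameters:
    # streamflow_da: ...
""")

section = cfg["compute_parameters"]["data_assimilation_parameters"]
print(section)                # None
section.get("streamflow_da")  # AttributeError: 'NoneType' object has no attribute 'get'
```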
@robertbartel Thanks. I also suspect the config I used (which was based on an example found in the t-route repository a few weeks ago) may have some issues. The example config file looks quite different from the t-route config files I used back in 2022, which did not have a data assimilation section.
Looking at the DA section of the current config, I think the only line that may cause an issue is the following:
lastobs_output_folder : lastobs/
What if we comment out that line?
There seem to be at least some t-route problems contributing to this, which I've opened issue NOAA-OWP/t-route#719 to track.
I think the problems in part are due to using a troute v3.0 config with troute v4.0 execution. If I tweak part of the `data_assimilation_parameters` config like this:
reservoir_da:
    #----------
    reservoir_persistence_da:
        reservoir_persistence_usgs : False
        reservoir_persistence_usace : False
Then I get past the earlier attribute and validation errors, although now I run into this/these:
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): KeyError: 'downstream'
At:
/usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3798): get_loc
/usr/local/lib64/python3.9/site-packages/pandas/core/frame.py(3893): __getitem__
/usr/local/lib/python3.9/site-packages/geopandas/geodataframe.py(1474): __getitem__
/usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(352): preprocess_network
/usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(269): __init__
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(80): main_v04
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 8 leaked semlock objects to clean up at shutdown
warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown
warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_22_7605d254530b40fc919513833b8b0a71_79e4edc7c8d34a3cbd66375e0821ed87: FileNotFoundError(2, 'No such file or directory')
warnings.warn(f"resource_tracker: {name}: {e!r}")
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_22_7605d254530b40fc919513833b8b0a71_50b6587dc9f240e98727d944908e48e2: FileNotFoundError(2, 'No such file or directory')
warnings.warn(f"resource_tracker: {name}: {e!r}")
Hi Bobby,
Thanks for figuring out the mismatch between t-route config and execution. I found a v4 example of the config in the repository:
Based on that, I modified my config file on UCS6:
/local/model_as_a_service/yuqiong/data/troute_config.yaml
I now get the following error:
Finished 744 timesteps.
creating supernetwork connections set
2024-01-09 00:37:13,738 INFO [AbstractNetwork.py:489 - create_independent_networks()]: organizing connections into reaches ...
2024-01-09 00:37:13,785 DEBUG [AbstractNetwork.py:518 - create_independent_networks()]: reach organization complete in 0.04627418518066406 seconds.
2024-01-09 00:37:13,785 INFO [AbstractNetwork.py:646 - initial_warmstate_preprocess()]: setting channel initial states ...
2024-01-09 00:37:13,785 DEBUG [AbstractNetwork.py:701 - initial_warmstate_preprocess()]: channel initial states complete in 0.0003256797790527344 seconds.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ZeroDivisionError: division by zero
At:
/usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(801): build_forcing_sets
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(108): main_v04
Any hint?
Indeed, I encountered the ZeroDivisionError as well. I made some further modifications to the config - mostly under `forcing_parameters` - to get to the config I'll attach here. I think at this point the troute config is valid and the Docker image is built properly (with respect to troute). Note that I had to compress it to get it to attach, so you'll need to gunzip it first.
There is still some trouble though. In short, ngen seems to be outputting a bogus line at the end of one of the terminal nexus output files (in particular, ~the one with the largest numeric feature id~ edit: my mistake: the trouble was with tnx-1000000099_output.csv). I'm going to work on debugging that some today.
Bobby, which tnx file are you referring to specifically? I opened tnx-1000000687_output.csv (the one with the largest numeric id). The last line looked normal to me.
@yuqiong77, we were having issues with tnx-1000000099_output.csv. There is an extra line with the contents `0, 4.08443` at the end of the file.
Thanks! I see that now. Although the last line in my file tnx-1000000099_output.csv looks a bit different:
743, 2012-10-31 23:00:00, 1.22727 .53858
For sure, @yuqiong77! Well, that is odd. I am just jumping back into this thread, so I am not sure whether @robertbartel was using a different set of forcing data than you are for your simulations. With the modifications @robertbartel suggested making to the t-route config, were you able to get a full end-to-end run of NextGen working? Or are you still running into the divide-by-zero error?
Probably ignore this, just documenting it because it is related. As @robertbartel found out yesterday, the extra line in the tnx- csv file mentioned in my previous comment is the source of an InvalidIndexError that gets thrown by t-route (see collapsed stack trace).
In short, t-route is trying to concatenate pandas DataFrames by row. Each DataFrame is indexed by feature_id (so the 1000000099 in tnx-1000000099_output.csv); however, because of the added line mentioned above, 1000000099 ends up being an index value twice, and pandas refuses to concatenate DataFrames when it has to realign a non-unique index.
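A tiny illustration of that pandas behavior (illustrative only; the exact shape/axis used inside build_qlateral_array may differ):

```python
# Illustrative only: once a feature_id appears twice in a frame's index, any
# concat that has to realign that index raises InvalidIndexError.
import pandas as pd

dup = pd.DataFrame({"q": [1.0, 2.0]},
                   index=pd.Index([1000000099, 1000000099], name="feature_id"))
ok = pd.DataFrame({"q": [3.0]},
                  index=pd.Index([1000000687], name="feature_id"))

pd.concat([dup, ok], axis=1)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```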
@aaraney I just tested with the config file that @robertbartel posted (I think my config had the binary_nexus_file_foler line commented out). The divide-by-zero error is gone. Now I'm getting the same InvalidIndexError message you posted above.
2024-01-09 16:50:42,859 INFO [AbstractNetwork.py:125 - assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ...
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): InvalidIndexError: Reindexing only valid with uniquely valued Index objects
At:
/usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3875): get_indexer
/usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(676): get_result
/usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(393): concat
/usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(611): build_qlateral_array
/usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(127): assemble_forcings
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(121): main_v04
@yuqiong77, well at least we are having issues in the same place! From the directory where the NextGen output files are, can you please run `find . -name "*_output.csv" -exec awk -F ',' 'NF != 3 {print FILENAME}' {} ';'`? This should tell you which output files have spurious lines.
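If awk isn't handy, here is a rough Python equivalent (a sketch assuming the same three-field CSV format):

```python
# Rough Python equivalent of the awk one-liner above: print each NextGen
# *_output.csv containing any line that doesn't have exactly three
# comma-separated fields (empty lines are flagged too).
from pathlib import Path

for path in Path(".").rglob("*_output.csv"):
    with open(path) as f:
        if any(len(line.rstrip("\n").split(",")) != 3 for line in f):
            print(path)
```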
@aaraney Thanks! I found a total of 38 tnx output files with spurious lines. These lines can appear anywhere within the file, not just at the end. Some of the spurious lines are entirely empty.
Thanks, @yuqiong77! That is so weird, but at least we are starting to better understand what is going on. I'm getting a local debugging environment set up so I can try to reproduce this myself.
Made more progress, but ran into another bug in t-route that was giving me a ValueError: could not convert string to float: error. The full stack trace is posted below. I found a solution to the problem, but I don't know if it is the right way to fix it. I opened https://github.com/NOAA-OWP/t-route/issues/724 to ask the t-route team if that is the right way to fix it.
Sean, confirmed https://github.com/NOAA-OWP/t-route/issues/724 was an issue. Just pushed up a fix https://github.com/NOAA-OWP/t-route/pull/725.
edit: fix has been merged.
@yuqiong77, so after resolving the above issues I was able to run a month-long simulation over HUC 1 with routing enabled, using NextGen compiled in serial mode (no MPI support). @robertbartel discovered that the spurious additional lines in the tnx- files only appear when NextGen is compiled in parallel mode (MPI turned on). We will continue to investigate the issue, but in the short term, please try running the framework in serial mode.
@aaraney Thanks for letting me know. Great progress!
@yuqiong77, minor update this morning. I rebuilt the NextGen image and verified that the ngen-serial binary (this is in /dmod/bin/) works as expected. When you rebuild, I would suggest using the --no-cache flag to docker (i.e. docker build --no-cache <other-things>). This will ensure that you build the fixed version of t-route and the fixes to the cfe module.
After building, you will need to update your NextGen realization config file. The cfe module changed the name of one of its input variables: ice_fraction_xinan -> ice_fraction_xinanjiang. So, you will need to update cfe's variable_names_map in the realization config. If I did a poor job communicating that, see an example of the change you will need to make in the data/example_bmi_multi_realization_config.json file in this PR!
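For realization configs that already exist, a one-off substitution along these lines could also work (hypothetical helper, assuming the old name only appears in variable_names_map entries; adjust the glob to wherever your configs live):

```python
# Hypothetical one-off migration: rename the mapped CFE variable
# ice_fraction_xinan -> ice_fraction_xinanjiang in existing realization configs.
from pathlib import Path

for path in Path(".").glob("*realization*.json"):
    text = path.read_text()
    # Skip files already using the new name so the prefix isn't double-replaced.
    if "ice_fraction_xinanjiang" in text:
        continue
    if "ice_fraction_xinan" in text:
        path.write_text(text.replace("ice_fraction_xinan", "ice_fraction_xinanjiang"))
```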
@aaraney Thanks! @robertbartel helped me build the original image on UCS6, so I'm asking him to help rebuild the image to include recent updates.
Just curious, how long did it take you to complete the month-long run over HUC-01 in serial mode? I've always run HUC-01-wide simulations in parallel mode in the past.
👍
~55 minutes on a 4 core virtual machine.
Thanks for all the help. I'm glad to report that I was able to successfully run ngen+t-route for 6 months in serial mode with the updated image that @robertbartel helped build on Friday. The total run time was about 10 hours.
However, whenever I try to run ngen+t-route for more than a year, the program always stops ~9 or 10 months into the run with the following message:
emitted longwave <0; skin T may be wrong due to inconsistent input of SHDFAC with LAI 2147483647 2147483647 SHDFAC= 0.699999988 parameters%VAI= 4.53397274 TV= 289.172455 TG= 245.799484 LWDN= 370.000000 energy%FIRA= -3049.29419 water%SNOWH= 0.00000000 Exiting ...
Have any of you run into this issue before? Tried multiple runs with different starting times and all failed after ~ 9 to 10 months. The issue is related to running the CFE model. Should we open an issue in the relevant branch? Any hint/recommendation is appreciated.
@yuqiong77, glad to hear the serial simulations ran! Sorry to hear you found another issue 😅. I've not run into that issue personally. Do you have an insight @ajkhattak?
Aside, we will probably end up moving this to a CFE issue if that ends up being the case, just trying to capture the full context here. Thanks for the patience, @yuqiong77!
@aaraney @yuqiong77 This issue is not caused by CFE; instead, it is happening in noah-owp-modular (here).
LWDN= 370.000000 energy%FIRA= -3049.29419 (from your error)
FIRE = forcing%LWDN + energy%FIRA (from the code).
If FIRE is negative or zero, it stops execution, so something went wrong with the energy%FIRA calculations (maybe wrong/inconsistent input values at this particular timestep, etc.).
@SnowHydrology
I don't have much to add past what Ahmad said. This particular error is a common one with Noah-MP and its derivatives, e.g.:
Thanks @ajkhattak @SnowHydrology. I will look into the links to see if I can find something useful. At this point, it just seems that the run would exit after a certain number of time steps (~9 to 10 months at an hourly time step), regardless of the starting time.
@yuqiong77 I think it should depend on your starting time. Let's say the problematic time (when it crashes) is 11-11-2020 (November 11th) and you started it on 01-01-2020 (January 1st), so it will run for 11 months. Now if you start on 01-01-2019, it would run for 23 months. Have you checked if your times in the ngen realization and noah-owp-modular input file are consistent?
@ajkhattak Yes, the start and end times in my ngen realization and noah-owp input files were consistent. After many run experiments, I have come to realize that the run always crashes after running for 9 to 10 months, rather than at a fixed time stamp. For example, if the model crashes at 07/01/2013 during a first run that starts from 10/1/2012, it would run past 07/01/2013 in a second run that starts from 06/01/2013 and then crash around a new time, 2014/04/01. It looks as if errors were accumulating in the model as the run progresses until it reached a breaking point ...
@yuqiong77 ah I see, sorry I overlooked your text, so the time it crashes is not exact. But if your start time is fixed, will it crash at the same time every time you rerun? Anyway, does it happen for all catchments in your basin or just one catchment? If we can reproduce this problem on a single catchment, we can debug it easily.
@ajkhattak I'm actually not sure in which catchments this problem has occurred. It is not obvious to me from the stdout messages, since no specific catchment names are mentioned there. Note I'm trying to run ngen for the entire HUC-01 region, which has > 20000 catchments.
@yuqiong77 I understand. The NextGen team might be able to help debug -- at least identify the catchment that is causing the problem. @hellkite500, do we have any verbosity options in the framework that we can set to print some metadata about the simulation state (timestep, catchment ID, model call, etc.) to the screen?
@yuqiong77, in your Noah-OWP namelist files, what setting did you use for dynamic_veg_option?
@yuqiong77 Can you share an example of the namelist you're using?
@aaraney dynamic_veg_option is 4 in the Noah-OWP namelist files, which were from @robertbartel; I only changed the start and end times at the beginning of the files. @SnowHydrology, an example namelist file is attached.
@yuqiong77, I don't have an answer as to why things are breaking. However, it looks like something is going awry in the calculation of the net longwave radiation (W/m2) values IRG, IRB, or IRC.
We know that FIRA is getting set like this (source):
energy%FIRA = parameters%FVEG * energy%IRG + (1.0 - parameters%FVEG) * energy%IRB + energy%IRC
And we know that FIRE = forcing%LWDN + energy%FIRA, or from the output, FIRE = 370.0 + -3049.29419. So, FIRA is the issue, not the forcing.
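Plugging the reported numbers in makes the failure concrete (this just reproduces the arithmetic from the error output, not the model code):

```python
# Reproducing the arithmetic from the error output (not noah-owp-modular code):
LWDN = 370.0          # incoming longwave forcing, W/m^2
FIRA = -3049.29419    # net longwave to atmosphere reported in the abort message
FIRE = LWDN + FIRA    # emitted longwave, as computed in the model
print(FIRE)           # -2679.29419 -> negative, which triggers "emitted longwave <0"
```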
From your error output, we know FVEG (SHDFAC in the output) and the FIRA calculation, so we know:
-3049.29419 = 0.69 * energy%IRG + (1.0 - 0.69) * energy%IRB + energy%IRC
So, something is going wrong in one or more of the net longwave radiation calculations. This is not my domain expertise, so that may be helpful or it might not be. @ajkhattak, does that mean anything to you?
@aaraney That is a mostly correct interpretation of what's happening. A few addendums:
Attempts to run framework-integrated t-route execution are failing. Initially, these were encountering a segmentation fault. After some experimental fix attempts, the errors changed first to a signal 6, then to a signal 7, but t-route still does not run successfully.
The initial suspicion was a problem related to a known NetCDF Python package issue, which is what the early fix attempts tried to address (this may still be the root of what's going on).