NOAA-OWP / DMOD

Distributed Model on Demand infrastructure for OWP's Model as a Service

Failure running t-route in ngen worker image #472

Closed robertbartel closed 6 months ago

robertbartel commented 11 months ago

Attempts to run framework-integrated t-route execution are failing. Initially, these were encountering a segmentation fault. After some experimental fix attempts, the errors changed first to a signal 6, then to a signal 7, but t-route still does not run successfully.

The initial suspicion was a problem related to a known NetCDF Python package issue, which is what the early fix attempts tried to address (this may still be the root of what's going on).

hellkite500 commented 11 months ago

Can you make a pip list of the runtime python env?

robertbartel commented 11 months ago

@hellkite500, sure:

Output of pip list for ngen worker image

```bash
[mpi@env4 ngen]$ pip list
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
Package            Version
------------------ ------------
attrs              23.1.0
black              23.11.0
blosc2             2.3.2
bmipy              2.0.1
certifi            2023.11.17
cftime             1.6.3
click              8.1.7
click-plugins      1.1.1
cligj              0.7.2
Cython             3.0.6
dbus-python        1.2.18
Deprecated         1.2.14
fiona              1.9.5
geopandas          0.14.1
gpg                1.15.1
importlib-metadata 7.0.0
Jinja2             3.1.2
joblib             1.3.2
libcomps           0.1.18
MarkupSafe         2.1.3
msgpack            1.0.7
mypy-extensions    1.0.0
ndindex            1.7
netCDF4            1.6.3
numexpr            2.8.7
numpy              1.26.2
nwm-routing        0.0.0
packaging          23.2
pandas             2.1.4
pathspec           0.11.2
pip                23.0.1
platformdirs       4.1.0
py-cpuinfo         9.0.0
pyarrow            14.0.1
pyproj             3.6.1
python-dateutil    2.8.2
pytz               2023.3.post1
PyYAML             6.0.1
rpm                4.16.1.3
setuptools         53.0.0
shapely            2.0.2
six                1.15.0
systemd-python     234
tables             3.9.2
tomli              2.0.1
toolz              0.12.0
troute.network     0.0.0
troute.routing     0.0.0
typing_extensions  4.8.0
tzdata             2023.3
wheel              0.42.0
wrapt              1.16.0
xarray             2023.11.0
zipp               3.17.0
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)

[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: python3 -m pip install --upgrade pip
```
hellkite500 commented 11 months ago

Can you try with pyarrow 11? Still not sure that underlying issue has been completely addressed upstream.

aaraney commented 11 months ago

Yeah I suspect it is either pyarrow or tables. How are you installing tables?

robertbartel commented 11 months ago

> Yeah I suspect it is either pyarrow or tables. How are you installing tables?

I've tweaked the image to ensure pyarrow 11.0.0 is installed. This was the command to install tables:

env HDF5_DIR=/usr pip3 install --no-cache-dir --no-build-isolation tables

I may be installing t-route incorrectly somehow, as I'm getting this error now. I'll continue looking into it.

FAIL: Unable to import a supported routing module.
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ModuleNotFoundError: No module named 'troute.config'

At:
  /usr/local/lib/python3.9/site-packages/nwm_routing/input.py(10): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(17): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load
hellkite500 commented 11 months ago

There is a new package/step needed with recent versions of t-route.

robertbartel commented 11 months ago

As an aside, I still have trouble installing the netCDF4 Python package. I can make the image work with v1.6.3 if I use the binary package, but if I ever try to build it (even going the route of cloning the source tree) the build dependencies won't properly bring in mpi4py.

I don't think at this point that's contributing to the primary error, but it could be an issue later.

hellkite500 commented 11 months ago

https://github.com/CIROH-UA/NGIAB-CloudInfra/blob/main/docker%2FDockerfile.t-route#L56

aaraney commented 11 months ago

Yeah it looks like troute.config is not being installed by t-route's install script. You can install it with:

pip install "git+https://github.com/noaa-owp/t-route@master#egg=troute_config&subdirectory=src/troute-config"
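
For reference, one quick way to confirm whether the modules named in the traceback are visible to the image's interpreter is a small standard-library check like the sketch below. The module names are taken from the error output and stack traces in this thread; `is_importable` is a hypothetical helper, not part of t-route or DMOD.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the interpreter can locate module_name without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError (an ImportError) when a parent
        # package such as "troute" is itself missing.
        return False


# Names taken from the traceback and stack traces in this thread.
for name in ("troute.config", "troute.routing", "nwm_routing"):
    print(f"{name}: {'found' if is_importable(name) else 'MISSING'}")
```

Running this inside the worker container, with the same Python the framework embeds, would show whether troute.config is the only missing piece.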

aaraney commented 11 months ago

Sorry, was AFK. Just looked at the install script and it looks like it should be installing troute.config.

aaraney commented 11 months ago

@robertbartel, are you checking out a specific commit or branch?

robertbartel commented 10 months ago

I may have the issues fixed in the image to get t-route working, though now I am running into some peculiar configuration validation errors:

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ValidationError: 5 validation errors for Config
compute_parameters -> data_assimilation_parameters -> streamflow_da -> lastobs_output_folder
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> streamflow_da -> wrf_hydro_lastobs_file
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> gage_lakeID_crosswalk_file
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> reservoir_persistence_usace
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> reservoir_persistence_usgs
  extra fields not permitted (type=value_error.extra)

@yuqiong77 provided the original config I was using for testing. I don't have enough experience with t-route to sanity check things beyond ~not seeing these "extra fields" in the t-route config documentation~ (correction, they are in the example file ... I'll need to dig some more on that), but they are specific enough for me to remain a bit uncertain.

Regardless, I am at least going to tweak the configuration and run tests until I get a successful job completion.

yuqiong77 commented 10 months ago

Happy New Year! I'm pressed for time to complete some multi-year streamflow simulation runs (either within the ngen image Bobby has helped build or as a post-processing step) for my AMS presentation. My sincerest thanks to you all for looking into the t-route issue.

robertbartel commented 10 months ago

I'm going to put together at least a draft PR for this to build images for @yuqiong77, but I'm still running into an error. It does appear to be a more t-route-specific problem - perhaps still related to the configuration - and not one with the image.

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  AttributeError: 'NoneType' object has no attribute 'get'

At:
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(32): read_geopkg
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(154): read_geo_file
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(253): __init__
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(80): main_v04

Doing some limited checking, it looks like this is implying data_assimilation_parameters are None from the start, but there is at least something configured (and uncommented) in that section of the t-route config file I'm using. Again, we've gone outside my expertise, and perhaps the configuration simply needs some adjustment.

yuqiong77 commented 10 months ago

@robertbartel Thanks. I also suspect the config I used (which was based on an example found in the t-route repository a few weeks ago) may have some issues. The example config file looks quite different from the t-route config files I used back in 2022, which did not have a data assimilation section.

Looking at the DA section of the current config, I think the only line that may cause an issue is the following:

lastobs_output_folder : lastobs/

What if we comment out that line?

robertbartel commented 10 months ago

There seem to be at least some t-route problems contributing to this, which I've opened issue NOAA-OWP/t-route#719 to track.

robertbartel commented 10 months ago

I think the problems in part are due to using a troute v3.0 config with troute v4.0 execution. If I tweak part of the data_assimilation_parameters config like this:

        reservoir_da:
            #----------
            reservoir_persistence_da:
              reservoir_persistence_usgs  : False
              reservoir_persistence_usace : False

Then I get past the earlier attribute and validation errors, although now I run into this/these:

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  KeyError: 'downstream'

At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3798): get_loc
  /usr/local/lib64/python3.9/site-packages/pandas/core/frame.py(3893): __getitem__
  /usr/local/lib/python3.9/site-packages/geopandas/geodataframe.py(1474): __getitem__
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(352): preprocess_network
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(269): __init__
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(80): main_v04

/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 8 leaked semlock objects to clean up at shutdown
  warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown
  warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_22_7605d254530b40fc919513833b8b0a71_79e4edc7c8d34a3cbd66375e0821ed87: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_22_7605d254530b40fc919513833b8b0a71_50b6587dc9f240e98727d944908e48e2: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")
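
For reference, the reservoir_da tweak shown earlier in this comment can be sketched as a small transformation on the parsed config. This is only an illustration under stated assumptions: the target layout is copied from the snippet above, `nest_reservoir_persistence` is a hypothetical helper, and the thread does not say where the other flagged keys (such as gage_lakeID_crosswalk_file) belong in a v4 config, so they are left untouched.

```python
def nest_reservoir_persistence(reservoir_da):
    """Move the flat persistence flags under 'reservoir_persistence_da'.

    Takes the parsed 'reservoir_da' mapping and returns a new dict; the
    input is not modified.
    """
    flat_keys = ("reservoir_persistence_usgs", "reservoir_persistence_usace")
    updated = {k: v for k, v in reservoir_da.items() if k not in flat_keys}
    moved = {k: reservoir_da[k] for k in flat_keys if k in reservoir_da}
    if moved:
        # Preserve anything already nested, then add the moved flags.
        nested = dict(updated.get("reservoir_persistence_da") or {})
        nested.update(moved)
        updated["reservoir_persistence_da"] = nested
    return updated


old_style = {
    "reservoir_persistence_usgs": False,
    "reservoir_persistence_usace": False,
}
new_style = nest_reservoir_persistence(old_style)
print(new_style)
```

A config that already uses the nested layout passes through unchanged.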
yuqiong77 commented 10 months ago

Hi Bobby,

Thanks for figuring out the mismatch between the t-route config and execution. I found a v4 example of the config in the repository:

v4 example

Based on that, I modified my config file on UCS6:

/local/model_as_a_service/yuqiong/data/troute_config.yaml

I now get the following error:

Finished 744 timesteps.
creating supernetwork connections set
2024-01-09 00:37:13,738 INFO [AbstractNetwork.py:489 - create_independent_networks()]: organizing connections into reaches ...
2024-01-09 00:37:13,785 DEBUG [AbstractNetwork.py:518 - create_independent_networks()]: reach organization complete in 0.04627418518066406 seconds.
2024-01-09 00:37:13,785 INFO [AbstractNetwork.py:646 - initial_warmstate_preprocess()]: setting channel initial states ...
2024-01-09 00:37:13,785 DEBUG [AbstractNetwork.py:701 - initial_warmstate_preprocess()]: channel initial states complete in 0.0003256797790527344 seconds.
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ZeroDivisionError: division by zero

At:
  /usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(801): build_forcing_sets
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(108): main_v04

Any hint?

robertbartel commented 10 months ago

Indeed, I encountered the ZeroDivisionError as well. I made some further modifications to the config - mostly under forcing_parameters - to get to the config I'll attach here. I think at this point the troute config is valid and the Docker image is built properly (with respect to troute). Note that I had to compress it to get it to attach, so you'll need to gunzip it first.

troute_config.yaml.gz

There is still some trouble though. In short, ngen seems to be outputting a bogus line at the end of one of the terminal nexus output files (in particular, ~the one with the largest numeric feature id~ edit: my mistake: the trouble was with tnx-1000000099_output.csv). I'm going to work on debugging that some today.

yuqiong77 commented 10 months ago

Bobby, which tnx file are you referring to specifically? I opened tnx-1000000687_output.csv (the one with the largest numeric id). The last line looked normal to me.

aaraney commented 10 months ago

@yuqiong77, we were having issues with tnx-1000000099_output.csv. There is an extra line with the contents 0, 4.08443 at the end of the file.

yuqiong77 commented 10 months ago

Thanks! I see that now. Although the last line in my file tnx-1000000099_output.csv looks a bit different:

743, 2012-10-31 23:00:00, 1.22727 .53858

aaraney commented 10 months ago

For sure, @yuqiong77! Well that is odd. I am just jumping back into this thread, so I am not sure if @robertbartel was using a different set of forcing data than you are for your simulations. With the modifications @robertbartel suggested making to the t-route config, were you able to get a full end-to-end run of NextGen working? Or are you still running into the divide-by-zero error?

aaraney commented 10 months ago

Probably ignore this, just documenting it because it is related. As @robertbartel found out yesterday, the extra line in the tnx- csv file mentioned in my previous comment is the source of an InvalidIndexError that gets thrown by t-route (see collapsed stack trace).

stack trace

```shell
2024-01-08 20:00:06,620 INFO [AbstractNetwork.py:125 - assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ...
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  InvalidIndexError: Reindexing only valid with uniquely valued Index objects

At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3875): get_indexer
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(676): get_result
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(393): concat
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(611): build_qlateral_array
  /usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(127): assemble_forcings
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(121): main_v04
```

In short, t-route is trying to concatenate pandas DataFrames by row. Each DataFrame is indexed by feature_id (so the 1000000099 in tnx-1000000099_output.csv), however because of the added line mentioned above, 1000000099 ends up being an index value twice. Pandas cannot concatenate by row DataFrames with non-unique index values.
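
A minimal sketch of how a duplicated index label can trigger this pandas error. The thread does not show the exact `pd.concat` call t-route makes, so the column-wise concat, the column names, and the flow values below are assumptions for illustration only; `concat_error` is a hypothetical helper. The point is that when one frame carries the same label twice and pandas has to align it against the other frames, the alignment step fails.

```python
import pandas as pd
from pandas.errors import InvalidIndexError


def concat_error(frames):
    """Return the exception class name raised by an aligned concat, or None."""
    try:
        pd.concat(frames, axis=1)
    except (InvalidIndexError, ValueError) as exc:
        return type(exc).__name__
    return None


# Illustrative frames: 'bad' has the label 1000000099 twice, mimicking the
# effect of the spurious extra line in tnx-1000000099_output.csv.
good = pd.DataFrame({"qlat_a": [0.5, 0.7]}, index=[1000000098, 1000000099])
bad = pd.DataFrame({"qlat_b": [1.2, 4.08443]}, index=[1000000099, 1000000099])

print(concat_error([good, bad]))

# Dropping the repeated label lets the same concat succeed. This is shown only
# to isolate the cause; it is not the fix adopted in this thread.
deduped = bad[~bad.index.duplicated(keep="first")]
print(concat_error([good, deduped]))
```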

yuqiong77 commented 10 months ago

@aaraney I just tested with the config file that @robertbartel posted (I think my config had the binary_nexus_file_folder line commented out). The divide-by-zero error is gone. Now I'm getting the same InvalidIndexError message you posted above.

2024-01-09 16:50:42,859 INFO [AbstractNetwork.py:125 - assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ...
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  InvalidIndexError: Reindexing only valid with uniquely valued Index objects

At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3875): get_indexer
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(676): get_result
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(393): concat
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(611): build_qlateral_array
  /usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(127): assemble_forcings
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(121): main_v04

aaraney commented 10 months ago

@yuqiong77, well at least we are having issues in the same place! From the directory where the NextGen output files are, can you please run find . -name "*_output.csv" -exec awk -F ',' 'NF != 3 {print FILENAME}' {} ';'? This should list the output files that have spurious lines.
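
For anyone who prefers Python to awk, the same check can be sketched with the standard library. This is an approximation of the command above, not a drop-in replacement: it reports each offending file once instead of once per bad line, and the demo file names and rows are made up for illustration.

```python
import tempfile
from pathlib import Path


def files_with_spurious_lines(root, expected_fields=3):
    """Return *_output.csv files under root with any line whose
    comma-separated field count differs from expected_fields."""
    flagged = []
    for path in sorted(Path(root).rglob("*_output.csv")):
        with path.open() as handle:
            for line in handle:
                # An empty line splits into one empty field, so it is flagged too.
                if len(line.rstrip("\n").split(",")) != expected_fields:
                    flagged.append(path)
                    break
    return flagged


# Demo with two hypothetical files: one clean, one with a stray two-field line.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "tnx-1_output.csv").write_text("0, 2012-10-01 00:00:00, 1.0\n")
    Path(tmp, "tnx-2_output.csv").write_text(
        "0, 2012-10-01 00:00:00, 1.0\n0, 4.08443\n"
    )
    flagged_names = [p.name for p in files_with_spurious_lines(tmp)]

print(flagged_names)
```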

yuqiong77 commented 10 months ago

@aaraney Thanks! I found a total of 38 tnx output files with spurious lines. These lines can appear anywhere within the file, not just at the end of the file. Some of the spurious lines are entirely empty.

aaraney commented 10 months ago

Thanks, @yuqiong77! That is so weird, but at least we are starting to better understand what is going on. I'm getting a local debugging environment set up so I can try to reproduce this myself.

aaraney commented 10 months ago

Made more progress, but ran into another bug in t-route that was giving me a ValueError: could not convert string to float: error. The full stack trace is posted below. I found a solution to the problem, but I don't know if it is the right way to fix it. I opened https://github.com/NOAA-OWP/t-route/issues/724 to ask the t-route team if that is the right way to fix it.

Stack Trace

```shell
2024-01-10 22:34:26,847 INFO [AbstractNetwork.py:489 - create_independent_networks()]: organizing connections into reaches ...
2024-01-10 22:34:26,867 DEBUG [AbstractNetwork.py:518 - create_independent_networks()]: reach organization complete in 0.02008223533630371 seconds.
2024-01-10 22:34:26,867 INFO [AbstractNetwork.py:646 - initial_warmstate_preprocess()]: setting channel initial states ...
2024-01-10 22:34:26,867 DEBUG [AbstractNetwork.py:701 - initial_warmstate_preprocess()]: channel initial states complete in 0.0001633167266845703 seconds.
Reformating qlat nexus files as hourly binary files...
2024-01-10 22:35:30,085 INFO [AbstractNetwork.py:125 - assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ...
2024-01-10 22:35:30,188 DEBUG [AbstractNetwork.py:131 - assemble_forcings()]: lateral inflow DataFrame creation complete in 0.10307741165161133 seconds.
2024-01-10 22:35:30,189 INFO [__main__.py:1071 - nwm_route()]: executing routing computation ...
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ValueError: could not convert string to float: '717802,717794'

At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/dtypes/astype.py(134): _astype_nansafe
  /usr/local/lib64/python3.9/site-packages/pandas/core/dtypes/astype.py(183): astype_array
  /usr/local/lib64/python3.9/site-packages/pandas/core/dtypes/astype.py(245): astype_array_safe
  /usr/local/lib64/python3.9/site-packages/pandas/core/internals/blocks.py(616): astype
  /usr/local/lib64/python3.9/site-packages/pandas/core/internals/managers.py(354): apply
  /usr/local/lib64/python3.9/site-packages/pandas/core/internals/managers.py(414): astype
  /usr/local/lib64/python3.9/site-packages/pandas/core/generic.py(6534): astype
  /usr/local/lib64/python3.9/site-packages/troute/routing/compute.py(308): compute_nhd_routing_v02
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(1074): nwm_route
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(162): main_v04
Aborted
[mpi@60a692ceb33e ngen]$ /usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 8 leaked semlock objects to clean up at shutdown
  warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown
  warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_4302_b5d96e5242634ecebabf6726d4eedf09_42bbe0c41a7a4dcc9585a290ea25338b: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_4302_b5d96e5242634ecebabf6726d4eedf09_3a932d10320e42648d9eaf66569e80a5: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")
```
aaraney commented 10 months ago

Sean confirmed that https://github.com/NOAA-OWP/t-route/issues/724 was an issue. Just pushed up a fix: https://github.com/NOAA-OWP/t-route/pull/725.

edit: fix has been merged.

aaraney commented 10 months ago

@yuqiong77, after resolving the above issues, I was able to run a month-long simulation over HUC 1 with routing enabled, using NextGen compiled in serial mode (no MPI support). @robertbartel discovered that the spurious additional lines in tnx- files only appear when NextGen is compiled in parallel mode (MPI turned on). We will continue to investigate the issue, but in the short term, please try running the framework in serial mode.

yuqiong77 commented 10 months ago

@aaraney Thanks for letting me know. Great progress!

aaraney commented 10 months ago

@yuqiong77, minor update this morning. I rebuilt the NextGen image this morning and verified that the ngen-serial binary (this is in /dmod/bin/) works as expected. When you rebuild, I would suggest using the --no-cache flag to docker (i.e. docker build --no-cache <other-things>). This will ensure that you build the fixed version of t-route and the fixes to the cfe module.

After building, you will need to update your NextGen realization config file. The cfe module changed the name of one of its input variables: ice_fraction_xinan -> ice_fraction_xinanjiang. So, you will need to update the cfe's variable_names_map in the realization config. If I did a poor job communicating that, see an example of the change you will need to make in the data/example_bmi_multi_realization_config.json file in this PR!
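
If there are many realization configs to update, the rename can be sketched as a small recursive pass over the parsed JSON. This is a hedged illustration: `rename_variable` is a hypothetical helper, the fragment below is a made-up stand-in for a real realization config, and because the thread does not say whether the old name appears as a key or as a value in the map, the sketch replaces it in either position.

```python
import json

OLD_NAME = "ice_fraction_xinan"
NEW_NAME = "ice_fraction_xinanjiang"


def rename_variable(node):
    """Return a copy of a parsed JSON structure with OLD_NAME replaced by
    NEW_NAME wherever it appears as a dict key or as a string value."""
    if isinstance(node, dict):
        return {
            (NEW_NAME if key == OLD_NAME else key): rename_variable(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_variable(item) for item in node]
    if node == OLD_NAME:
        return NEW_NAME
    return node


# Hypothetical fragment; a real realization config has much more structure.
fragment = json.loads(
    '{"variable_names_map": {"ice_fraction_xinan": "placeholder_source_name"}}'
)
updated = rename_variable(fragment)
print(json.dumps(updated, indent=2))
```

It is worth diffing the result against the example config in the PR mentioned above before using it for a run.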

yuqiong77 commented 10 months ago

@aaraney Thanks! @robertbartel helped me build the original image on UCS6, so I'm asking him to help rebuild the image to include recent updates.

Just curious, how long did it take you to complete the month-long run over HUC-01 in serial mode? I've always run HUC-01-wide simulations in parallel mode in the past.

aaraney commented 10 months ago

👍

~55 minutes on a 4 core virtual machine.

yuqiong77 commented 10 months ago

Thanks for all the help. I'm glad to report that I was able to successfully run ngen+t-route for 6 months in serial mode with the updated image that @robertbartel helped build on Friday. The total run time was about 10 hours.

However, whenever I tried to run ngen+t-route for more than a year, the program always stops at ~ 9 or 10 months into the run with the following message:

emitted longwave <0; skin T may be wrong due to inconsistent input of SHDFAC with LAI 2147483647 2147483647 SHDFAC= 0.699999988 parameters%VAI= 4.53397274 TV= 289.172455 TG= 245.799484 LWDN= 370.000000 energy%FIRA= -3049.29419 water%SNOWH= 0.00000000 Exiting ...

Have any of you run into this issue before? Tried multiple runs with different starting times and all failed after ~ 9 to 10 months. The issue is related to running the CFE model. Should we open an issue in the relevant branch? Any hint/recommendation is appreciated.

aaraney commented 10 months ago

@yuqiong77, glad to hear the serial simulations ran! Sorry to hear you found another issue 😅. I've not run into that issue personally. Do you have any insight, @ajkhattak?

Aside, we will probably end up moving this to a CFE issue if that ends up being the case, just trying to capture the full context here. Thanks for the patience, @yuqiong77!

ajkhattak commented 10 months ago

@aaraney @yuqiong77 This issue is not caused by CFE; instead, it is happening in noah-owp-modular (here).

LWDN= 370.000000 energy%FIRA= -3049.29419 (from your error)
FIRE = forcing%LWDN + energy%FIRA (from the code).
If FIRE is negative or zero, it stops execution, so something went wrong with energy%FIRA calculations 
(maybe wrong/inconsistent input values at this particular timestep, etc.). 
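
The arithmetic above can be checked directly. A minimal sketch using only the two values printed in the error output; `emitted_longwave` is a hypothetical helper name, and the stop condition follows the description in this comment, not a reading of the Fortran source.

```python
def emitted_longwave(lwdn, fira):
    """FIRE = LWDN + FIRA, as described in the comment above."""
    return lwdn + fira


LWDN = 370.000000      # downward longwave forcing from the error output
FIRA = -3049.29419     # net longwave term from the error output

fire = emitted_longwave(LWDN, FIRA)
would_stop = fire <= 0.0  # the model exits when FIRE is negative or zero

print(f"FIRE = {fire:.5f}, model stops: {would_stop}")
```

FIRA would need to be greater than -370 for this forcing value to keep FIRE positive.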

@SnowHydrology

SnowHydrology commented 10 months ago

I don't have much to add past what Ahmad said. This particular error is a common one with Noah-MP and its derivatives, e.g.:

yuqiong77 commented 10 months ago

Thanks @ajkhattak @SnowHydrology. I will look into the links to see if I can find something useful. At this point, it just seems that the run would exit after a certain number of time steps (~ 9 to 10 months, hourly time step), regardless of the starting time.

ajkhattak commented 10 months ago

@yuqiong77 I think it should depend on your starting time. Let's say the problematic time (when it crashes) is 11-11-2020 (November 11th) and you started it on 01-01-2020 (January 1st), so it will run for 11 months. Now if you start on 01-01-2019, it would run for 23 months. Have you checked if your times in the ngen realization and noah-owp-modular input file are consistent?

yuqiong77 commented 10 months ago

@ajkhattak Yes, the start and end times in my ngen realization and noah-owp input files were consistent. After many run experiments, I have come to realize that the run would always crash after running for 9 to 10 months, rather than at a fixed timestamp. For example, if the model crashes at 07/01/2013 during a first run that starts from 10/1/2012, it would run past 07/01/2013 in a second run that starts from 06/01/2013 and then crash around a new time, 2014/04/01. It looks as if errors were accumulating in the model as the run progresses until it reached a breaking point ...
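
To put numbers on the example above, a short sketch that converts the two start and crash dates into elapsed days and hourly time steps. The crash dates are approximate in the comment ("around"), so the counts are rough; `elapsed` is a hypothetical helper.

```python
from datetime import datetime


def elapsed(start, crash):
    """Return (days, hourly_steps) between two ISO dates."""
    delta = datetime.fromisoformat(crash) - datetime.fromisoformat(start)
    return delta.days, delta.days * 24


first_run = elapsed("2012-10-01", "2013-07-01")
second_run = elapsed("2013-06-01", "2014-04-01")

print("first run :", first_run)
print("second run:", second_run)
```

Both runs fail after roughly 270 to 300 days of simulated time, while the crash dates themselves are nine months apart, which is consistent with the failure tracking run length and not the calendar.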

ajkhattak commented 10 months ago

@yuqiong77 ah I see, sorry I overlooked your text, so the time it crashes is not exact. But what if your start time is fixed, will it crash at the same time every time you rerun? Anyway, does it happen for all catchments in your basin or just one catchment? If we can reproduce this problem on a single catchment, we can debug it easily

yuqiong77 commented 10 months ago

@ajkhattak I'm actually not sure at which catchments this problem has occurred. It is not obvious to me from the standard output messages, since no specific catchment names are mentioned there. Note I'm trying to run ngen for the entire HUC-01 region, which has > 20000 catchments.

ajkhattak commented 10 months ago

@yuqiong77 I understand it. Nextgen team might be able to help debug -- at least identify the catchment that is causing the problem. @hellkite500 do we have any verbosity options in the framework that we can set to screen output some metadata about the simulation state (timestep, catchment ID, model call, etc.)?

aaraney commented 10 months ago

@yuqiong77, in your noaa owp namelist files, what setting did you use for dynamic_veg_option?

SnowHydrology commented 10 months ago

@yuqiong77 Can you share an example of the namelist you're using?

yuqiong77 commented 10 months ago

@aaraney dynamic_veg_option is 4 in the noaa owp namelist files, which were from @robertbartel and I only changed the start and end times at the beginning of the files. @SnowHydrology an example of the namelist file is attached.

NoahOWP_cat-17811.namelist.txt

aaraney commented 10 months ago

@yuqiong77, I don't have an answer as to why things are breaking. However, it looks like something is going awry in the calculation of the net longwave radiation (W/m2) values IRG, IRB or IRC.

We know that FIRA is getting set like this (source):

energy%FIRA  = parameters%FVEG * energy%IRG + (1.0 - parameters%FVEG) * energy%IRB + energy%IRC

And we know that FIRE = forcing%LWDN + energy%FIRA or from the output FIRE = 370.0 + -3049.29419. So, FIRA is the issue, not the forcing.

From your error output, we know FVEG (SHDFAC in output) and the FIRA calculation so we know:

-3049.29419 = 0.70 * energy%IRG + (1.0 - 0.70) * energy%IRB + energy%IRC

So, something is going wrong in one or more of the net longwave radiation calculations. This is not my domain expertise, so that may be helpful or it might not be. @ajkhattak, does that mean anything to you?
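
One thing the equation does pin down, assuming FVEG lies between 0 and 1: the three weights (FVEG, 1 - FVEG, and 1) sum to 2, so at least one of IRG, IRB, or IRC must be at or below FIRA / 2. A minimal sketch of that bound follows; the function names are hypothetical and the only inputs are values printed in the error output.

```python
def fira(fveg, irg, irb, irc):
    """Weighted net longwave sum, as in the FIRA expression quoted above."""
    return fveg * irg + (1.0 - fveg) * irb + irc


def smallest_term_upper_bound(fira_value):
    """With weights summing to 2, min(IRG, IRB, IRC) cannot exceed FIRA / 2."""
    return fira_value / 2.0


FIRA_REPORTED = -3049.29419   # energy%FIRA from the error output
FVEG_REPORTED = 0.699999988   # printed as SHDFAC in the error output

bound = smallest_term_upper_bound(FIRA_REPORTED)

# Sanity check: if all three terms sat exactly at the bound, the weighted sum
# would reproduce the reported FIRA.
check = fira(FVEG_REPORTED, bound, bound, bound)

print(f"at least one of IRG, IRB, IRC <= {bound:.2f} W/m2")
```

So at least one term is below roughly -1500 W/m2, far larger in magnitude than the 370 W/m2 downward longwave forcing in the same output. The thread does not establish which term it is.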

SnowHydrology commented 10 months ago

@aaraney That is a mostly correct interpretation of what's happening. A few addendums: