Failure running t-route in ngen worker image

robertbartel commented 9 months ago

Attempts to run framework-integrated t-route execution are failing. Initially, these were encountering a segmentation fault. After some experimental fix attempts, the errors changed first to a signal 6, then to a signal 7, but t-route still does not run successfully.

The initial suspicion was a problem related to a known NetCDF Python package issue, which is what the early fix attempts tried to address (this may still be the root of what's going on).

ajkhattak commented 8 months ago

I think I agree with Keith. I was also thinking about this huge temperature gradient, but that may be less of an issue here. I wonder if the error will go away (or happen at a different time) if @yuqiong77 changes the dynamic_veg_option from 4 to 1:

1 -> off (use table LAI; use FVEG = SHDFAC from input)
4 -> off (use table LAI; use maximum vegetation fraction)

yuqiong77 commented 8 months ago

Thanks all. I was checking my namelist files again and realized that the forcing_filename & output_filename are not set correctly. Take NoahOWP_cat-12602.namelist as an example. Currently the file has:

forcing_filename = '.'
output_filename = '.'

Since my running directory is /local/model_as_a_service/yuqiong and my forcing data is actually at /local/model_as_a_service/yuqiong/data/AORC/csv_files, should I set the following:

forcing_filename = '/local/model_as_a_service/yuqiong/data/AORC/csv_files/cat-12602.csv'

Not sure about the "output_filename". Should I set it to a catchment specific file name as well, e.g.:

output_filename = '/local/model_as_a_service/yuqiong/noahowp_output/out_cat-12602.csv'

yuqiong77 commented 8 months ago

Also, a quick question: does the output from Noah-OM get passed to modules like CFE and Topmodel via the BMI? If so, what is output_filename (as defined in the namelist file) used for?

ajkhattak commented 8 months ago

> > forcing_filename = '.'
> > output_filename = '.'

I don't think they matter if you are running in the nextgen framework. You could even point them to any fake or real file, as they are not read or used (but again, only when running in the framework).

Noah-OM does provide precipitation and potential_ET as inputs to CFE (and topmodel too) via BMI. However, the output_filename is not used; I would guess this file is not even generated when running Noah-OM in the nextgen framework. @SnowHydrology can confirm this or correct me if I said something that does not make sense 😊

SnowHydrology commented 8 months ago

@ajkhattak is correct. You could put any string you want in those two entries when running in NextGen. We put compiler directives to skip over the forcing read and output write routines. E.g. https://github.com/NOAA-OWP/noah-owp-modular/blob/04e8ac02532c9a292098f974cdb03aa03bfbfcd6/src/RunModule.f90#L210

yuqiong77 commented 8 months ago

@ajkhattak @SnowHydrology That's great to know! I was checking the source code NamelistRead.f90 and noticed that forcing_filename & output_filename inputs were required, but I did not dig deeper to see how those file names get used in other subroutines.

SnowHydrology commented 8 months ago

@yuqiong77 Were you ever able to track down the exact time and location (basin ID) of the failure? I'd be interested to see the forcing data corresponding to the failure just in case there is anything interesting in the file.

yuqiong77 commented 8 months ago

@SnowHydrology No, I have not been able to track down the exact time and location of the failure. The screen output and error message did not indicate any catchment IDs. What makes the debugging difficult is that the error only occurs 9 to 10 months into the run, which takes close to 20 hours of wall-clock time (in serial mode, since at the moment running in parallel mode produces spurious lines in the ngen output). I'll try to dig a bit deeper to see if I can identify the catchment of the failure.

SnowHydrology commented 8 months ago

@yuqiong77 that's likely because the error print out is coming from Noah-OM, which doesn't know which catchment it's running in. Maybe the output files can indicate where Noah-OM failed?

SnowHydrology commented 8 months ago

Also tagging @GreyEvenson-NOAA here.

The Noah-OM issue is described here originally: https://github.com/NOAA-OWP/DMOD/issues/472#issuecomment-1893759701

yuqiong77 commented 8 months ago

Some progress on identifying problematic catchments ... For 2933 catchments, the ngen outputs contain nan values from the very first time step, e.g.,

TimeStep,Time,RAIN_RATE,DIRECT_RUNOFF,GIUH_RUNOFF,NASH_LATERAL_RUNOFF,DEEP_GW_TO_CHANNEL_FLUX,Q_OUT,POTENTIAL_ET,ACTUAL_ET,GW_STORAGE,SOIL_STORAGE,SOIL_STORAGE_CHANGE,SURF_RUNOFF_SCHEME
0,2013-10-01 00:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000087867,0.000000000,-nan,0.000000000,0.000000000,1.000000000
1,2013-10-01 01:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000076420,0.000000000,-nan,0.000000000,0.000000000,1.000000000
2,2013-10-01 02:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000065125,0.000000000,-nan,0.000000000,0.000000000,1.000000000
3,2013-10-01 03:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000033762,0.000000000,-nan,0.000000000,0.000000000,1.000000000
4,2013-10-01 04:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000032287,0.000000000,-nan,0.000000000,0.000000000,1.000000000
5,2013-10-01 05:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000051751,0.000000000,-nan,0.000000000,0.000000000,1.000000000
6,2013-10-01 06:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000046923,0.000000000,-nan,0.000000000,0.000000000,1.000000000

I checked the forcing files of these catchments and didn't find anything suspicious ... Will keep digging and report back
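For anyone wanting to repeat this kind of scan, here is a minimal sketch. It assumes each catchment has its own ngen output CSV named like `cat-<id>.csv` in a single directory; the directory layout, the glob pattern, and the function names are assumptions for illustration, not taken from the run above.

```python
import csv
import math
from pathlib import Path


def first_row_has_nan(csv_path):
    """Return True if any numeric field in the first data row is NaN."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)        # skip the header line
        row = next(reader, None)  # first time step
    if row is None:
        return False
    for field in row:
        try:
            # float() accepts "nan" and "-nan" as written in the output above
            if math.isnan(float(field)):
                return True
        except ValueError:
            continue  # non-numeric field such as the timestamp
    return False


def catchments_with_nan(output_dir, pattern="cat-*.csv"):
    """List file stems (e.g. 'cat-12602') whose first row contains NaN."""
    return sorted(
        p.stem for p in Path(output_dir).glob(pattern) if first_row_has_nan(p)
    )
```

Checking only the first data row keeps the scan cheap across thousands of files, which fits the case here where the nan values appear from the very first time step.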

SnowHydrology commented 8 months ago

@yuqiong77 are you also saving the Noah-OM outputs? That would help with diagnosing any issues.

yuqiong77 commented 8 months ago

With help from @ajkhattak , I think we have found the issue. The parameter values in the CFE config files for those problematic catchments were not set correctly. Likely there was a bug in the script that I used to populate the parameter values from the regionalization.

ajkhattak commented 8 months ago

But we still need to dig deeper to investigate it further. Unless I am missing something, I don't think that wrong CFE config file inputs caused the Noah-OM FIRE = forcing%LWDN + energy%FIRA error, as the coupling between CFE and NOM is not two-way.

aaraney commented 8 months ago

@yuqiong77 / @ajkhattak was the supposed config issue with the dynamic_veg_option parameter?

yuqiong77 commented 8 months ago

@aaraney I doubt dynamic_veg_option would be an issue. I had successful runs before with dynamic_veg_option set to 4 (same as what we're using now).

yuqiong77 commented 8 months ago

Hi all, just wanted to let you know that Ahmad has been helping me debug, and he was able to run ngen successfully for a year with my realization config (CFE + Noah-OM) and BMI files for HUC-01 catchments. He was also able to run successfully without CFE. He carried out both runs outside of the container. @ajkhattak if I miscommunicated or missed something, please correct me.

But all of my runs (with or without CFE) within the container failed at around 9-10 months, with the same error (negative FIRA) in Noah-OM. So I'm wondering if the issue has something to do with the image that @robertbartel helped build, in particular the Noah-OM module contained in that image?

aaraney commented 8 months ago

Thanks for reporting back @yuqiong77! I was afraid it would be difficult to diagnose. Unfortunately it could be a myriad of things: the version of the Noah-OM code @ajkhattak used, the compiler (gcc vs clang), the optimization level used by the compiler, or even the CPU architecture. @ajkhattak for starters, did you run the experiment on an arm or x86 machine?

SnowHydrology commented 8 months ago

@yuqiong77 Thanks for this update. The error message you got is one of the few checks in Noah-OM that will stop the model. Although the error manifests as "emitted longwave <0; skin T may be wrong due to inconsistent input of SHDFAC with LAI", it can be caused by myriad issues.

robertbartel commented 8 months ago

@yuqiong77, thank you for the info. Just to confirm, were your runs always with serial ngen, or did you also experience the errors running parallel ngen? If you haven't tried a parallel ngen scenario because of the current issues with that and t-route, could you try your configs in a parallel run (with routing removed of course) and see if the error still occurs?

yuqiong77 commented 8 months ago

@robertbartel yes, all my latest runs that failed were in serial mode. My earlier runs in the parallel mode did not go far because of the t-route issue we ran into. I will launch a parallel run without routing for a year and report back.

yuqiong77 commented 8 months ago

@robertbartel The parallel version ran pretty fast, but unfortunately it still failed at around 7300 time steps with the same error.

Running timestep 7300
emitted longwave <0; skin T may be wrong due to inconsistent input of SHDFAC with LAI
2147483647 2147483647 SHDFAC= 0.800000012 parameters%VAI= 4.74794531 TV= 286.935242 TG= -110.499847
LWDN= 366.299988 energy%FIRA= -15046.4004 water%SNOWH= 0.00000000
Exiting ...
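The numbers in that message can be checked by hand. The sketch below parses the `NAME= value` pairs, recomputes FIRE = LWDN + FIRA (the relation quoted earlier in this thread), and converts the time step to a date. The start time of 2013-10-01 00:00:00 and the hourly step are assumptions taken from the ngen output sample posted earlier, and the helper names are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Diagnostic lines copied from the failure message above.
LOG = (
    "2147483647 2147483647 SHDFAC= 0.800000012 parameters%VAI= 4.74794531 "
    "TV= 286.935242 TG= -110.499847\n"
    "LWDN= 366.299988 energy%FIRA= -15046.4004 water%SNOWH= 0.00000000"
)


def parse_diagnostics(text):
    """Extract 'NAME= value' pairs into a dict of floats."""
    pairs = re.findall(r"([A-Za-z%_]+)=\s*(-?\d+(?:\.\d+)?)", text)
    return {name: float(value) for name, value in pairs}


def timestep_to_time(step, start=datetime(2013, 10, 1), step_seconds=3600):
    """Convert a time step index to a timestamp (start and step are assumed)."""
    return start + timedelta(seconds=step * step_seconds)


values = parse_diagnostics(LOG)
fire = values["LWDN"] + values["energy%FIRA"]  # FIRE = LWDN + FIRA
print(f"FIRE = {fire:.1f}")                    # strongly negative
print(f"TG   = {values['TG']}")                # negative, while TV is about 287
print(timestep_to_time(7300))                  # 2014-08-01 04:00:00
```

Under those assumptions, time step 7300 falls about ten months into the run, consistent with the serial failures, and the negative FIRE follows directly from the very large negative FIRA rather than from LWDN.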

ajkhattak commented 8 months ago

Sorry guys, there were some other issues. The use of STOP in Noah-OM terminates the program normally (it stops the execution and returns an exit status of zero). I replaced STOP with call ABORT so that Noah-OM terminates abnormally; my workflow can then catch this abnormal termination and report the problematic catchment ID. The first such catchment I see is cat-2573.
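To illustrate the idea (this is not the actual workflow, just a minimal sketch): a driver script that runs each catchment as a separate process can tell a normal exit from an abort by inspecting the return code. The per-catchment command lines and function names are hypothetical.

```python
import signal
import subprocess


def run_catchment(cat_id, command):
    """Run one catchment's command; return None on success, else a description."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return None
    if result.returncode < 0:
        # On POSIX a negative return code means the process was killed by a
        # signal, e.g. -6 for the SIGABRT raised by abort().
        sig = signal.Signals(-result.returncode).name
        return f"{cat_id}: terminated by {sig}"
    return f"{cat_id}: exited with status {result.returncode}"


def first_failure(commands):
    """commands maps catchment ID -> argv list; return the first failure, if any."""
    for cat_id, command in commands.items():
        failure = run_catchment(cat_id, command)
        if failure is not None:
            return failure
    return None
```

With a plain STOP that exits with status zero, `run_catchment` would report success, which is why the failing catchment could not be identified before the change.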

I am going to test it on the latest Noah-OM master and see if I can reproduce the error.

@GreyEvenson-NOAA I will reach out to you to discuss the debugging further

sorry for any confusion...

GreyREvenson commented 7 months ago

Afternoon all,

I spent some time looking for a problem in the energy balance simulations and the calculation of vegetation temperature and ground (below veg) temperature in EnergyMain and EtFluxModule but didn't find anything.

However, I noticed that in the namelist file that Ahmad gave to me, the soil type is specified as 14, which corresponds to 'water'. The simulation ended successfully -- and with realistic ground temp values -- after changing the soil type to something different (I tried several different non-water soil types). Can someone confirm my observation by changing isltyp to 13 or something else and re-running?

@yuqiong77: Does this catchment need to be simulated with a water soil type? If so, I will look into the matter further as the energy and temperature simulations are partly impacted by the properties of the top soil horizon.
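To see how many catchments are affected, a quick scan of the namelist files could look like the sketch below. It assumes the files are named like `NoahOWP_cat-12602.namelist` (as mentioned earlier in this thread) and that `isltyp` is set on its own line as `isltyp = <integer>`; the exact layout of the namelist and the function names are assumptions.

```python
import re
from pathlib import Path

WATER_SOIL_TYPE = 14  # per the comment above, soil type 14 corresponds to 'water'

# Assumed form: "isltyp = 14" at the start of a line, any letter case.
ISLTYP_RE = re.compile(r"^\s*isltyp\s*=\s*(\d+)", re.IGNORECASE | re.MULTILINE)


def soil_type(namelist_text):
    """Return the isltyp value from namelist text, or None if it is not set."""
    match = ISLTYP_RE.search(namelist_text)
    return int(match.group(1)) if match else None


def namelists_with_water_soil(namelist_dir, pattern="*.namelist"):
    """List namelist file names whose isltyp equals the water soil type."""
    hits = []
    for path in sorted(Path(namelist_dir).glob(pattern)):
        if soil_type(path.read_text()) == WATER_SOIL_TYPE:
            hits.append(path.name)
    return hits
```

The resulting list can be compared against the catchments that fail, for example cat-2573.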

SnowHydrology commented 7 months ago

@robertbartel, this issue might be close-able. @ajkhattak and @GreyEvenson-NOAA tracked down the issue in the Noah-OM namelist and we're working on a fix in the hydrofabric.

Actually, I just noticed, this particular issue has had quite the evolution, so I don't know if the original issue has been solved. The Noah-OM error has been.

robertbartel commented 7 months ago

Thanks @SnowHydrology. The scope did get pretty broad, but I think you are correct in that this can be closed. To be safe though, I want to outline what has been uncovered and the status of each item:

- The original image dependency and build issues, related to the NetCDF Python package
  - Fixed via #474
- Failure running a parallel modeling job with t-route
  - Not directly a DMOD issue
  - Turned out to be a subtle problem with ngen and how it writes output files that throws off t-route
  - Can be worked around by running ngen serially
- Failure running any modeling jobs when the time range reaches a certain length
  - Not directly a DMOD issue
  - Seems like the choice of configured soil type was contributing to unstable behavior of Noah-OM

@aaraney, @yuqiong77, @ajkhattak, is this all correct? Have I missed anything?

hellkite500 commented 7 months ago

@ajkhattak would you be willing to document/describe the workflow on this ngen issue: noaa-owp/ngen#723? I've been thinking about various ways to catch library exits and propagate errors through the model engine stack to capture additional information, and it sounds like you have done something that may be useful in helping formalize a mechanism in the model engine to provide these details.
