SUBBASINFLAG causes NON-CONV error

mee067 commented 5 months ago

I was running the MRB setup masked for the Athabasca sub-basin. After a few years of simulation, it crashed with a NON-CONV error for a grid cell outside the Athabasca domain. Is the routing still run for the whole domain? If so, is there a way to stop that?

Here is the error message:

WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
WARNING: NON-CONVERGENCE AT POINT AT X,Y:    248    71
route yields a negative store2 value at dtmin = mindtmin:   10.0
 It's likely that qo1 is so large that store2 is negative even with qo2=0.0
If this run was started from  shed2flowinit utility, then try lowering the QI,  QO, and STORE ratios until this error is resolved
 Else rerun with a smaller value of dtmin
slurmstepd: error:  mpi/pmix_v4: _errhandler: cnic-giws-cpu-19004-02 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.2053117.4:0]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2053117.4 ON cnic-giws-cpu-19004-02 CANCELLED AT 2024-02-01T16:03:38 ***
srun: error: cnic-giws-cpu-19004-02: tasks 1-31: Killed
Program finished with exit code 137 at: Thu Feb  1 16:03:39 CST 2024

dprincz commented 5 months ago

This is the issue with the ACLASS-based approach. Having SUBBASINFLAG re-RANK everything, which is the goal, would solve this, but the intermediate fix just sets the land-surface component, i.e., GRUs, to zero to disable the HLSS. It was originally implemented this way recognizing that the bulk of the time-step is processed in CLASS. The routing still runs over the full domain.

mee067 commented 5 months ago

I understand. The aim was to reduce run time with minimal effort. Extracting the ddb, distributed parameters, and forcing for the sub-basin in question would require some work, and the scripts we have handle grid-based setups. Reducing the run time would be sufficient if the run completed, but if it crashes, then it's not useful.

But do we need to re-rank all gridcells/subs? Can't we mask out the same grids/subs by setting their Rank to zero such that the routing does not run for those cells?

dprincz commented 5 months ago

You could check if route.f, rerout.f and flowinit.f contain sanity/safety checks to skip cells where DA==0.0, which could be used to similarly mask out areas without having to re-RANK/reconfigure the domain.

mee067 commented 5 months ago

I checked the code and there is no such check, but it can be added if that's the way to go, with DA set to zero along with the GRU fractions. Why do you prefer setting DA=0 over setting RANK = 0, or NEXT = 0? The routing code will not run when NEXT = 0, right? It is already implemented.

dprincz commented 5 months ago

In general, adding NEXT=0 within stride would cause indexing errors. It won't have problems if it skips loop iterations where NEXT==0. If it doesn't explicitly skip a cycle where NEXT==0, then it will introduce 'index out of bounds' errors throughout WATROUTE. Whether by DA or NEXT, we'd have to check that the loops will cycle in those conditions (e.g., similar to the cycle statements that were added to each loop in WATDRN).
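To illustrate the kind of guard being discussed, here is a minimal Python sketch. It is not the WATROUTE code: the function name `route_step`, the `outflow_frac` parameter, and the simple linear-reservoir update are all assumptions made for illustration. It assumes cells are stored in RANK order (upstream first), that NEXT holds the 1-based RANK of the receiving cell, and that masked cells carry DA == 0.

```python
def route_step(next_rank, da, storage, outflow_frac=0.5):
    """One routing pass over cells stored in RANK order (upstream first).

    next_rank[i] is the 1-based RANK of the cell receiving flow from cell i,
    or 0 when the cell has no receiver. da[i] is the drainage area; cells
    masked out of the run are assumed to carry da == 0.0.
    The linear-reservoir update is a placeholder, not the WATROUTE scheme.
    """
    inflow = [0.0] * len(storage)
    for i in range(len(storage)):
        # Guard, playing the role of a Fortran `cycle`: skip masked cells and
        # cells with no receiver. Without the NEXT check, next_rank[i] - 1
        # would be -1 here (index 0 in a 1-based Fortran array, out of bounds).
        if da[i] <= 0.0 or next_rank[i] == 0:
            continue
        q = outflow_frac * (storage[i] + inflow[i])
        storage[i] += inflow[i] - q
        inflow[next_rank[i] - 1] += q
    return storage, inflow


# Cell 1 is masked (DA = 0), cell 2 drains to cell 3, cell 3 has no receiver.
storage, inflow = route_step([3, 3, 0], [0.0, 1.0, 2.0], [10.0, 4.0, 0.0])
```

The masked cell keeps its storage untouched and contributes nothing downstream, which is the behaviour the DA or NEXT checks would need to guarantee in every routing loop.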

mee067 commented 5 months ago

I think the issue might be limited to lakes. The point that caused the issue is just downstream of a lake outlet. The lake outflow is negative from the start of the main simulation. Not sure why it only crashed after a while. If it is only related to lakes, one could remove them, but that requires editing the ddb and the reservoir file. The idea of this flag is to use everything as is, except the output gauge locations that are used for masking. Further editing defeats the purpose, in my view.

Either way, I will try to implement what you mention above. I will use DA to be safe. In basin_utilities, I will set it to zero for all points outside the basin and then add conditions in the routing code and see how it goes.
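A rough sketch of the masking step, in Python rather than the Fortran of basin_utilities and not taken from it. The function name `mask_da_outside_basin` is hypothetical, and it assumes that cells are stored in RANK order with NEXT always pointing to a higher RANK, so that one reverse sweep is enough to find everything upstream of the outlet.

```python
def mask_da_outside_basin(next_rank, da, outlet_rank):
    """Return a copy of da with zeros for cells that do not drain to the outlet.

    next_rank[i] is the 1-based RANK of the receiving cell (0 = none).
    Assumes NEXT > RANK for every cell, i.e. upstream-first ordering.
    """
    n = len(da)
    inside = [False] * n
    inside[outlet_rank - 1] = True
    # Sweep from the highest RANK down so each receiver is classified
    # before the cells that drain into it.
    for i in range(n - 1, -1, -1):
        nxt = next_rank[i]
        if nxt > 0 and inside[nxt - 1]:
            inside[i] = True
    return [a if keep else 0.0 for a, keep in zip(da, inside)]


# Cells 1 and 2 drain to cell 3 (the outlet); cells 4 and 5 lie outside.
masked = mask_da_outside_basin([3, 3, 5, 5, 0], [1.0, 2.0, 3.0, 4.0, 10.0], 3)
```

With DA zeroed this way, the routing loops only need a single DA check to skip everything outside the sub-basin, including lakes and reaches downstream of the outlet.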

mee067 commented 4 months ago

I implemented it using DA, and it tested successfully for the Athabasca with the rest of the MRB masked out, including all its lakes. I pushed the updates to my code branch at: https://github.com/MESH-Model/MESH_Code/tree/master/r1860_ME

I hope this does not disrupt your current merging work.
