NOAA-OWP / snow17


Unit number conflict for large number of catchments #42

Open drakest123 opened 1 month ago

drakest123 commented 1 month ago

When the BMI-enabled version of Snow-17 is initialized with many catchments (e.g. more than 44) there is a unit number conflict when opening an input file.

Current behavior

The program crashes.

Expected behavior

To be determined; the options are discussed in this issue.

Open forum

The issue is that files remain open during a Snow-17 run, which limits the number of catchments that can be processed in a given run. Background for this issue is from an email from Andy Wood:

"My original code setup for those models (pre-BMI) looped through each zone (eg U, L) and opened/closed the files before moving to the next zone, since as you say, they don't interact. When I refactored it to use BMI, I changed it to open all the files in the initialize step and close them all in the finalize step, and I didn't think about the upper limits. I/we could easily change the numbering scheme for the files to enable it to keep many 1000s of files open, which most machines would support, and which would enable a reasonable large (but not infinite) standalone run case. And when run in nextgen, snow17 should be getting forcings from the framework and not opening its files. The current numbering scheme envisions running a basin at a time, and the basin might have some number of elevation zones but probably never more than 20 (in RFC world the max is about 3).

An alternative might be tricky – given the way the update() function works. It would be inefficient to have that function re-open all the files and close them just to read the forcing for a single timestep. If all the forcings are in one netcdf file, instead of individual csvs, then that could simplify the problem. Is the reason that noah-om doesn't have this issue because it doesn't try to run sub-catchment level zones? (or basin sub-catchments)? I think all the noah-om dev. work ran standalone over single catchments (ie one forcing file per catchment)."

Possible alternatives:

  1. No code changes other than an error check that fires when the number of catchments exceeds the count that produces a unit number conflict on file open
  2. Reformulate the code to allow for many more open csv files
  3. Put all forcings into a single netcdf file
  4. Close each file after reading data for the current time interval

Alternative discussion:

  1. If Snow-17 is executed using a single catchment at a time, then the only change we need to make is to check for multiple catchments and either error out or warn of a unit number conflict. This method also scales well because we avoid the problem of running out of open file handles or exceeding memory constraints if an arbitrary number of catchments are processed in a single run.
  2. If it is the case that we could possibly have 1000s of files open at a single time then this option is worth discussing. One consideration is that it seems problematic design-wise and the number of possible file handles may be operating system dependent. It also does not scale.
  3. Putting all forcings into a single netcdf file avoids the problem of many open files but complicates formatting the forcings input. If the file remains open for sequential reading then, it seems to me, the data from different catchments would need to be interleaved by timestep, which would make for a complicated input file format. If, instead, direct access is used then we'd need to come up with an algorithm for tracking record numbers. In either case, this solution would not scale, memory-wise. (A rough netCDF read sketch follows this list.)
  4. Similar to (3), this solution would require direct access of the input file and tracking record numbers. Also, as you mention, it would be computationally expensive. In this case we may need to track multiple record numbers (although they'd likely be the same :] ), and it would not scale for an arbitrarily large number of catchments.
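To make alternative 3 concrete, here is a rough, hypothetical sketch of reading one timestep of forcings for all catchments from a single netCDF file via the netcdf-fortran library. The file name ('forcings.nc'), the variable name ('precip'), and its (catchment, time) layout are assumptions for illustration, not an existing Snow-17 input format.

    program read_forcing_slice
      use netcdf
      implicit none
      integer :: ncid, varid, dimid
      integer :: n_hrus, itime
      real, allocatable :: precip(:)

      call check( nf90_open('forcings.nc', NF90_NOWRITE, ncid) )

      ! Size the read from the catchment dimension so the file drives the loop.
      call check( nf90_inq_dimid(ncid, 'catchment', dimid) )
      call check( nf90_inquire_dimension(ncid, dimid, len=n_hrus) )
      call check( nf90_inq_varid(ncid, 'precip', varid) )
      allocate(precip(n_hrus))

      ! Read one timestep for all catchments; in update() this index would be
      ! the current model timestep rather than a constant.
      itime = 1
      call check( nf90_get_var(ncid, varid, precip, &
                               start=[1, itime], count=[n_hrus, 1]) )
      print *, 'precip at timestep', itime, ':', precip

      call check( nf90_close(ncid) )

    contains

      subroutine check(status)
        integer, intent(in) :: status
        if (status /= nf90_noerr) then
          print *, trim(nf90_strerror(status))
          stop 1
        end if
      end subroutine check

    end program read_forcing_slice

Only one file handle stays open regardless of the number of catchments, which is the main appeal of this alternative; the cost is defining and generating the consolidated forcing file.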

I don’t know the answer to your question about whether noah-om was tested using single catchments but that does seem to be a critical point. If it is advisable to run Snow-17 a single catchment at a time in the noah-om context then the file I/O issues become less consequential.

andywood commented 1 month ago

Good discussion of the issue, Steve. One thing to note is that this mainly refers to the standalone use case (ie not run through nextgen). The file opening/reading is skipped by a compiler directive when run in nextgen. Even so I think it's worth making a quick fix on file number ranges to allow for opening many more files -- I'm not sure what a reasonable standalone use-case limit is, versus the nextgen application case. If this is an edge case, usage-wise, the priority to go beyond that fix might be low.

SnowHydrology commented 1 month ago

@drakest123 To confirm, we're talking here about forcing and output files when running Snow-17 in standalone mode? I believe you also mentioned an issue caused in NextGen by the parameter file staying open. The former can likely be "fixed" by adding the check you note in option 1 above. The latter should be fixed by closing the parameter file after reading it in.
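As a minimal illustration of the parameter-file point, the sketch below reads a parameter file once and closes the unit immediately afterward, so nothing stays open for the rest of the run. The file name, unit handling, and parameter list are assumptions, not the actual Snow-17 parameter reader.

    program read_params_once
      implicit none
      integer :: punit, ios
      real    :: scf, mfmax, mfmin

      ! Open, read, and close in one place; the unit is released right away.
      open(newunit=punit, file='snow17_params.txt', status='old', &
           action='read', iostat=ios)
      if (ios /= 0) stop 'could not open parameter file'

      read(punit, *, iostat=ios) scf, mfmax, mfmin
      if (ios /= 0) stop 'could not read parameters'

      close(punit)

      print *, 'scf =', scf, ' mfmax =', mfmax, ' mfmin =', mfmin
    end program read_params_once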

drakest123 commented 1 month ago

Considerations

  1. Snow-17 runs within NGEN execute a single catchment at a time. That approach processes every catchment, but it is slow and precludes code optimizations that would process many catchments simultaneously.
  2. The number of input forcing files could become unwieldy if there is a strict one-to-one correspondence between forcing files and catchments.
  3. The Fortran specs indicate that one may be able to open as few as 100 files at a time (file units 0-99), and some of those units are conventionally preconnected to standard I/O (e.g., units 5 and 6 for standard input and output). The number of files one can open at a time is implementation specific.
  4. One reference indicated that you can change #define MXUNIT 100 and recompile the Fortran runtime to increase the number of files that can be opened at a time. However, there is also an OS limit on open file descriptors (see the probe sketch after this list). On this Mac M1 the limit is 256:
    % ulimit -n
    256
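A quick way to see the effective limit (compiler runtime plus OS) is to keep opening scratch files until an open fails. This small diagnostic is not part of Snow-17; it uses the Fortran 2008 newunit= specifier, and scratch files are deleted automatically when the program ends.

    program probe_open_limit
      implicit none
      integer :: u, ios, n_open

      n_open = 0
      do
        ! Each scratch file stays open, consuming one unit and one OS file handle.
        open(newunit=u, status='scratch', iostat=ios)
        if (ios /= 0) exit
        n_open = n_open + 1
      end do
      print *, 'held', n_open, 'scratch files open before the first failure'
    end program probe_open_limit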

Suggested solutions

  1. Write a function/subroutine that checks for open unit numbers and doles out an unused unit, starting at 10 to stay clear of the preconnected standard I/O units. Avoid unit numbers that are explicitly used elsewhere in the program (e.g., 51), or change the code so that all unit numbers are doled out this way (see the sketch after this list).
  2. Error check for requests that exceed an established number of catchments, e.g., 80.
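A rough sketch of suggested solution 1 is below. The program name, starting unit, and upper bound are illustrative, not taken from the snow17 code base; on compilers that support Fortran 2008, open(newunit=...) performs this search automatically and returns a unique negative unit, which avoids conflicts entirely (the OS open-file limit still applies, which is where the catchment-count check in solution 2 would come in).

    program demo_free_unit
      implicit none
      integer :: u

      u = get_free_unit()
      if (u < 0) stop 'no free unit available'
      open(unit=u, status='scratch')   ! placeholder open just to show usage
      print *, 'opened a scratch file on unit', u
      close(u)

    contains

      ! Scan for the first unit number that is not currently connected.
      function get_free_unit() result(iunit)
        integer :: iunit, n
        logical :: already_open
        iunit = -1
        do n = 10, 9999          ! start above the preconnected units; the OS
                                 ! open-file limit still bounds what is usable
          inquire(unit=n, opened=already_open)
          if (.not. already_open) then
            iunit = n
            return
          end if
        end do
      end function get_free_unit

    end program demo_free_unit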