As a developer, I want a simpler mechanism for specifying NWM netCDF data to be read than the current interfaces

epag commented 2 months ago

Author Name: Hank (Hank) Original Redmine Issue: 124687, https://vlab.noaa.gov/redmine/issues/124687 Original Date: 2024-01-03

Feel free to reword the subject; I probably rushed a bit in writing that. We would like to simplify or remove the interface options for NWM netCDF sources. Design of a solution still needs to be discussed, and this ticket can be resolved once that simpler solution is implemented and ready for deployment.

Marking this high, but not urgent, and placing in the backlog.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-01-03T13:37:16Z

From James in #121139:

Separately, we need to reconsider interface shorthands as a way of policing a structural expectation on NWM data. Why is the NWM a special snowflake among our data sources? We don't police data structures from other models/sources, we simply read what we find and assemble it. I think the upside of policing the NWM structure is pretty small - basically a bunch of warnings when the expectation is not met - but the downside is quite large, namely a proliferation of interface shorthands to capture the vast number of dimensions used to describe each structure. If these structures are going to change between model versions, we'll end up with a completely unusable list of shorthands and that is arguably the case already. Perhaps I am forgetting some other reasons for the upfront identification of this structural expectation beyond policing missing data, but it should be possible to design something more adaptive that does not rely on a directory structure that is known upfront. There are other parts of our code base where we assemble (e.g., ensemble) time-series from multiple sources. It should be possible to do this ex-post, rather than correlating sources with time-series upfront.

I then asked:

Before we create that bigger ticket, I want to make sure I understand the implications of your comment. Specifically, if we do not "police" the data structure (which I assume also means interpreting the file names), then, presumably, the WRES will scan the directory/path to which it is pointed and examine all of the files beneath it in order to identify data to be ingested and ingest that data. Is that correct?

One upside to that policing is performance. By knowing the directory structure and file name format up front, we can ensure that files not within the issued/reference date range are not examined. For example, the first directory underneath the top-level 2.2/3.0/whatever directory is named for the reference date:

https://[D-Store]/nwm/3.0/nwm.20231204/

If we don't assume a directory structure under 3.0, then we will end up looking at every 3.0 archived netCDF file when scanning the directory structure for data to read, unless the user makes judicious use of file path/name pattern matching. This will not only include files outside of the period of interest, but also undesired data types. Yes, I understand that the WRES can just look at the netCDF headers to make that determination, but that is an awful lot of netCDF headers that will need to be examined.

He replied,

We can exploit reference datetimes in directory names without an interface shorthand. Reference datetimes are not part of an interface shorthand. Rather, the reference datetimes are a common part of the nwm directory structure naming and will be present forevermore. In other words, I am not proposing to make no assumptions about nwm directory structures, although we can probably even detect those rather than assume them, rather I am proposing to avoid the brittle interface shorthands.

The problem with interface shorthands is that they abstract too many of the dimensions that change often, like the number of members. We should be able to take a declaration with (e.g., reference date) constraints and find the data we want, no problem, without requiring a user to declare an interface shorthand.
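
To illustrate the point about exploiting reference datetimes in directory names, here is a minimal sketch in Python (not WRES code) that prunes a directory listing to a declared reference date range. It assumes the directories are named like the "nwm.20231204" example quoted above; the function names and the inclusive date bounds are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Iterable, List, Optional

# Assumed naming, based on the "nwm.20231204" example in this thread.
REFERENCE_DATE_DIR = re.compile(r"^nwm\.(\d{8})$")


def reference_date_of(dir_name: str) -> Optional[date]:
    """Return the reference date encoded in a directory name, or None."""
    match = REFERENCE_DATE_DIR.match(dir_name.rstrip("/"))
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None


def select_reference_date_dirs(
    dir_names: Iterable[str], earliest: date, latest: date
) -> List[str]:
    """Keep directories whose reference date lies in [earliest, latest]."""
    selected = []
    for name in dir_names:
        ref = reference_date_of(name)
        if ref is not None and earliest <= ref <= latest:
            selected.append(name)
    return selected
```

Only the directories that survive this filter would need to be listed further, so files outside the period of interest are never opened.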

I'll reply in my next comment,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-01-03T13:50:57Z

First, here is a complete example of the path to a specific netCDF file to provide context:

https://nwcal-dstore.[host]/nwm/3.0/nwm.20231204/medium_range_alaska_mem1/nwm.t00z.medium_range.channel_rt_1.f001.alaska.nc

I guess it was hard for me to tell how far you wanted to go with this. For example, you said,

If these structures are going to change between model versions, we'll end up with a completely unusable list of shorthands and that is arguably the case already.

That implies, to me, that we can't make any assumptions about the directory structure since any part of it could change at any time. And, yes, that includes the date portion of the structure; e.g., why is "nwm." part of the date? Certainly that prefix could be dropped. However, it appears that you are willing to make some assumptions:

In other words, I am not proposing to make no assumptions about nwm directory structures, although we can probably even detect those rather than assume them, rather I am proposing to avoid the brittle interface shorthands.

That sounds reasonable: assume a basic directory structure, auto detect what we can, rely on the declaration for the rest.

Here is the basic directory structure:

[base URL]/[nwm.<reference date>]/[type of forecast or simulation, possibly with members]/[netCDF with specific naming convention]

The naming convention for files is something like the following:

nwm.[reference date hour of day in Z].[forecast type].[data type].[lead time].[additional qualifier].nc
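
As a rough sketch of how the layout above could be taken apart mechanically, the following Python splits a full path into its four structural pieces. It assumes exactly three levels below the base URL, as in the layout above; the type and function names are hypothetical, and the host in the usage note is a placeholder rather than a real server.

```python
from typing import NamedTuple


class NwmPathParts(NamedTuple):
    base_url: str   # everything above the reference-date directory
    date_dir: str   # e.g. "nwm.20231204"
    type_dir: str   # e.g. "medium_range_alaska_mem1"
    file_name: str  # e.g. "nwm.t00z.medium_range.channel_rt_1.f001.alaska.nc"


def split_nwm_path(path: str) -> NwmPathParts:
    """Split a path on its last three separators, per the assumed layout."""
    parts = path.rsplit("/", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"Path does not match the assumed layout: {path}")
    return NwmPathParts(*parts)
```

For instance, `split_nwm_path("https://example.invalid/nwm/3.0/nwm.20231204/medium_range_alaska_mem1/nwm.t00z.medium_range.channel_rt_1.f001.alaska.nc")` would yield a base URL ending in "/nwm/3.0" and the three lower components separately. Splitting from the right means the base URL can contain any number of path segments.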

What parts of that can be autodetected? What needs to come from the user?

I just noticed some other comments posted to the other ticket, so let me scan those.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-01-03T13:52:11Z

From #121139-11

The other possibility is that we make the interface shorthands less brittle by removing all of the attributes that are likely to change between model versions, such as gaps between reference times and valid times and number of members and just detect/read what is there.

That makes sense to me.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-01-03T13:58:16Z

I think the starting point is to make them less brittle. I think we can probably go beyond this, but it would be more work. The main thing to avoid is the addition of more interface shorthands, because the list is already too unwieldy. However, including a bunch of (time and member) dimensions that are likely to change between model versions within the shorthand definition naturally leads to a proliferation of interface shorthands.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-01-03T14:01:46Z

Taking a first crack at what can be detected and what must be declared:

[base URL]/[nwm.<reference date>]/[type of forecast or simulation, possibly with members]/[netCDF with specific naming convention]

nwm.[reference date hour of day in Z].[forecast type].[data type].[lead time].[additional qualifier].nc

The base URL must be user specified via the path, as it is now.
The reference date can be detected.
The type of forecast or simulation must come from the user somehow, which implies the interface. Examples are "analysis_assim_hawaii", "medium_range_mem1", "analysis_assim_alaska", "short_range_puertorico_no_da", etc.
Member indices can be identified by looking for a number at the end of the "type of forecast or simulation". For example, if we can determine "medium_range_mem" from the user's declaration, we can then look for "medium_range_mem#" and identify the #.
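
The member detection described above might look something like the following Python sketch. It assumes the declared type is a directory-name prefix such as "medium_range_mem" and that the member index is the run of digits that immediately follows it; both function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional


def member_index(dir_name: str, declared_type: str) -> Optional[int]:
    """Return the trailing member number if dir_name is declared_type + digits."""
    match = re.fullmatch(re.escape(declared_type) + r"(\d+)", dir_name)
    return int(match.group(1)) if match else None


def find_members(dir_names: Iterable[str], declared_type: str) -> Dict[int, str]:
    """Map each detected member index to its directory, ordered by index."""
    found = {}
    for name in dir_names:
        index = member_index(name, declared_type)
        if index is not None:
            found[index] = name
    return dict(sorted(found.items()))
```

Because the whole directory name has to match, a declaration of "medium_range_mem" would not pick up "medium_range_alaska_mem1", and the number of members is simply whatever is found rather than something baked into a shorthand.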

Looking at the file name:

The reference time-of-day in Z time can be detected from the file name. It is always after the "nwm." file name prefix.
The forecast type is always the same for every file in a given directory, and is largely redundant with the directory name in which the file is located.
The data type, however, can differ. For example, medium range data includes both "channel_rt_1" and "reservoir_1". That must be identified by the user somehow, perhaps being implied by the "streamflow" variable name.
The lead time can be detected from the file name.
The additional qualifier does not provide any more information beyond what the directory name tells us.
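
To make the file name breakdown above concrete, here is a hedged Python sketch that pulls out each field with a regular expression. The pattern is inferred from the single example file name in this thread (nwm.t00z.medium_range.channel_rt_1.f001.alaska.nc), so other products may well need a different pattern; the names are hypothetical and the lead time is kept as a bare number without interpreting its units.

```python
import re
from typing import NamedTuple, Optional

# Pattern assumed from one example file name; not a complete NWM specification.
FILE_NAME = re.compile(
    r"^nwm\.t(?P<hour>\d{2})z"       # reference time-of-day in Z
    r"\.(?P<forecast_type>[^.]+)"    # e.g. "medium_range"
    r"\.(?P<data_type>[^.]+)"        # e.g. "channel_rt_1" or "reservoir_1"
    r"\.f(?P<lead>\d+)"              # lead time
    r"\.(?P<qualifier>[^.]+)"        # e.g. "alaska"
    r"\.nc$"
)


class NwmFileName(NamedTuple):
    reference_hour: int
    forecast_type: str
    data_type: str
    lead_time: int
    qualifier: str


def parse_file_name(name: str) -> Optional[NwmFileName]:
    """Return the fields of a file name, or None if it does not fit the pattern."""
    match = FILE_NAME.match(name)
    if match is None:
        return None
    return NwmFileName(
        reference_hour=int(match["hour"]),
        forecast_type=match["forecast_type"],
        data_type=match["data_type"],
        lead_time=int(match["lead"]),
        qualifier=match["qualifier"],
    )
```

With the fields parsed, selecting by data type is a simple comparison on `data_type`, which is the one file-name field that would still have to be tied back to the user's declaration.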

So, to summarize, I think we need to know the type of forecast or simulation being evaluated and the variable. The rest can be parsed and interpreted. (EDIT: The user must also provide the base URL, which is already required.)

Meeting,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-01-03T14:02:59Z

I would need to look at the other fields in an NWM profile to see if there is something I overlooked. That can happen later when work begins.

Hank

NOAA-OWP / wres

As a developer, I want a simpler mechanism for specifying NWM netCDF data to be read than the current interfaces #154