NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I want an improved netcdf format (netcdf3) #280

Open epag opened 3 months ago

epag commented 3 months ago

Author Name: James (James) Original Redmine Issue: 97121, https://vlab.noaa.gov/redmine/issues/97121 Original Date: 2021-10-05


Given a `netcdf2` format that has some weaknesses (because it attempted to straddle various competing objectives at the time), when I consider how to improve it, then I want to consider a `netcdf3` format.

Specific enhancements to be listed.


Redmine related issue(s): 103076


epag commented 3 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-10-05T15:35:57Z


Must not reduce compatibility compared with `netcdf2`. Must still work in a recent GDAL version and hence off-the-shelf tools like QGIS.

Should probably target CF 1.8.

Need to add WKT geometries, for one. That would allow us to represent feature groups properly, as well as other more complex geometries. It would be nice to have one blob, not many. It might be nice to add all statistics, but this is probably not straightforward for an array-formatted blob, so that is a nice-to-have, not essential.
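As a rough illustration of the WKT idea, here is a minimal sketch of producing one WKT blob per feature group, assuming the JTS library; the coordinates are illustrative only, and how and where the strings are then stored in the `netcdf3` blob (e.g. a string-valued variable indexed by feature group) is exactly the open design question.

```java
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.io.WKTWriter;

final class FeatureGroupWkt
{
    public static void main( String[] args )
    {
        GeometryFactory factory = new GeometryFactory();

        // A feature group of two gage locations, represented as a multi-point;
        // the coordinates here are made up for the sketch
        Point first = factory.createPoint( new Coordinate( -90.07, 35.12 ) );
        Point second = factory.createPoint( new Coordinate( -90.05, 35.15 ) );
        Geometry group = factory.createMultiPoint( new Point[] { first, second } );

        // One WKT blob per feature group, which could be written to a
        // string-valued netcdf variable indexed by feature group
        String wkt = new WKTWriter().write( group );
        System.out.println( wkt ); // e.g. MULTIPOINT ((...), (...))
    }
}
```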

epag commented 3 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-10-05T15:38:43Z


Anyway, add yer wishlist here.

One of the nice things with netcdf compared to csv2 is that it's a lot less verbose, so I think it has an ongoing user base. It might make more sense to use csv2 in data-frame-shaped applications, but netcdf is a nicer format in many ways for geospatial applications.

Perhaps, one day, we'll have one format that rules them all (edit: user facing, I mean; we already have our canonical format), but I doubt it, because there is a proliferation of geospatial and time-series formats more generally; this is not a wres thing. Perhaps netcdf3 could be a further step along the way, though.

epag commented 3 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-10-05T16:00:51Z


  1. Single blob
  2. Geographic interoperability with recent GDAL (and therefore other tools)
  3. Accurate and precise modeling
  4. Recent-ish CF-conventions adherence
  5. Less cruft
  6. More metadata

Those in order. In other words, if there is a conflict between CF-conventions and interop, interop takes priority.

Edit: I reversed the order of modeling and CF conventions, split "less cruft" into its own item.

epag commented 3 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-10-05T16:22:54Z


Jesse wrote:

>   1. Single blob
>   2. Geographic interoperability with recent GDAL (and therefore other tools)
>   3. Accurate and precise modeling
>   4. Recent-ish CF-conventions adherence
>   5. Less cruft
>   6. More metadata
>
> Those in order. In other words, if there is a conflict between CF-conventions and interop, interop takes priority.
>
> Edit: I reversed the order of modeling and CF conventions, split "less cruft" into its own item.

Sounds good to me. The reason for data standards/conventions is, in any case, to increase interop, so if the CF convention fails in some way, always side with improved interop for our user base.

epag commented 3 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-10-05T16:24:20Z


( #97121 in terms of item 6, more metadata. )

epag commented 3 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-10-05T16:26:09Z


Another thing that would be really nice to fix (but might be hard, I forget - edit: so I'm not sure whether this is bound up in the format and hence within scope, or in the tools and hence out of scope) is the delayed structure identification. It is a massive pain for our pipeline to have to identify the complete structure up front, before statistics write time (versus building up the structure incrementally as statistics arrive). edit: that is to say, it makes netcdf a special snowflake among statistics formats, which is never good.
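To make the pain concrete, here is a minimal sketch of the two writer shapes; the interface and type names are hypothetical stand-ins, not the actual wres interfaces.

```java
import java.util.List;

interface IncrementalStatisticsWriter
{
    // csv2-style: accept each statistic as it arrives; no prior knowledge of the
    // overall structure is needed
    void accept( Statistic statistic );
}

interface UpFrontStatisticsWriter
{
    // netcdf2-style: the complete set of variables and dimensions must be
    // declared before the first statistic can be written
    void declareStructure( List<VariableDeclaration> structure );

    void accept( Statistic statistic );
}

// Placeholder types for the sketch only
class Statistic {}
class VariableDeclaration {}
```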

epag commented 3 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-10-06T11:43:23Z


Not a feature, but:

  1. Some unit/integration tests.

We can use an in-memory filesystem for this. There are examples for other format writers, like csv2. Essentially, write the file to an in-memory filesystem, then read some or all of it back and make assertions against expectations. It would be nice not to rely on reading (especially for netcdf, which cannot be read with a JDK one-liner the way csv2 can), but there is no way around it as a means of establishing what was written.
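A minimal sketch of what such a test could look like, assuming Google Jimfs for the in-memory filesystem, JUnit 5, and netcdf-java for the read-back; `writeStatisticsNetcdf` and the asserted variable name are stand-ins, and the exact in-memory open method (`NetcdfFiles.openInMemory` vs the older `NetcdfFile.openInMemory`) depends on the netcdf-java version in use.

```java
import static org.junit.jupiter.api.Assertions.assertNotNull;

import java.nio.file.FileSystem;
import java.nio.file.Files;
import java.nio.file.Path;

import com.google.common.jimfs.Configuration;
import com.google.common.jimfs.Jimfs;
import org.junit.jupiter.api.Test;

import ucar.nc2.NetcdfFile;
import ucar.nc2.NetcdfFiles;

class Netcdf3WriterTest
{
    @Test
    void writesExpectedVariables() throws Exception
    {
        try ( FileSystem inMemory = Jimfs.newFileSystem( Configuration.unix() ) )
        {
            Path target = inMemory.getPath( "/evaluation.nc" );

            // Hypothetical call to the writer under test
            writeStatisticsNetcdf( target );

            // netcdf-java cannot open a path on a non-default FileSystem directly,
            // so read the bytes back and open them in memory
            byte[] written = Files.readAllBytes( target );
            try ( NetcdfFile netcdf = NetcdfFiles.openInMemory( "evaluation.nc", written ) )
            {
                assertNotNull( netcdf.findVariable( "mean_error" ) );
            }
        }
    }

    // Stand-in for the real writer; not part of the wres codebase
    private void writeStatisticsNetcdf( Path target )
    {
        throw new UnsupportedOperationException( "sketch only" );
    }
}
```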

epag commented 3 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-26T11:03:01Z


Variable naming is another area for improvement. In netcdf/netcdf2, we qualify the variable names with metadata, which leads to friction when adding newly qualified slices of statistics. The attributes of a variable should fully qualify the statistics within it. A more general naming convention should be adopted for the variables, avoiding threshold and other qualifiers, and perhaps even the metric name, although the metric name may be helpful for a human user who is trying to visually filter slices in a GIS or some other visualization tool and find the one they want.
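A minimal sketch of the naming idea, assuming the netcdf-java `Attribute` class; the attribute names, values, and variable names below are illustrative only, not a proposed schema.

```java
import java.util.List;

import ucar.nc2.Attribute;

final class VariableNamingSketch
{
    public static void main( String[] args )
    {
        // netcdf2 style: the qualification is baked into the name, so every new
        // slice of statistics needs a new, ever-longer variable name
        String qualifiedName = "mean_error_streamflow_gt_0.5_CMS_lead_24";

        // Proposed netcdf3 style: a general name, fully qualified by attributes
        String generalName = "statistic_1";
        List<Attribute> qualifiers = List.of(
                new Attribute( "metric", "mean error" ),
                new Attribute( "variable", "streamflow" ),
                new Attribute( "threshold", "> 0.5 CMS" ),
                new Attribute( "lead_hours", 24 ) );

        System.out.println( qualifiedName + " -> " + generalName + " " + qualifiers );
    }
}
```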

epag commented 3 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-26T11:07:23Z


edit: oops, wrong thread, ignore.

~~On building, there's a small number of unit test failures to deal with...~~

~~For the system tests, scenario003 will fail on assertions, since the graphics titles are now additionally qualified with the ensemble average type, where applicable, and scenario003 is an ensemble evaluation with all valid metrics and graphics benchmarks. I don't anticipate other failures.~~