NOAA-ORR-ERD / gridded

A single API for accessing / working with gridded model results on multiple grid types
https://noaa-orr-erd.github.io/gridded/index.html
The Unlicense
64 stars 14 forks source link

Does gridded write compliant UGRID netcdfs? #65

Open groutr opened 3 years ago

groutr commented 3 years ago

After manipulating the mesh in gridded, saving the resulting grid doesn't seem to produce a valid UGRID formatted netcdf. RGFGRID doesn't seem to know what to do with the file.

I noticed that a netcdf output by RGFGRID (in UGRID format) look like:

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF3_CLASSIC data model, file format NETCDF3):
    institution: Deltares
    references: http://www.deltares.nl
    source: RGFGRID 5.05.00.59149. Model: ---
    history: Created on 2021-03-18T13:12:52-0500, RGFGRID
    Conventions: CF-1.6 UGRID-1.0/Deltares-0.8
    dimensions(sizes): Two(2), nmesh2d_edge(188), nmesh2d_node(73), nmesh2d_face(116), max_nmesh2d_face_nodes(3)
    variables(dimensions): int32 mesh2d(), int32 projected_coordinate_system(), float64 mesh2d_node_x(nmesh2d_node), float64 mesh2d_node_y(nmesh2d_node), float64 mesh2d_node_z(nmesh2d_node), float64 mesh2d_edge_x(nmesh2d_edge), float64 mesh2d_edge_y(nmesh2d_edge), float64 mesh2d_edge_x_bnd(nmesh2d_edge, Two), float64 mesh2d_edge_y_bnd(nmesh2d_edge, Two), int32 mesh2d_edge_nodes(nmesh2d_edge, Two), int32 mesh2d_face_nodes(nmesh2d_face, max_nmesh2d_face_nodes), int32 mesh2d_edge_faces(nmesh2d_edge, Two), float64 mesh2d_face_x(nmesh2d_face), float64 mesh2d_face_y(nmesh2d_face), float64 mesh2d_face_x_bnd(nmesh2d_face, max_nmesh2d_face_nodes), float64 mesh2d_face_y_bnd(nmesh2d_face, max_nmesh2d_face_nodes)
    groups: 

whereas the file output by gridded looks like:

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes): mesh_num_node(73), mesh_num_edge(193), mesh_num_boundary(28), mesh_num_face(116), mesh_num_vertices(3), two(2)
    variables(dimensions): int32 mesh(), int32 mesh_face_nodes(mesh_num_face, mesh_num_vertices), int32 mesh_edge_nodes(mesh_num_edge, two), int32 mesh_boundary_nodes(mesh_num_boundary, two), float64 mesh_face_lat(mesh_num_face), float64 mesh_face_lon(mesh_num_face), float64 mesh_edge_lat(mesh_num_edge), float64 mesh_edge_lon(mesh_num_edge), float64 mesh_boundary_lat(mesh_num_boundary), float64 mesh_boundary_lon(mesh_num_boundary), float64 mesh_node_lon(mesh_num_node), float64 mesh_node_lat(mesh_num_node)
    groups: 

Noted differences are:

  1. The conventions field is missing in the gridded output.
  2. Dimension names are different. The Deltares netcdf appears closer to the spec that is detailed here: https://ugrid-conventions.github.io/ugrid-conventions/
  3. Variables names differ as well.

Does gridded produce compliant UGRID netcdf files? Reading the documentation, it's never explicitly stated that the outputs conform to any standard. I simply expected the GridU save method to output something that was UGRID.

ChrisBarker-NOAA commented 3 years ago

The intent is to produce UGRID compliant files, yes.

But that code hasn't had any attention in a long time -- since before UGRID hit 1.0.

PRs accepted:-)

As to what you've noticed, the names of the dimensions and variables are arbitrary.

So the only issue here may be the conventions attribute.

But if you notice other issues, please let us know.

groutr commented 3 years ago

@ChrisBarker-NOAA I'm happy to put together a PR. I'll try adding the conventions attribute and see if RGFGRID can read that first. Since I'm not very familiar with the UGRID standard I might need some guidance on some parts of the spec if more indepth changes are required.

ChrisBarker-NOAA commented 3 years ago

thanks!

groutr commented 3 years ago

@ChrisBarker-NOAA One my meshes challenges some of the assumptions made by the GridU class. The GridU class doesn't seem to like meshes with mixed faces (you end up with a masked array which breaks the edge and face handling logic in the class).

In refactoring some of the methods, I think it would be better to store edges and faces as sets instead of numpy arrays. This would avoid a lot of the issues I'm having with ragged arrays. I can provide methods for converting edges and faces to numpy arrays. For ordering purposes, edge and face vertices would be stored in sorted order (though that may not necessarily mean CW or CCW).

ChrisBarker-NOAA commented 3 years ago

hmm.

You are quite right, we were vaguely thinking of supporting mixed faces, but never actually tested anything. So we welcome help with that.

However, I'm wary of going to a non- numpy solution -- numpy arrays buy you a lot.

The UGRID netcdf standard specifies using a flag value (or mask) to handle mixed faces, so we just assumed we'd do that int he UGRid class as well.

But you are right, numpy doesn't have a ragged array concept built in in any way, but that's what we really want here.

My first choice would be to make a ragged array class -- I"ve done some work on that in the past, but never finished it, but that would buy us something that would act like a numpy array as much as possible.

Thinking out loud, for this use case, maybe not a true ragged array, but rather, a kind of custom masked array for this use case. That is, the information would be stored as it is now, with a mask or flagged value for the "null" nodes. But when you accessed an element, you'd get a correct sized array back. Lots of details to be worked out of course.

Another option would be to make an array-like object with another data structure underneath actually storing the information.

But why a set? sets are optimized for fast membership testing, but they are unordered, which I think would make it a challenge for this use case. That is, no way to find out what the Nth face is.

Anyway -- you've thought about this more than I have, so I may be thinking about it incorrectly.

I think the way to proceed is to get a manageable sized test case, and some tests, and then you can prototype a solution and see how it goes.

ChrisBarker-NOAA commented 3 years ago

BTW: with regard to UGRID compliance, I think we're not far off. However, what we don't have is a way to maintain / specify dimension or variable names, which maybe we should.

That is, if you load a grid from a NetCDF file, and save it out again, it would be nice if all the names weren't changed on you :-)

I believe the one name it is currently supposed to preserve is the name of the mesh -- but that's it.

groutr commented 3 years ago

@ChrisBarker-NOAA I realize it might be a big step to move away from numpy. I was kinda thinking aloud too. I'm still in the process of learning how meshes are stored and computed on. As I've poked around some more, I've realized order in very important.

Reading the spec, I noticed this:

The case of a 2D mesh with mixed face sizes is identical to the 2D triangular mesh discussed above with the exception that not all >faces have the same number of nodes. To support this variability we may use in the future a ragged array, but here we propose to >use _FillValue to indicate faces with smaller number of nodes than the arrays allow.

Perhaps, the fill value route is doable. It would be nice if the heavy looping could be ported to cython to minimize the performance hit. I don't know of any array libraries that allow ragged arrays off the top of my head. Maybe a list of arrays? The NetCDF4 library is handing back masked arrays already. The issues arise because the masked values are not hashable and we put them into sets and dictionaries when processing edges and faces.

groutr commented 3 years ago

Regarding UGRID compatibility, the following produces files that load in QGIS.

def save_ugrid(filepath, grid):
    mesh_name = grid.mesh_name
    with netCDF4.Dataset(filepath, 'w') as ds:
        ds.Conventions = "CF-1.6 UGRID-1.0"
        # create dimensions
        node_dim = ds.createDimension(f"{mesh_name}_num_node", len(grid.nodes)) 
        ds.createDimension("two", 2)

        mesh = ds.createVariable(mesh_name, IND_DT, dimensions=())
        mesh.cf_role = "mesh_topology"
        mesh.long_name = "Topology data of 2D unstructured mesh"
        mesh.topology_dimension = 2
        mesh.node_coordinates = "{0}_node_lon {0}_node_lat".format(mesh_name)
        mesh.node_dimension = node_dim.name

        if grid.edges is not None:
            ds.createDimension(f"{mesh_name}_num_edge", grid.edges.shape[0])
            mesh.edge_node_connectivity = f"{mesh_name}_edge_nodes"
            #if grid.edge_coordinates is not None:
            #    mesh.edge_coordinates = " ".join((_edge_lon, _edge_lat))

            _dims = (f"{mesh_name}_num_edge", "two")
            edge_nodes = ds.createVariable(f"{mesh_name}_edge_nodes", IND_DT, dimensions=_dims, fill_value=999999)
            edge_nodes[:] = grid.edges
            edge_nodes.cf_role = "edge_node_connectivity"
            edge_nodes.long_name = ("Maps every edge to the two "
                                    "nodes that it connects.")
            edge_nodes.start_index = IND_DT(0)
            edge_nodes.mesh = mesh_name
            edge_nodes.location = 'edge'

        if grid.faces is not None:
            faces_dim = ds.createDimension(f"{mesh_name}_num_face", grid.faces.shape[0])
            faces_nodes_dim = ds.createDimension(f"{mesh_name}_num_vertices", grid.faces.shape[1])
            mesh.max_face_nodes_dimension = faces_nodes_dim.name
            #if grid.face_coordinates is not None:
            #    mesh.edge_coordinates = "{0}_face_lon {0}_face_lat".format(mesh_name)

            _dims = (f"{mesh_name}_num_face", f"{mesh_name}_num_vertices")
            face_nodes = ds.createVariable(f"{mesh_name}_face_nodes", IND_DT, dimensions=_dims, fill_value=999999)
            mesh.face_node_connectivity = face_nodes.name
            face_nodes[:] = grid.faces
            face_nodes.cf_role = "face_node_connectivity"
            face_nodes.long_name = ("Maps every triangular face to "
                                    "its three corner nodes.")
            face_nodes.start_index = IND_DT(0)

        if grid.boundaries is not None:
            ds.createDimension(f"{mesh_name}_num_boundary", grid.boundaries.shape[0])
            mesh.boundary_node_connectivity = f"{mesh_name}_boundary_nodes"

            _dims = (f"{mesh_name}_num_boundary", "two")
            boundary_nodes = ds.createVariable(f"{mesh_name}_boundary_nodes", IND_DT, dimensions=_dims, fill_value=999999)
            boundary_nodes[:] = grid.boundaries
            boundary_nodes.cf_role = "boundary_node_connectivity"
            boundary_nodes.long_name = ("Maps every boundary segment to "
                                        "the two nodes that it connects.")
            boundary_nodes.start_index = IND_DT(0)

        if grid.face_edge_connectivity is not None:
            mesh.face_edge_connectivity = f"{mesh_name}_face_edges"
        if grid.face_face_connectivity is not None:
            mesh.face_face_connectivity = f"{mesh_name}_face_links"

        _lonlats = {'lon': {'long_name': 'longitude', 'axis': 0, 'units': 'degrees_east'}, 
                    'lat': {'long_name': 'latitude', 'axis': 1, 'units': 'degrees_north'}}
        for location in ('face', 'edge', 'boundary'):
            loc = f"{location}_coordinates"
            if getattr(grid, loc, None) is not None:
                coord_vars = []
                for axis, _data in _lonlats.items():
                    name = f"{mesh_name}_{location}_{axis}"
                    coord_vars.append(name)
                    _dims = (f"{mesh_name}_num_{location}",)
                    var = ds.createVariable(name, NODE_DT, dimensions=_dims)
                    var[:] = getattr(grid, loc)[:, _data['axis']]
                    var.standard_name = _data['long_name']
                    var.units = _data['units']
                    var.long_name = f"Characteristics {var.standard_name} of 2D mesh {location}"
                    var.mesh = mesh_name
                    var.location = location
                #if location in {'face', 'edge'}:
                #    mesh.setncattr(f"{location}_coordinates", " ".join(coord_vars))

        _dims = (f"{mesh_name}_num_node",)
        node_lon = ds.createVariable(f"{mesh_name}_node_lon", grid.nodes.dtype, dimensions=_dims)
        node_lon[:] = grid.nodes[:, 0]
        node_lon.standard_name = "longitude"
        node_lon.long_name = "Longitude of 2D mesh nodes."
        node_lon.units = "degrees_east"
        node_lon.mesh = mesh.name
        node_lon.location = "node"

        node_lat = ds.createVariable(f"{mesh_name}_node_lat", grid.nodes.dtype, dimensions=_dims)
        node_lat[:] = grid.nodes[:, 1]
        node_lat.standard_name = "latitude"
        node_lat.long_name = "Latitude of 2D mesh nodes."
        node_lat.units = "degrees_north"
        node_lat.mesh = mesh.name
        node_lat.location = "node"

    print("Saved to", filepath)

For some reason adding the face and edge coordinates breaks things, and I'm not sure why.

ChrisBarker-NOAA commented 3 years ago

It looks like QGIS is using MDAL; (https://github.com/lutraconsulting/MDAL) which says it supports UGRID. But if it can't read a file with face and edge coordinates, then that's a bug in MDAL :-(

groutr commented 3 years ago

At least I get QGIS to read and render the mesh. RGFGRID (from Deltares) doesn't load the file correctly either. Perhaps RGFGRID might be expecting specifically named dimensions/variables.

groutr commented 3 years ago

@ChrisBarker-NOAA I think I cracked the case on UGRID compatibility. edge_dimension and face_dimension were not set on the mesh variable. After setting those two attributes, RGFGRID was able to read/render the mesh.

I'm working on a way to preserve names from a netcdf file. The mesh variable has defined keys like face_dimension and face_coordinates etc. I propose adding properties to UGrid object, namely, the keys that are defined in the mesh variable that are currently not part of the UGrid API. I created a Schema object (a dict subclass with some helper methods), to record the dimension and variable names and corresponding attributes.

Example API:

schema = Schema.from_nc(<filename>)
schema['edge_dimension']  # returns the name of the file edge dimension
schema['dimensions'] # a dictionary of objects representing the dimensions (not the netCDF4.Dimension objects). Stores name and size
schema['variables'] # a dictionary of variable info (except for the mesh variable which is stored in the top level dictionary

# Some helper methods to retrieve name of dimensions
schema['edge_dimension']  # results in something like mesh_num_edge
schema.get_dimension('mesh_num_edge')  # results in 'edge_dimension'
# Optionally pass in grid to evaluate to a value
schema.get_dimension('mesh_num_edge', mesh=grid)  # evaluates getattr(grid, schema.get_dimension('mesh_num_edge'))

# With a coordinate variable
schema.get_variable('mesh_node_x')  # results in something like ('node_coordinates', 0)
schema.get_variable('mesh_node_x', mesh=grid)  # evaluates getattr(grid, 'node_coordinates')[:,0]

When writing to a file the anticipated use would be:

for d in schema['dimensions']:
    ds.createDimension(d, schema.get_dimension(d, mesh=grid))
for name, v in schema['variables'].items():
    ds.createVariable(name, v['dtype'], dimensions=v['dimensions'])
    ...

For creating a file from scratch, a function could be written to generate names for each entity. Schemas could be validated to ensure that dimensions and variables are defined and names match. I would appreciate any feedback on integrating this with existing code. The schema could be stored in the UGrid object or kept separate. I'm interested to hear your thoughts.

groutr commented 3 years ago

I will work on mixed meshes in another issue/PR.

ChrisBarker-NOAA commented 3 years ago

hmm, I'm wondering about the amount of nesting here -- but you're probably right, keeping the names of everything in one object probably makes sense.

But where to put it? I think an attribute of the UGRid object makes sense.

When a UGRid is read from a netcd file, the "schema" object would get populated.

When a UGrid object was created "from scratch" -- the names object would get populated with default names -- or they could be hand-added.

Any other way to create a UGRid object would be responsible for populating the names object.

Have you tried putting any of this code in the existing code yet?

groutr commented 3 years ago

@ChrisBarker-NOAA Could elaborate on the nesting? Is it the amount of nesting in the schema itself? I basically tried to mirror the structure of the existing UGRID netcdf, but promoting the mesh variable attributes to the top level as I felt they were the most critical details. If you have ideas on a better layout, I'm happy to hear them.

I have most of it coded already I'll put in a PR. I was finishing up my proof of concept when I discovered the missing dimension attributes. I'll omit the integration bit and hopefully that will become clearer with some actual code to review.

ChrisBarker-NOAA commented 3 years ago

By nesting, I mean do we have:

grid.names.var_names.nodes_name

or

grid.names.node_name

or

grid.node_varname

right now, I'm thinking:

grid.var_names.nodes_name

which I think is more or less what you're doing.

so mesh_name is the only top-level attribute, and all the other names are only one level deep.

groutr commented 3 years ago

An example Schema object:

{'name': 'mesh',
'cf_role': 'mesh_topology',
'topology_dimension': 2,
'node_coordinates': ['mesh_nodes_x', 'mesh_nodes_y'],
'face_node_connectivity': 'mesh_face_nodes',
'data_model': 'NETCDF4',
'dimensions': {'mesh_nodes': {'name': 'mesh_nodes', size: 100}, 'two': {'name': 'two', 'size': 2}, 'mesh_faces': {'name': 'mesh_faces', 'size': 200}},
'variables': {'mesh_nodes_x': {'name': 'mesh_nodes_x', 'dtype': np.double, 'dimension': ('mesh_nodes',), 'attrs': {'units': 'degrees_east', 'standard_name': 'longitude'}, ...}

How this schema object should relate to the UGrid object is unclear to me. One possibility is to attach it via an attribute.

ChrisBarker-NOAA commented 3 years ago

Got it. my thoughts: 1) yes, attach it as an attribute 2) I'm not sure I like the name "schema" -- but that's minor. the actual scheme is pre-defined, these are just the names for things. 3) There are a few of the relationships that are fixed by UGRID -- as in what dimensions each variable has -- so the only thing that belong here is the names (though storing those relationships somewhere might make for less boilerplate code) 4) Maybe flatten it out a bit, e.g.:

'x_coords': 'mesh_nodes_x',
'y_coords': 'mesh_nodes_x',

Maybe make it a python object at the top level, e.g.

names.topology_dimension = 2
names.face_node_conectivity = 'mesh_face_nodes'

That would be consistent with the rest of the UGrid object, but,now that I think about it, might make for more boilerplate code :-)

groutr commented 3 years ago
  1. We can do that. However, first, I want to make sure we both have the same understanding of things.
  2. I agree. Schema is simply the term I use for a defined data layout. You'll note that in my PR, I named it the Mesh object.
  3. I think you're referring to how dimensions are stored. I store the size because it will be useful later for validation. My hope is to have a function that can check to make sure that a structure is consistent and standard-conforming.
  4. Flattening things out will ultimately undermine what I'm trying to achieve. I have no objection to possibly making things attributes. There should not be any boilerplate code here. If there is, something is wrong.

What I'm trying to achieve is subtle, but in my opinion, very useful. Perhaps I have failed to adequately explain my objective here. If we look at an example of a 1D topology: http://ugrid-conventions.github.io/ugrid-conventions/#1d-network-topology There are topology attributes with specific names like node_coordinates and edge_coordinates. The UGrid object roughly maps these names to attributes, namely: nodes and edge_coordinates. The first part of my proposed suggestion is that the UGrid object map these attribute names exactly (we're missing a grid.node_coordinates attribute). After completely mapping data to correctly named attributes (according to the spec), the second part of my proposed suggestion is to provide an object that can map the names of an incoming UGrid netcdf file to the attributes defined in the spec. A specific query would be "what does the name mymesh_node_lon map to?" The answer is that it maps to the first column of the array stored in grid.node_coordinates. This means that I can discover what attribute to inspect on the UGrid object just by knowing the name of it from the netcdf file. Conversely, when reproducing the file, it's the name we want to know/preserve when saving the UGrid object to a netcdf (a query of the form "what do I need to name the grid.node_coordinates attribute in this file?").

ChrisBarker-NOAA commented 3 years ago

Thanks for the clarifications. A few thoughts:

There are topology attributes with specific names like node_coordinates and edge_coordinates. The UGrid object roughly maps these names to attributes, namely: nodes and edge_coordinates. The first part of my proposed suggestion is that the UGrid object map these attribute names exactly (we're missing a grid.node_coordinates attribute).

That may be been the way to go in the first place, but I really don't want to change those names in the UGrid object at this point -- and adding aliases simply confuses things, and I'm not sure we buy much. But if it's only one name, maybe ....

to provide an object that can map the names of an incoming UGrid netcdf file to the attributes defined in the spec

yes, that is very useful -- see my notes on the PR -- that's required to read a netCDF file anyway. Currently, the rading code does all the introspection while reading the file, which is less than ideal.

This means that I can discover what attribute to inspect on the UGrid object just by knowing the name of it from the netcdf file

I'm not sure that it's very useful to ask the question in that direction -- the point of the UGrid object is to provide a consistent interface to ANY grid -- you can load a grid, and work with it with one API (and indeed, the point of gridded is that any operation that is not specific to a grid type presents that same API as well.

What is most useful is

A) "which variable in this netcdf file performs this particular function (i.e. is mapped to a UGrid attribute)

And the second one is: B) "what should I name this variable when I write this file".

And is is what could use some major refactoring, and B is a new feature.

But once you have that, yes you could ask what role a particular variable name in a netcdf file plays -- but that shouldn't be the priority.

BTW: I may be remembering wrong here, but I think the UGrid.nodes array is n x 2 for computational convenience in Python, but is stored as two 1D arrays in the netcdf, to conform to the CF conventions -- so that's not quite a one-to-one mapping.