Closed Huite closed 2 years ago
This also has some interaction with #10 and how sel
should work. As mentioned there, a UgridDataset should be capable of representing multiple topologies (1D network, 2D mesh, etc.) in a single dataset. However, this means that whereas a DataArray has, through set_index
, a definite set of "x" and "y" coordinates, a Dataset does not.
For a data array .ugrid.sel()
always has "x" and "y" arguments, which are linked to the index that has been set.
However, for a Dataset, there may be multiple grids with their own index. These do not have to share coordinate systems, one might be lat-lon, the other one projected.
One option is to require, like xarray, a dict of indexers:
uds.ugrid.sel(mesh_node_x=..., mesh_node_y=...)
This would operate only on the Ugrid objects with those coordinates. It does mean that one additional operation is required for each topology in the dataset. This is likely the best approach: in a regular xarray dataset, multiple variables can be stored with different coordinates in the same coordinate system, and one cannot generate a subset of all them via a single operation either.
This is also easily extendable to 3D meshes.
More notes:
Currently, the Ugrid objects contains mostly numpy arrays for coordinates and connectivity. Additionally, it contains a dataset which is really only used as intermediate storage, before the topology is sent off to a dataset again when writing to file.
This is not a great idea: the data is duplicated and may become inconsistent. It also creates different states for when the grid has been created from a dataset or from topology (e.g. after mesh generation). There's something to be said for storing it in a dataset, since we'd like to preserve attributes and dimensions next to the values anyway.
We would like a perfect roundtrip. This requires storing all dimensions (stored anyway, except for the non-essential dimensions such as "two" for the number of nodes per edge), storing the attributes, and storing the names of the variables; the variable names are stored anyway, except that the coordinates field may contain many values. This is not that different from xarray, since they're available as "indexes".
Some examples:
Work:
from_dataset()
with multiple coordinates (lat-lon, projected)set_index
like method).to_dataset()
Work:
should result in a nice round trip without (significant) changes
assign_edge_coords
)to_dataset()
Work:
Implemented (with probably a few rough edges): 9563e83eeb93346bdae4a3ef9441a4ab3c9ebb4e
As clarified in issue 58 in the UGRID conventions, multiple
node_coordinates
(or edge, or face, etc.) are allowable.For the representation and indexing of spatial data, generally two dimensions (one x axis, one y axis)** are desirable. These may be generally be identified by their
standard_role
(e.g.projection_x_coordinate
). However, there is nothing that forbids multiple coordinates (for example both latitude, longitude and y, x). This doesn't have to be painful: thesel
operator currently takes just x and y, but these could be keys proper.In principle, the same is true for all the plotting functions that produce x,y (z) plots: xarray has x and y arguments for this reason. For convenience, when these are None, x and y are assumed
da.dims[0], da.dims[1]
.The difference with structured data is that we can generally include coordinate values for the dimensions (x and y, lon and lat). These are directly associated through the same name. This isn't the case for unstructured data: e.g. elevation has a
node_dimension
, andnode_x
andnode_y
both share thenode_dimension
; and the same fornode_lon
,node_lat
.However, xarray just has a single index for a dimension. In a short example:
In the majority of cases, only a single coordinate system with a single set of coordinates is of concern to the user. So a
set_index
on the accessor method should suffice. These are stored as the default x,y coordinates and the grid object; changing the index then also clears the spatial index (celltree, Rtree, whatever) and results in it being rebuilt for a next query. These default coordinates may also be used for the plotting methods as default arguments in the stead ofda.dims[0], da.dims[1]
.This rasies some some potential complications for edge and face coordinates: these may have been stored in the file and are available (e.g. both node_x, node_lon, edge_x, edge_lon). Currently, this should not pose a problem: they may all be rederived from the node coordinates (although there are multiple definitions for e.g. a centroid of a face). Perhaps the easiest is:
And clear them if not provided. Alternatively, a convenience method could use
standard_name
in combination with crs (since multiple projections may have the same standard name).** As mentioned in that issue, S2 coords may exist too, where it's a single number. I would reckon that is a too little academic at this stage: to produce a spatial plot, we always require translation to cartesian planes anyway (with or without an implicit Mercator projection, or cartopy...).