Deltares / xugrid

Xarray and unstructured grids
https://deltares.github.io/xugrid/
MIT License
66 stars 8 forks source link

How to deal with multiple (x,y)-like coordinates? #15

Closed Huite closed 2 years ago

Huite commented 2 years ago

As clarified in issue 58 in the UGRID conventions, multiple node_coordinates (or edge, or face, etc.) are allowable.

For the representation and indexing of spatial data, generally two dimensions (one x axis, one y axis)** are desirable. These may be generally be identified by their standard_role (e.g. projection_x_coordinate). However, there is nothing that forbids multiple coordinates (for example both latitude, longitude and y, x). This doesn't have to be painful: the sel operator currently takes just x and y, but these could be keys proper.

In principle, the same is true for all the plotting functions that produce x,y (z) plots: xarray has x and y arguments for this reason. For convenience, when these are None, x and y are assumed da.dims[0], da.dims[1].

The difference with structured data is that we can generally include coordinate values for the dimensions (x and y, lon and lat). These are directly associated through the same name. This isn't the case for unstructured data: e.g. elevation has a node_dimension, and node_x and node_y both share the node_dimension; and the same for node_lon, node_lat.

However, xarray just has a single index for a dimension. In a short example:

import numpy as np
import xarray as xr

da = xr.DataArray(
    data=np.random.rand(2, 2),
    dims=["y", "x"],
    coords={"y": [5, 4], "x": [1, 2], "y1": ("y", [50, 40]), "x1": ("x", [10, 20])},
)

da.sel(x=slice(0.5, 1.5))   
da.sel(x1=slice(5, 15))   # KeyError: 'no index found for coordinate x1'

da2 = da.set_index({"x": "x1", "y": "y1"})
da.sel(x=slice(5, 15))  # works

In the majority of cases, only a single coordinate system with a single set of coordinates is of concern to the user. So a set_index on the accessor method should suffice. These are stored as the default x,y coordinates and the grid object; changing the index then also clears the spatial index (celltree, Rtree, whatever) and results in it being rebuilt for a next query. These default coordinates may also be used for the plotting methods as default arguments in the stead of da.dims[0], da.dims[1].

This rasies some some potential complications for edge and face coordinates: these may have been stored in the file and are available (e.g. both node_x, node_lon, edge_x, edge_lon). Currently, this should not pose a problem: they may all be rederived from the node coordinates (although there are multiple definitions for e.g. a centroid of a face). Perhaps the easiest is:

def set_index(node_x, node_y, edge_x=None, edge_y=None, face_x=None, face_y=None):
    ...

And clear them if not provided. Alternatively, a convenience method could use standard_name in combination with crs (since multiple projections may have the same standard name).

** As mentioned in that issue, S2 coords may exist too, where it's a single number. I would reckon that is a too little academic at this stage: to produce a spatial plot, we always require translation to cartesian planes anyway (with or without an implicit Mercator projection, or cartopy...).

Huite commented 2 years ago

This also has some interaction with #10 and how sel should work. As mentioned there, a UgridDataset should be capable of representing multiple topologies (1D network, 2D mesh, etc.) in a single dataset. However, this means that whereas a DataArray has, through set_index, a definite set of "x" and "y" coordinates, a Dataset does not.

For a data array .ugrid.sel() always has "x" and "y" arguments, which are linked to the index that has been set. However, for a Dataset, there may be multiple grids with their own index. These do not have to share coordinate systems, one might be lat-lon, the other one projected.

One option is to require, like xarray, a dict of indexers:

uds.ugrid.sel(mesh_node_x=..., mesh_node_y=...)

This would operate only on the Ugrid objects with those coordinates. It does mean that one additional operation is required for each topology in the dataset. This is likely the best approach: in a regular xarray dataset, multiple variables can be stored with different coordinates in the same coordinate system, and one cannot generate a subset of all them via a single operation either.

This is also easily extendable to 3D meshes.

Huite commented 2 years ago

More notes:

Currently, the Ugrid objects contains mostly numpy arrays for coordinates and connectivity. Additionally, it contains a dataset which is really only used as intermediate storage, before the topology is sent off to a dataset again when writing to file.

This is not a great idea: the data is duplicated and may become inconsistent. It also creates different states for when the grid has been created from a dataset or from topology (e.g. after mesh generation). There's something to be said for storing it in a dataset, since we'd like to preserve attributes and dimensions next to the values anyway.

We would like a perfect roundtrip. This requires storing all dimensions (stored anyway, except for the non-essential dimensions such as "two" for the number of nodes per edge), storing the attributes, and storing the names of the variables; the variable names are stored anyway, except that the coordinates field may contain many values. This is not that different from xarray, since they're available as "indexes".

Some examples:

1

Work:

2

Work:

should result in a nice round trip without (significant) changes

3

Work:

Huite commented 2 years ago

Implemented (with probably a few rough edges): 9563e83eeb93346bdae4a3ef9441a4ab3c9ebb4e