benbovy / xproj

Xarray extension for projections and coordinate reference systems
Apache License 2.0
13 stars 0 forks source link

Single vs. multi CRS datasets #2

Open benbovy opened 17 hours ago

benbovy commented 17 hours ago

This is a big question that may generate lots of discussion: should we allow here only one CRS defined per xarray.Dataset / xarray.DataArray or should we support multiple CRS?

Why supporting multiple CRSs?

Xproj relies on scalar Xarray coordinates with a CRSIndex. This is inspired by the CF-conventions, and such kind of coordinate is used already in tools like rioxarray and odc-geo. AFAIK there doesn't seem to be any restriction in the CF-conventions about the number of grid mapping coordinate variables? (see CF-conventions 1.10, section 5.6, although I might have overlooked it? (EDIT: example 5.13 has both lat-lon geographic and x-y projected coordinate systems in the same dataset). There's no real technical barrier either in supporting multi-CRS with the current Xarray data model (coordinates + custom indexes).

Other Xarray extensions like xvec technically support multiple CRS (xvec currently encapsulates the CRS into the geometry coordinate / index). Although I’m not sure if multi-CRS vector data cubes exist and/or make sense, in theory there will be some friction in adopting xproj if the latter only works with single-CRS (breaking changes).

Single-CRS is easy to enforce in 3rd-party extensions. #1 provides a convenient API to work with single-CRS datasets or dataarrays, while still supporting the multi-CRS case.

Why this may be a bad idea?

Supporting multi-CRS datasets possibly opens a can of worms?

This is potentially confusing. I cannot imagine a DataArray representing a single raster (or a mosaic / stack of rasters) have multiple CRSs defined. I haven't checked but I guess that rioxarray and odc-geo won't work with multi-CRS (EDIT: I checked in rioxarray and it doesn't support multiple grid mapping attribute values). Although here again, single-CRS may be enforced in those libraries and no big deal if xproj provides a user-friendly, single-CRS API (#1)?


Any thoughts on this?

scottyhq commented 9 hours ago

I think a single CRS is the simpler conceptual data model and worth enforcing. While it may be possible to represent coordinates in multiple CRSs in a single object, I've yet to see a compelling use-case compared to the alternative of reprojecting coordinates to a new CRS during/after loading.

For vectors, the number of coordinates doesn't change between CRS (I think), but even so, libraries like geopandas operate under the assumption of a single 'active CRS' at a time

For a rasters, the number of coordinates will likely change between CRSs (under the assumption of maintaining the same dx,dy resolution), so I don't see that fitting well with the Xarray data model.

EDIT: example 5.13 has both lat-lon geographic and x-y projected coordinate systems in the same dataset

I've never actually come across a netCDF file that does this! Would be interested to know if others have? Currently, I think such a dataset would need to be opened with Xarray DataTree since the grid sizes are different.

benbovy commented 7 hours ago

since the grid sizes are different

Hmm I thought the example 5.13 in the CF-conventions (version 1.11) would look like below when loaded into a xarray Dataset?

>>> ds
<xarray.Dataset>
Dimensions:      (x: 18, y: 36)
Coordinates:
  * x           (x) float64 ...
  * y           (y) float64 ...
  * lat         (y, x) float64 ...
  * lon         (y, x) float64 ...
    crs_wgs84   int64 0
  * crs_osgb    int64 0
Data variables:
    temp         (y, x) float64 ...
Indexes:
    x            PandasIndex
    y            PandasIndex
    crs_osgb     CRSIndex (crs=EPSG:27700)

This assumes a single-CRS model (using the projected CRS as the "active" one).

Now assuming a multi-CRS model and a geographic Xarray index set for the lat/lon coordinates, it may look like:

>>> ds
<xarray.Dataset>
Dimensions:      (x: 18, y: 36)
Coordinates:
  * x           (x) float64 ...
  * y           (y) float64 ...
  * lat         (y, x) float64 ...
  * lon         (y, x) float64 ...
  * crs_wgs84   int64 0
  * crs_osgb    int64 0
Data variables:
    temp         (y, x) float64 ...
Indexes:
    x            PandasIndex
    y            PandasIndex
  ┌ lat          GeographyIndex
  └ lon
    crs_wgs84    CRSIndex (crs=EPSG:4326)
    crs_osgb     CRSIndex (crs=EPSG:27700)

(note: I plan to refactor xoak so that it provides such Xarray index built from 2D spatial coordinates, see also https://github.com/xarray-contrib/xoak/issues/34)

I find the latter example pretty illustrative and useful, actually (i.e., it allows selecting data either based on lat/lon or x/y coordinates). On a related note, I'm wondering if the reason why rioxarray doesn't support multiple grid mappings is because so far we haven't been able to do much with 2D spatial coordinates in Xarray.

I could imagine similar examples of vector data cubes were we can select data based on either planar or spherical geometries.

I agree that single-CRS is a simpler conceptual data model, and perhaps we could imagine the concept of "active" CRS for the example above? That said, as far as I understand the concept of "active" geometry column (and its CRS) in GeoPandas is rather specific to the dataframe model (often based on one index), which is different from the Xarray data model. The code below (API and behavior) looks pretty clear to me but sadly wouldn't be possible to write if we enforce single-CRS.

>>> ds.proj("crs_wgs84").crs
<Geographic 2D CRS: EPSG:4326>
...

>>> ds.proj("crs_osgb").crs
<Projected CRS: EPSG:27700>
...

>>> ds.proj.crs
ValueError: multiple CRSs found

>>> ds_latlon = ds.drop_vars(["x", "y", "crs_osgb"])

>>> ds_latlon.proj.crs
<Geographic 2D CRS: EPSG:4326>
...

>>>  # ... continue using `ds_latlon` in CRS-aware operations without any extra-step required

I'm sure a multi-CRS model would introduce some extra complexity for other things (re-projection API?), but I wonder if we can keep it under control.