Open dcherian opened 2 years ago
Thanks for making the prototype. That looks pretty nice.
Do you think it would be better to wait before officially adding it to rioxarray so some wrinkles can be ironed out first? Or do you think that adding it to rioxarray will help the development process for indexes in xarray?
Thoughts:
CRSIndex
in rioxarray
in the rio.write_crs
method. That is used everywhere and should propagate to all of the rioxarray
methods that use the CRS
.adding it to rioxarray will help the development process for indexes in xarray?
I think this will be a very useful experiment.
dd the CRSIndex in rioxarray in the rio.write_crs method. That is used everywhere and should propagate to all of the rioxarray methods that use the CRS.
IMO it would be good to try this and see what fails in the test suite. That would be a more exhaustive test than my suggestion to pick a single "workflow" example./
Sounds like a good plan to me.
Anyone who would like to try implementing the CRSIndex and adding it to rio.write_crs
is welcome to, If you do get started, please post a comment here so we can coordinate our efforts. I will make sure to post in the thread if I get some free time to work on it.
Anyone who would like to try implementing the CRSIndex and adding it to rio.write_crs is welcome to, If you do get started, please post a comment here so we can coordinate our efforts.
@snowman2 we've finally been making some headway trying this out. Currently testing things on my fork in this PR https://github.com/scottyhq/rioxarray/pull/1, and borrowing from @benbovy's linked xvec implementation and @dcherian's prototype notebook. The example notebook in the PR highlights some use-cases and goals (https://github.com/scottyhq/rioxarray/blob/crsindex/docs/examples/crs_index.ipynb).
Nice :+1:
My first thought is to make use_crs_index=True/False
a global option ref. Another idea was to add a crs_index_error=True/False
global option to raise an exception if enabled and the CRS are different, otherwise raise a warning.
make use_crs_index=True/False a global option
Could this screw up a downstream package though? Or maybe we want that to shake out all the bugs :P
crs_index_error=True/False global option to raise an exception if enabled and the CRS are different,
It may be prudent to be strict now and loosen things later. If fixing the error is simply .rio.write_crs(...)
, then the error message could print out whats needed.
Could this screw up a downstream package though?
It works as a context manager and so it can be applied locally as well. Though, the current strategy would work as well. And as you mentioned, it would be good to have a way to do a thorough bug search.
It may be prudent to be strict now and loosen things later.
Since it is optionally enabled (and off by default), being strict is fine.
FYI @scottyhq and I are thinking of hacking on this during the SciPy sprints. let us know if anyone interested in this stuff will be around
I'm thinking about different use cases that would all benefit from an Xarray CRS-aware index but where the kind of index would greatly differ from one case to another:
Instead of duplicating the CRS-related logic in all those indexes, I'm wondering if alternatively we could provide somewhere a CRS "meta" index so that we can combine it with other indexes like CRSIndex(GeometryIndex)
, CRSIndex(RasterIndex)
, CRSIndex(KDTreeIndex)
, etc (not class inheritance but rather composition). The basic idea is to handle all CRS stuff in the index wrapper and delegate the rest to the wrapped index. Together with @keewis we are currently experimenting on a unit-aware (pint) index using that approach: https://github.com/xarray-contrib/pint-xarray/pull/163 (WIP). It seems to work well.
Some open questions:
cc @djhoese @martinfleis
Is is really possible to have a CRSIndex that is agnostic of the type of the index object it wraps?
It is not that easy, actually. We need to get the actual CRS information somewhere in order to build the CRSIndex. It is more manageable if each package (xvec, rioxarray, etc.) takes care of that rather than having a common package trying to support all different ways / standards to get that information. Also it is possible that the wrapped index needs to access the CRS for some computation (for example, would it be useful to have something like GeoBox
of https://github.com/opendatacube/odc-geo embed in a raster index?)
Forgive anything that is obvious to you and isn't to me about xarray Index objects. I've never created one and have never used a custom one.
I like your layout of special types of indexes and wrapping (decorating?) them with CRS information. I assume the main reason an index like these would need to know about the CRS is to allow or disallow or adapt to two xarray objects being merged? Like one of the examples in https://github.com/scottyhq/rioxarray/blob/crsindex/docs/examples/crs_index.ipynb where it shows doing +
between two Datasets that have different CRSes and getting an error. Right? Besides that, is there anything that the Index (and only the Index) needs to do with the CRS?
You mention odc-geo's GeoBox
(or other shapely-like geometry objects) being embedded in the Index. What does an Index do that would mean it needs access to that versus a odc-geo or geoxarray accessor having a property or method that produces a GeoBox
?
Also you mention cyclic dependencies. Are you saying that if CRSIndex when into geoxarray that would lead to a cyclic dependency with rioxarray?
I find the CRSIndex(GeometryIndex)
, CRSIndex(RasterIndex
), CRSIndex(KDTreeIndex)
logic a bit tricky. Because a user only sees CRSIndex
, not what it is subclassing. So if we have some methods on the GeometryIndex
but they are not on the KDTreeIndex
, the two classes called CRSIndex
would behave differently.
If we should do subclassing, then maybe the other way? That would allow us to have a common subset of functionality dealing with CRS and custom index handling on top of that. The core class may not even be an index but some form of a CRSHandler
that is a bit like pyproj.crs.CRS
class but potentially detached from PROJ.
Forgive anything that is obvious to you and isn't to me about xarray Index objects. I've never created one and have never used a custom one.
No worries at all! This is still very experimental and there's not much documentation or reading material about how to create and use them. TBH I also have no clear proposal for a "reusable" CRSIndex yet and I might be missing obvious CRS or other domain-specific things.
I assume the main reason an index like these would need to know about the CRS is to allow or disallow or adapt to two xarray objects being merged?
Yes I guess that's one of the main reasons. It could be used also for validating inputs given to Dataset (DataArray) .sel
or even re-project on-the-fly, although I don't think that the latter is a good idea. Maybe other use cases? For example, it is possible to have fine control on how the coordinates are propagated when performing operations on Datasets (DataArrays) via Index.create_variables.
What does an Index do that would mean it needs access to that versus a odc-geo or geoxarray accessor having a property or method that produces a GeoBox?
That is the question, actually. I don't know if/how a GeoBox could be used by an index. Support data selection by passing world coordinates and convert it as pixel coordinates (inverse affine transformation)? Support more advanced alignment behavior than allow / disallow?
About index vs. accessor, I think it relates to the more general question of whether a GeoBox is unique to a Dataset (DataArray) or is more closely related to a subset of coordinates (and to a less extent how hard / expensive it is to rebuild from scratch).
I find the CRSIndex(GeometryIndex), CRSIndex(RasterIndex), CRSIndex(KDTreeIndex) logic a bit tricky. Because a user only sees CRSIndex, not what it is subclassing.
Sorry the CRSIndex(IndexType)
notation was confusing. I meant a CRSIndex encapsulating another Xarray index, not subclassing.
It could be used also for validating inputs given to Dataset (DataArray) .sel or even re-project on-the-fly, although I don't think that the latter is a good idea. Maybe other use cases? For example, it is possible to have fine control on how the coordinates are propagated when performing operations on Datasets (DataArrays) via Index.create_variables.
What type of validation of inputs did you have in mind for .sel
? I could see units being checked, but I suppose that's more in the realm of your pint-enabled indexes and as long as the pint index uses the same units as the CRS then compatibility might come "for free". Otherwise in my experience if coordinates are outside the projection space they would be NaN or wouldn't be within the range of the index values anyway. Anyway, the merging of two incompatible CRS-based DataArrays is a good enough reason for me I think...assuming implementing this isn't very hard.
About index vs. accessor, I think it relates to the more general question of whether a GeoBox is unique to a Dataset (DataArray) or is more closely related to a subset of coordinates (and to a less extent how hard / expensive it is to rebuild from scratch).
If an Index is used by multiple xarray objects, is it the same instance of that index or does it get copied? I see a GeoBox or any other type of polygon as being a representation of the data coordinates, but separate from the Index. You could have an Index that uses a GeoBox internally I suppose. You could also have shapely polygons produced in various CRSes which is why I felt like you need access to the full DataArray coordinates and possibly attrs. Seems more accessor-y than index-y, but yeah maybe not familiar enough to comment beyond how this "feels" to me.
What type of validation of inputs did you have in mind for .sel?
CRS validation, assuming that the input labels passed to .sel
can be in the form of DataArrays each with (or without) a CRS attached to it.
I see a GeoBox or any other type of polygon as being a representation of the data coordinates, but separate from the Index.
Yes, although it actually often serves as a rough index itself.
You could also have shapely polygons produced in various CRSes which is why I felt like you need access to the full DataArray coordinates and possibly attrs. Seems more accessor-y than index-y, but yeah maybe not familiar enough to comment beyond how this "feels" to me.
I'll try to explain why this feels index-y to me (sorry in-advance if I'm repeating things for which you are already familiar). Beyond what we can expect from an "index", a custom Xarray Index is an arbitrary object that is closely tied to one or more coordinates and that offers some additional flexibility and benefit:
Index.create_variables()
, which is useful if we still want to have extra information always kept in-sync in coordinate attributes.If a representation (object) is specific to a whole DataArray (Dataset) and is cheap to re-build, then an accessor may be a good solution. If the representation is more specific to a subset of coordinates, then a custom Xarray index for those coordinates seems the right approach IMO. This doesn't prevent accessing that object from an accessor, though. I'd rather see a GeoBox encapsulated in a geo Index and accessed via an accessor than the other way around... Accessors are good for extending the DataArray (Dataset) API but not so good for extending its data model.
I'd rather see a GeoBox encapsulated in a geo Index and accessed via an accessor than the other way around... Accessors are good for extending the DataArray (Dataset) API but not so good for extending its data model.
I like this point.
I see from the xarray docs that we could have a meta index (or maybe there are other ways too) so we can have access to both x and y coordinates in a single CRS-based index. I could see this being important for having internal polygons or an index that represents the coordinates in a different CRS (or allows for inputs from other CRSes).
In your original recent comment from last week you said:
Dealing with the various ways to represent the CRS may be challenging
I'm not sure we ever exactly addressed this. Your later comment said:
It is not that easy, actually. We need to get the actual CRS information somewhere in order to build the CRSIndex. It is more manageable if each package (xvec, rioxarray, etc.) takes care of that rather than having a common package trying to support all different ways / standards to get that information
What various ways of representing the CRS were you thinking? I think rioxarray and geoxarray and other libraries should all (hopefully) be depending on pyproj's CRS
object. Any conversion from various ways to represent a CRS should be handled by that. The individual packages and their potential engines or accessors may create the CRS index that is then provided to an index, but I don't see a reason (yet) why that CRS creation can't be part of a common package (geoxarray). And at that point if this common library is being depended on to create the CRS then it should probably just have a utility (accessor or function or other) for adding the CRS-based index for you with the CRS that it parses/loads.
Sorry that was confusing!
I think rioxarray and geoxarray and other libraries should all (hopefully) be depending on pyproj's CRS object.
Agreed! Xvec uses pyproj's CRS too.
Let me try to illustrate my suggestion with an example.
CRSIndex encapsulates another arbitrary index and takes care of checking CRSes during alignment:
from typing import Any, Generic, Hashable, TypeVar
import xarray as xr
from pyproj import CRS
from xarray.indexes import Index
T = TypeVar("T", bound=Index)
class CRSIndex(Index, Generic[T]):
def __init__(self, index: T, crs: Any):
self._index = index
self._crs = CRS.from_user_input(crs)
def equal(self, other: Index):
if not isinstance(other, CRSIndex):
return False
if self._crs != other._crs:
return False
return self._index.equals(other._index)
def join(self, other: "CRSIndex[T]", how: str = "inner") -> "CRSIndex[T]":
if self._crs != other._crs:
raise ValueError
index = self._index.join(other._index, how=how)
return type(self)(index)
def reindex_like(
self, other: "CRSIndex[T]", method=None, tolerance=None
) -> dict[Hashable, Any]:
if self._crs != other._crs:
raise ValueError
return self._index.reindex_like(
other._index, method=method, tolerance=tolerance
)
We can couple that with an accessor to provide a convenient way to set a CRS index (this is what xvec already does for GeometryIndex):
@xr.register_dataarray_accessor("crs")
class CRSAccessor:
def __init__(self, xarray_obj: xr.Dataset | xr.DataArray):
self._obj = xarray_obj
def set_crs(self, coord_name: Hashable, crs: CRS):
crs_index = CRSIndex(self._obj.xindexes[coord_name], crs)
new_coords = xr.Coordinates(
{coord_name: self._obj[coord_name].variable}, {coord_name: crs_index}
)
return self._obj.assign_coords(new_coords)
Let's create two dataarrays with an xvec.GeometryIndex
(note: the latter already has support for CRS but here we intentionally not use it, i.e., crs=None).
import xvec
da1 = xr.DataArray(
[1, 2],
coords={"geom": [shapely.Point(1, 2), shapely.Point(3, 4)]},
dims="geom",
)
da1 = da1.xvec.set_geom_indexes("geom")
da2 = xr.DataArray(
[3],
coords={"geom": [shapely.Point(1, 2)]},
dims="geom",
)
da2 = da2.xvec.set_geom_indexes("geom")
xr.align(da1, da2)
# (<xarray.DataArray (geom: 1)>
# array([1])
# Coordinates:
# * geom (geom) object POINT (1 2)
# Indexes:
# geom GeometryIndex (crs=None),
# <xarray.DataArray (geom: 1)>
# array([3])
# Coordinates:
# * geom (geom) object POINT (1 2)
# Indexes:
# geom GeometryIndex (crs=None))
Now lets wrap those indexes so they become CRS-aware:
da1_crs = da1.crs.set_crs("geom", 26915)
da2_crs = da2.crs.set_crs("geom", 3857)
xr.align(da1_crs, da2_crs)
# ValueError (CRS mismatch)
This is just an example, we could have used a RasterIndex, a KDTreeIndex, etc. instead of a GeometryIndex.
Now, the CRSIndex above doesn't do many things apart from checking if the CRSes match, so maybe it is not a big deal implementing the same logic for RasterIndex, GeometryIndex, etc.? CRSIndex stills provides a nice way to enable CRS-aware alignment via a uniform API (and maybe other features could be added).
Alternatively to explicitly providing the CRS via DataArray.crs.set_crs()
, we could extract the CRS from the wrapped index, but this would need to have a common interface for all underlying index types.
I could see some functionality being dependent on if you're doing something in 2D or 3D in your projection space. I'm not sure that prevents both sets of functionality from living in one CRSIndex object, but it is one of the first things that came to mind.
Alternatively to explicitly providing the CRS via DataArray.crs.set_crs(), we could extract the CRS from the wrapped index, but this would need to have a common interface for all underlying index types.
Wouldn't that only require a .crs
property on the Index? I'd hope that we could all agree on that.
I am still not convinced about this idea. It gets tricky as soon as you want to do something more complicated. What about re-projecting if setting the CRS happens on the CRSIndex
level? Shall I do DataArray.crs.set_crs()
to set it but DataArray.xvec.to_crs()
to re-project to another CRS? I guess that should be (from a user point of view) DataArray.crs.to_crs()
but at that point I don't know how that would interact with the wrapped index without additional restrictions on what a common API should look like.
It also feels a bit opaque to see that everything is a CRSIndex rather than specific custom solutions.
I am still not convinced about this idea.
Yeah this idea has still to be more "battle tested" against possible uses cases. I think it is useful to have some way of adding CRS-aware behavior on top of a generic, non-geospatial index. For example, the xoak library will provide (once refactored) indexes that are useful for geospatial and other applications but I think it is outside of the scope of that library to provide any CRS support (and to depend on pyproj).
We could create somewhere else ad-hoc wrappers for each case, e.g., CRSKDTreeIndex(xoak.KDTreeIndex)
, at the risk that this might lead to a rapid increase of various xarray geospatial indexes in the wild with subtle differences in both API and behavior regarding how they handle the CRS. Are there other possible approaches that would mitigate that risk?
I don't know how that would interact with the wrapped index without additional restrictions on what a common API should look like.
Fair point. We would need to agree on a slightly larger protocol than just a .crs
property for a wrapped index to interact with the CRS. Does it seem reasonable? Are there many operations other than set_crs
and to_crs
that should go through CRSIndex? For more specific CRS-aware operations, we could still leave that to the wrapped index.
What about re-projecting if setting the CRS happens on the CRSIndex level?
In such case, it is probably OK to either raise an error or re-project the coordinate data and then rebuild from scratch the wrapped index?
It also feels a bit opaque to see that everything is a CRSIndex rather than specific custom solutions.
I don't think that it is a big deal here. Users do (should) not interact directly with xarray Index objects. We can also make it more transparent through the CRSIndex (inline) repr, similarly to how we represent xarray coordinate or data variables wrapping (possibly nested) duck arrays. I'd be more concerned by how to discover and pass options to the wrapped index (e.g., https://github.com/pydata/xarray/issues/8002).
It'd be nice to experiment some with this prototype CRSIndex (though there are still bugs and work to do)
We could experiment with optionally adding such an index through
rioxarray.open_dataset(..., use_crsindex=True)
and then get some experience with it before refactoring it out to a ecosystem package likegeoxarray
Reprojection seems like a high value application where assigning a new CRSIndex would be quite impactful.
Is there an existing example that might benefit a lot from propagating CRS information automatically?
cc @djhoese
Original Xarray Issue - "Add CRS/projection information to xarray objects"