corteva / rioxarray

geospatial xarray extension powered by rasterio
https://corteva.github.io/rioxarray
Other
517 stars 81 forks source link

BUG: Support writing GCPs to netCDF #787

Closed snowman2 closed 3 months ago

snowman2 commented 3 months ago

Note: No need to be backwards compatible because writing to netCDF didn't work previously and all other scenarios would have been in-memory only or to GeoTiff, which should work with this change.

@mraspaud

snowman2 commented 3 months ago

Thanks @justingruca :+1:

mraspaud commented 3 months ago

@snowman2 Sorry, long midsummer week-end here in Sweden, so I couldn't check this in time. This PR is actually introducing backwards incompatible changes as the gcps are now a string (dumped json) instead of a geojson dictionary. We could discuss which is the best, but we should at least have a deprecation cycle for this change, right?

mraspaud commented 3 months ago

This change actually breaks our code which expects gcps a dictionary. Looks like the test here was changed just for that purpose... https://github.com/corteva/rioxarray/pull/787/files#diff-11692352ce364dc8d376f73717ca1390e291d39a8f8de80526ada58ccdf5187cR2990

snowman2 commented 3 months ago

This change actually breaks our code which expects gcps a dictionary.

Apologies for that. I didn't realize you were reading the GCPs directly from the file instead of through the rioxarray interface. To help me better understand your use case, why doesn't the code use rio.get_gcps()?

snowman2 commented 3 months ago

This PR is actually introducing backwards incompatible changes as the gcps are now a string (dumped json) instead of a geojson dictionary. We could discuss which is the best, but we should at least have a deprecation cycle for this change, right?

Yes, a deprecation cycle would have been better. I just didn't foresee any potential breakages and figured it would be okay to proceed with the change :upside_down_face:

snowman2 commented 3 months ago

@mraspaud are there file formats you can write a dict to?

mraspaud commented 3 months ago

We are using the "gcps" attribute of the "spatial_ref" coordinate, not the get_gcps function, see here https://github.com/pytroll/satpy/blob/df4a1baef4594a0fa8b84f49fea2d7f35e37b979/satpy/readers/sar_c_safe.py#L667 And it's not about the writing, I totally see the point of writing the serialised json to file. But the type of that attribute changed. Maybe the serialisation should happen later in the chain?

snowman2 commented 3 months ago

Maybe the serialisation should happen later in the chain?

There currently aren't any hooks to make that automagically happen before the .to_netcdf call.

snowman2 commented 3 months ago

Would there be any issues changing this line:

gcps = self._data.coords["spatial_ref"].attrs["gcps"]

to:

gcps = self._data.rio.get_gcps()
snowman2 commented 3 months ago

Would need to change this:

        gcps = self._data.coords["spatial_ref"].attrs["gcps"]
        crs = self._data.rio.crs

        gcp_list = [(feature["properties"]["row"], feature["properties"]["col"], *feature["geometry"]["coordinates"])
                    for feature in gcps["features"]]
        gcp_array = np.array(gcp_list)

        ypoints = np.unique(gcp_array[:, 0])
        xpoints = np.unique(gcp_array[:, 1])

        gcp_lons = gcp_array[:, 2].reshape(ypoints.shape[0], xpoints.shape[0])
        gcp_lats = gcp_array[:, 3].reshape(ypoints.shape[0], xpoints.shape[0])
        gcp_alts = gcp_array[:, 4].reshape(ypoints.shape[0], xpoints.shape[0])

        rio_gcps = [rasterio.control.GroundControlPoint(*gcp) for gcp in gcp_list]

        return (xpoints, ypoints), (gcp_lons, gcp_lats, gcp_alts), (rio_gcps, crs)

to:

        gcps = self._data.rio.get_gcps()
        crs = self._data.rio.crs

        gcp_array = np.array([gcp.row, gcp.col, gcp.x, gcp.y, gcp.z for gcp in gcps])
        ypoints = np.unique(gcp_array[:, 0])
        xpoints = np.unique(gcp_array[:, 1])
        gcp_lons = gcp_array[:, 2].reshape(ypoints.shape[0], xpoints.shape[0])
        gcp_lats = gcp_array[:, 3].reshape(ypoints.shape[0], xpoints.shape[0])
        gcp_alts = gcp_array[:, 4].reshape(ypoints.shape[0], xpoints.shape[0])
        return (xpoints, ypoints), (gcp_lons, gcp_lats, gcp_alts), (gcps, crs)
mraspaud commented 3 months ago

Would there be any issues changing this line:

gcps = self._data.coords["spatial_ref"].attrs["gcps"]

to:

gcps = self._data.rio.get_gcps()

In principle, yes, it is possible.

However, the idea was to abstract from the underlying type and thus use the more interoperable geojson instead. That's why, when I implemented the gcp support, I made sure the geojson dictionary was available through the coordinate attribute.

But at the same time, I understand that this makes it difficult to have in a DataArray, as xarray doesn't know how to serialize dicts, and doesn't provide a mechanism for custom encoding of attributes.

As a compromise, would it be possible to implement a get_gcps as returning the geojson dict instead, and maybe have an alternative get_gdal_gcps to return the underlying gcp objects?

Regarding this PR, do you think we could reverse it for now, until the other functionalities are implemented?

snowman2 commented 3 months ago

First of all, I apologize for the trouble of this MR. I would handle it differently if I could go back. Now that it is added, I hesitate to revert it for the reasons below.

To help the discussion, I wanted to highlight two elements that are part of the design philosophy of rioxarray:

Storing the information as a dictionary inside of spatial_ref breaks the first element. I didn't catch that when you originally added this and was surprised when it came up in #778.

Modifying get_gcps to return a GeoJSON object breaks the second element. It will make it harder to pass GCPs around to the rasterio methods.

snowman2 commented 3 months ago

One workaround could be to do:

if isintance(gcps, str):
    gcps = json.loads(gcps)

Then, could cache that elsewhere. Alternatively, get_gcps could be modified to check for a string/dict and you could write gcps as a dict after the initial load.

djhoese commented 3 months ago

As someone who is mildly effected by this PR (Martin's Satpy code failing in Satpy CI), I unfortunately for Martin think I agree with @snowman2. One "hack" we've benefited from in some places of Satpy is putting python objects or complex structures inside of Xarray objects' attrs or coords. I/we found out late in Satpy's development that it is generally discouraged or even not supported by Xarray to store things this way and should be avoided if at all possible. I remember an Xarray PR a while ago that took at least one xarray release cycle to get fixed because we were using Xarray in a way the xarray maintainers did not anticipate.

So I think the rioxarray philosophy element 1 is inline with this idea of avoiding complex structures (or python objects) inside Xarray objects.

mraspaud commented 3 months ago

One workaround could be to do:

if isintance(gcps, str):
    gcps = json.loads(gcps)

Then, could cache that elsewhere. Alternatively, get_gcps could be modified to check for a string/dict and you could write gcps as a dict after the initial load.

Yes, I suppose that's the logical solution to keep gdal/rasterio encapsulated.

I'll try to make time before my holiday next week to change this, and to make a new release of satpy.

mraspaud commented 3 months ago

@snowman2 thanks for explaining the guiding principles of rioxarray, let's hope I can monitor my github notifications better in the future not to miss any urgent pings :)

snowman2 commented 3 months ago

Hopefully this won't be a regular occurrence 😅