XarrayReader produces upside down images for some datasets with xarray backend

developmentseed / titiler-cmr

Dynamic tiles from CMR queries

MIT License

6 stars 0 forks source link

XarrayReader produces upside down images for some datasets with xarray backend #34

Open hrodmn opened 1 month ago

hrodmn commented 1 month ago

I'm not sure if this is a problem in titiler-cmr or somewhere upstream (rio-tiler) but when reading images from some NetCDF files without doing a reprojection operation, the images are returned upside down!

This request to one NetCDF dataset returns the expected result: https://9ox7r6pi8c.execute-api.us-west-2.amazonaws.com/bbox/-99.524,19.043,-78.647,31.126.png?concept_id=C2036881735-POCLOUD&datetime=2024-09-20T00:00:00Z/2024-09-21T00:00:00Z&variable=analysed_sst&backend=xarray&rescale=273,325&colormap_name=thermal&return_mask=true

A request to a different NetCDF dataset (MUR SST) with all of the same parameters yields an upside down image! https://9ox7r6pi8c.execute-api.us-west-2.amazonaws.com/bbox/-99.524,19.043,-78.647,31.126.png?concept_id=C1996881146-POCLOUD&datetime=2024-09-20T00:00:00Z/2024-09-21T00:00:00Z&variable=analysed_sst&backend=xarray&rescale=273,325&colormap_name=thermal&return_mask=true

/tile requests to the second dataset return the expected result: https://9ox7r6pi8c.execute-api.us-west-2.amazonaws.com/tiles/WebMercatorQuad/5/8/13.png?concept_id=C1996881146-POCLOUD&datetime=2024-09-20T00:00:00Z/2024-09-21T00:00:00Z&variable=analysed_sst&backend=xarray&rescale=273,325&colormap_name=thermal&return_mask=true

When I specify a dst_crs that is not the dataset's native CRS, /bbox produces a correct image: https://9ox7r6pi8c.execute-api.us-west-2.amazonaws.com/bbox/-99.524,19.043,-78.647,31.126.png?concept_id=C1996881146-POCLOUD&datetime=2024-09-20T00:00:00Z/2024-09-21T00:00:00Z&variable=analysed_sst&backend=xarray&rescale=273,325&colormap_name=thermal&return_mask=true&dst_crs=epsg:3857

hrodmn commented 1 month ago

After digging into the issue some more I think the problem is in the default CRS definition for non-georeferenced datasets in titiler-cmr: https://github.com/developmentseed/titiler-cmr/blob/afe8b6dee5d2202fd8213080f986f3b66125d8b6/titiler/cmr/reader.py#L200-L201

The dataset that works as expected (GAMSSA) contains a georeference while the dataset that is not working correctly (MUR SST) does not contain a georeference and has it manually set to "epsg:4326". If I set the CRS using the CRS WKT from the GAMSSA dataset (instead of "epsg:4326", the MUR SST dataset behaves correctly.

diff --git a/titiler/cmr/reader.py b/titiler/cmr/reader.py
index 979629b..5e0af7d 100644
--- a/titiler/cmr/reader.py
+++ b/titiler/cmr/reader.py
@@ -197,7 +197,10 @@ def get_variable(
         da = da.sortby(da.x)

     # Make sure we have a valid CRS
-    crs = da.rio.crs or "epsg:4326"
+    crs = (
+        da.rio.crs
+        or 'GEOGCS["undefined",DATUM["undefined",SPHEROID["undefined",6379137,298.257223563]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Longitude",EAST],AXIS["Latitude",NORTH]]'
+    )
     da.rio.write_crs(crs, inplace=True)

This does not explain why everything works when a reprojection operation is performed without any changes to the default CRS, but it suggests that defaulting to epsg:4326 is probably not the right thing.

maxrjones commented 1 month ago

rioxarray will set the Affine transformation via the crs attribute and coordinates if it is not already defined in the dataset, which means the data gets "flipped" properly to match the destination transform in the reprojection step. rio_tiler's XarrayReader skips the reprojection entirely if the source and destination crs are the same - https://github.com/cogeotiff/rio-tiler/blob/abec15162c1d8cf9dd67cb5945d8532e2b6f562c/rio_tiler/io/xarray.py#L249-L263. If the bbox endpoint should always return an image that's oriented with y coordinates positive upwards, I'd guess this should be implemented at the rio-tiler level 🤔

I think the automatic assignment of a CRS is a broader issue. I don't agree with assuming that it's "epsg:4326" if it's not in the metadata. I guess there are a lot of datasets that are missing that metadata, so there should be a way for users to get around that still. Maybe a query parameter in the titiler-cmr endpoints?

hrodmn commented 1 month ago

After more investigation with @maxrjones and others, the problem is actually pretty simple. Neither of these datasets follow the convention of y axis coordinates positive upwards in the arrays.

The GAMSSA dataset has a georeference attached to it that rioxarray can find, and when rio-tiler checks to see if the CRS matches the dest_crs, it sees that they are not equal and performs a reprojection. This reprojection operation yields an array that does follow the conventional y-coordinate direction.

The MUR SST dataset does not have a georeference attached to it and gets the "epsg:4326" value hard-coded. In the part method, rio-tiler checks to see if self.input.rio.crs matches dest_crs (default: "epsg:4326") and they do match so no reprojection operation is performed. This means the array is never transformed to match conventional expectation of y-coordinates.

If the bbox endpoint should always return an image that's oriented with y coordinates positive upwards, I'd guess this should be implemented at the rio-tiler level 🤔

We should propose a fix in rio-tiler to ensure the array that is passed to ImageData matches expectation of y axis coordinates positive upwards.

I think the automatic assignment of a CRS is a broader issue. I don't agree with assuming that it's "epsg:4326" if it's not in the metadata. I guess there are a lot of datasets that are missing that metadata, so there should be a way for users to get around that still. Maybe a query parameter in the titiler-cmr endpoints?

Yes I agree. We should add a query parameter for src_crs that would allow an application to provide a CRS definition for datasets that do not have one attached, then raise an exception if the dataset is missing a CRS and src_crs is not supplied.

maxrjones commented 1 month ago

Here's the roughly equivalent rio_tiler code showing the inconsistent behavior between the Rasterio and Xarray readers:

import earthaccess
import rio_tiler
import xarray as xr
import matplotlib.pyplot as plt
from odc.geo.geom import Geometry
from shapely.geometry import box

# Download dataset
earthaccess.login()
variable = "analysed_sst"
results = earthaccess.search_data(
    concept_id="C2036881735-POCLOUD", count=1, temporal=("2024-09-20", "2024-09-20")
)
fp = earthaccess.download(results, "earthaccess_data")[0]

# Set up GeoJSON for reader
bbox = box(-99.524, 19.043, -78.647, 31.126)
gbox = Geometry(bbox, "EPSG:4326")
shape = gbox.geojson()

# Use Xarray reader
ds = xr.open_dataset(fp)
ds = ds.rio.write_crs("EPSG:4326")
r = rio_tiler.io.xarray.XarrayReader(ds[variable])
im = r.feature(shape)
plt.imshow(im.data.squeeze())
plt.colorbar()
plt.show()

# Use rasterio reader
r = rio_tiler.io.rasterio.Reader(f"NETCDF:{str(fp)}:analysed_sst")
im = r.feature(shape)
plt.imshow(im.data.squeeze())
plt.colorbar()
plt.show()

I'm not sure where mask_and_scale and masking gets handled with using the rasterio reader, which is the reason for the difference in values. I could open a PR tomorrow suggesting a change to the Xarray reader to return consistent array structure.

vincentsarago commented 1 month ago

I'm not sure where mask_and_scale and masking gets handled with using the rasterio reader, which is the reason for the difference in values. I could open a PR tomorrow suggesting a change to the Xarray reader to return consistent array structure.

@maxrjones can you try plt.imshow(im.data_as_image()) instead of plt.imshow(im.data.squeeze())

import earthaccess
import rio_tiler
import xarray as xr
import matplotlib.pyplot as plt
from geojson_pydantic import Polygon

# Download dataset
earthaccess.login()
variable = "analysed_sst"
results = earthaccess.search_data(
    concept_id="C2036881735-POCLOUD", count=1, temporal=("2024-09-20", "2024-09-20")
)
fp = earthaccess.download(results, "earthaccess_data")[0]

# Set up GeoJSON for reader
bbox = (-99.524, 19.043, -78.647, 31.126)
shape = Polygon.from_bounds(*bbox).model_dump(exclude_none=True)

# Use Xarray reader
print("xarray")
ds = xr.open_dataset(fp)
ds = ds.rio.write_crs("EPSG:4326")
r = rio_tiler.io.xarray.XarrayReader(ds[variable])
im = r.feature(shape)

plt.imshow(im.data_as_image())
plt.colorbar()
plt.show()

print("rasterio")
# Use rasterio reader
r = rio_tiler.io.rasterio.Reader(f"NETCDF:{str(fp)}:analysed_sst")
im = r.feature(shape)

plt.imshow(im.data_as_image())
plt.colorbar()
plt.show()

vincentsarago commented 1 month ago

Thank you both for digging on this issue 🙏

I think the automatic assignment of a CRS is a broader issue. I don't agree with assuming that it's "epsg:4326" if it's not in the metadata. I guess there are a lot of datasets that are missing that metadata, so there should be a way for users to get around that still. Maybe a query parameter in the titiler-cmr endpoints?

I totally agree 🙏

Yes I agree. We should add a query parameter for src_crs that would allow an application to provide a CRS definition for datasets that do not have one attached, then raise an exception if the dataset is missing a CRS and src_crs is not supplied.

I'm a bit worry about passing use WKT crs via query parameter, but I don't see better solution

vincentsarago commented 1 month ago

if we agree that not setting a CRS automatically in the XarrayReader, then might get issue when using titiler-cmr backend because we won't be able to easily forward the Error message up to the users 😬

vincentsarago commented 1 month ago

ok I think I may have a solution

import xarray

ds = xarray.open_dataset("20241012090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc", decode_coords="all")
da = ds["analysed_sst"]
print(da.rio.bounds())
print(da.rio.crs)
print(da.rio.transform())
print(da.rio._unordered_bounds())

(-179.99500549324037, -89.99499786365084, 180.0050000000763, 89.99499786365084)
None
| 0.01, 0.00,-180.00|
| 0.00, 0.01,-89.99|
| 0.00, 0.00, 1.00|
(-179.99500549324037, 89.99499786365084, 180.0050000000763, -89.99499786365084)

if you look at the tranform the y resolution is positive, while the bounds (form rio.bounds()) indicate that the transform Y resolution should be negative

rioxarray reorder the bounds, so if you look at the difference between da.rio._unordered_bounds() and da.rio.bounds() this might indicate that the transform should have a -Y resolution

hrodmn commented 1 month ago

This is another case where the vrt:// URI syntax shines. What if we could just use the rasterio backend for all of these NetCDF datasets? I don't know how to specify a vrt:// URI in a 4-D (lon, lat, time, variable) where you need to select a time slice AND a variable, but it might be possible (I haven't found an example dataset with that format yet).

This /vsicurl operation takes ~23 seconds on my laptop.

import earthaccess
import matplotlib.pyplot as plt
import rio_tiler

# query MUR SST from CMR
earthaccess.login()
results = earthaccess.search_data(
    concept_id="C1996881146-POCLOUD", count=1, temporal=("2024-10-22", "2024-10-22")
)

url = results[0].data_links()[0]

# Use rasterio reader to read a bbox
bbox = (-99.524, 19.043, -78.647, 31.126)
variable = "analysed_sst"

r = rio_tiler.io.rasterio.Reader(f"vrt:///vsicurl/{url}?sd_name={variable}")
im = r.part(bbox)
plt.imshow(im.data_as_image())
plt.show()

Does the XarrayReader have some performance advantage over the GDAL reader in this case?

hrodmn commented 1 month ago

I don't know how to specify a vrt:// URI in a 4-D (lon, lat, time, variable) where you need to select a time slice AND a variable, but it might be possible (I haven't found an example dataset with that format yet).

This example file has four dimensions: https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc

You can pick the variable with sd_name={variable} and a time slice with bands={i}.

gdalinfo "vrt:///vsicurl/https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc?sd_name=tos&bands=1"

maxrjones commented 1 month ago

I don't know how to specify a vrt:// URI in a 4-D (lon, lat, time, variable) where you need to select a time slice AND a variable, but it might be possible (I haven't found an example dataset with that format yet).

This example file has four dimensions: unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc

You can pick the variable with sd_name={variable} and a time slice with bands={i}.
gdalinfo "vrt:///vsicurl/https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc?sd_name=tos&bands=1"

How would this work with rio-tiler?

I tried:

import matplotlib.pyplot as plt
import rio_tiler

# Use rasterio reader to read a bbox
bbox = (-99.524, 19.043, -78.647, 31.126)
src = "vrt:///vsicurl/https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc?sd_name=tos&bands=1"

r = rio_tiler.io.rasterio.Reader(src)
im = r.part(bbox)
plt.imshow(im.data_as_image())
plt.show()

vincentsarago commented 1 month ago

You shouldn't need the /vsicurl/ prefix with rasterio

maxrjones commented 1 month ago

You shouldn't need the /vsicurl/ prefix with rasterio

Good to know, thanks! The bands= trick via VRT wasn't working for me, but it was possible to use indexes=1 with Reader.part(). Still, I'm not sure how you would go from an application/user specified time stamp to an index without a tool like xarray / rioxarray.

import matplotlib.pyplot as plt
import rio_tiler

# Use rasterio reader to read a bbox
bbox = (190, 19.043, 240, 31.126)
src = "https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc?sd_name=tos&bands=1"

r = rio_tiler.io.rasterio.Reader(src)
im = r.part(bbox, indexes=1)
plt.imshow(im.data_as_image())

@hrodmn FYI while that dataset has 4 dimensions, the tos variable only has three. concept_ID="G1991303139-POCLOUD" in an example of a dataset with 4 dimensional data variables (time: 1, Z: 50, latitude: 360, longitude: 720).