corteva / rioxarray

geospatial xarray extension powered by rasterio
https://corteva.github.io/rioxarray
Other
507 stars 80 forks source link

Y-axis flipped when reading data with Xarray #650

Open RY4GIT opened 1 year ago

RY4GIT commented 1 year ago

Update

Update on March 15, 2023 based on discussion here https://github.com/pydata/xarray/issues/7621 First posted on March 13, 2023

Description

When reading data with Xarray using a rasterio engine, the Y axis is flipped. This happens with a specific dataset, and the summary is as follows.

Data\Engine NetCDF4 Rasterio
1. NSIDC output ❌ (Y axis is flipped)
2. NASA download

i.e., The issue happens with SMAP L4 data, pre-processed on the NSIDC server, read with rasterio engine. The issue does not happen with NetCDF4 engine. It did not happen with the one directly downloaded from NASA Earth Data.

When reading data with Xarray using both the netcdf4 engine and the rasterio engine, the Y axis is flipped. However, when reading the same data with the netcdf4 package, the issue doesn't seem to be occurring.

It's unclear whether the issue is with Xarray or the data itself.

image

Details

I have created a Jupyter notebook with the details of the issue, including the code used to analyze the data, which is available here

Data

The data can be downloaded from the links provided by NASA and NSIDC (the links are specified in the notebook), or from my Google Drive.

Environment Information

Installation method

Please let me know if you need any further information or have any suggestions on resolving this issue.

I am cross-posting this Xarray issues.

Please let me know if you need any further information or have any suggestions on resolving this issue. Thank you!

dcherian commented 1 year ago

Did you try the rioxarray engine instead of rasterio?

RY4GIT commented 1 year ago

@dcherian Sorry I realized I should have posted this in rasterio repo. I was confused rasterio and rioxarray! I will try them out and let you know soon, but you may close the issue if you think this doesn't fit here.

dcherian commented 1 year ago

No! This is the place. Just try engine="rioxarray". engine="rasterio" is bundled with xarray and is about be removed. rioxarray is a more full-featured version of the same funcitonality

snowman2 commented 1 year ago

Actually, engine="rasterio" comes from rioxarray ref. It calls rioxarray.open_rasterio.

dcherian commented 1 year ago

Oh that's confusing! In that case, i guess it's calling rioxarray since its installed.

RY4GIT commented 1 year ago

So I tried open_rasterio with NSIDC output, but the results are the same. The issue did not happen with data directly downloaded from NASA Earth data (see updates on https://github.com/pydata/xarray/issues/7621), so I am not sure it's an issue with rioxarray or the NSIDC output.

ds_NSIDC_rioxarray = rioxarray.open_rasterio(os.path.join(input_path, fn_NSIDC_output))
ds_NSIDC_rioxarray = ds_NSIDC_rioxarray.set_coords(["cell_lat", "cell_lon"])
ds_NSIDC_rioxarray

image

ds_NSIDC_rioxarray.precipitation_total_surface_flux[0].plot(y="cell_lat", x="cell_lon", vmin=-0.001) image

snowman2 commented 1 year ago

If you do this:

rds = rioxarray.open_rasterio(os.path.join(input_path, fn_NSIDC_output), group="Geophysical_Data")

or this:

rds = xarray.open_dataset(os.path.join(input_path, fn_NSIDC_output), group="Geophysical_Data", engine="rasterio")

The data plots correctly with rasterio.

EDIT: This does not fix the issue. The plot has the correct orientation. However, the data is missing coordinates.

snowman2 commented 1 year ago

If you do this, the lat/lon data looks correct:

ds_NSIDC_root  = xarray.open_dataset(data_path, variable=["cell_lon", "cell_lat"], engine='rasterio')
ds_NSIDC_root.cell_lat.plot()
snowman2 commented 1 year ago

This appears to be a rasterio/GDAL issue:

import rasterio
import netCDF4
import numpy

with netCDF4.Dataset(data_path) as nfh:
    nc_data = nfh['Geophysical_Data']['precipitation_total_surface_flux'][:]
with rasterio.open('netcdf:SMAP_L4_SM_gph_20150331T013000_Vv7032_001_HEGOUT.nc:/Geophysical_Data/precipitation_total_surface_flux') as rfh:
    rio_data = rfh.read(1, masked=True)
numpy.array_equal(nc_data, rio_data)
False
numpy.array_equal(nc_data, numpy.flip(rio_data, 0))
True

This is not something that can be resolved in rioxarray.

andypbarrett commented 5 months ago

Hi,

I'm just found this issue while searching for a related problem. Thank you for documenting it so well.

I'm actually working on a tutorial for NSIDC to demonstrate working with SMAP data. So this is quite timely.

I do not think it is a software/tool problem but how you are trying to read the data. I'm using the original HDF5 file from NSIDC but the result is the same. I'm guessing you used a reprojection service to get the dataset you posted here.

TLDR: I think the problem is that rasterio does not interpret the coordinate information and treats the data as an image - which is what rasterio is for. If you want the features offered by rioxarray you can use:

ds_rioxarray  = xr.merge([
    xr.open_dataset(filepath, decode_coords="all"),
    xr.open_dataset(filepath, group='Geophysical_Data', decode_coords="all")['precipitation_total_surface_flux'],
])

I'm using the following packages

from pathlib import Path

import xarray as xr
import rioxarray

print(xr.__version__)
print(rioxarray.__version__)
2024.2.0
0.15.0
filepath = Path("smap_data/SMAP_L4_SM_gph_20150331T013000_Vv7032_001.h5")

The SMAP datasets you are using are either HDF5 (the original file from NSIDC) or NetCDF4 and follow most of the CF-Conventions, at least for the Level-4 data. So you can use

engine = "netcdf4"
ds = xr.merge([
    xr.open_dataset(filepath, engine=engine),
    xr.open_dataset(filepath, group='Geophysical_Data', engine=engine)['precipitation_total_surface_flux'],
    ])
ds

Or

engine="h5netcdf"
ds = xr.merge([
    xr.open_dataset(filepath, engine=engine, phony_dims="sort"),
    xr.open_dataset(filepath, group='Geophysical_Data', engine=engine, 
                               phony_dims="sort")['precipitation_total_surface_flux'],
    ])
ds

These both return an xarray dataset like

<xarray.Dataset> Size: 175MB
Dimensions:                           (phony_dim_2: 1, y: 1624, x: 3856)
Coordinates:
  * x                                 (x) float64 31kB -1.736e+07 ... 1.736e+07
  * y                                 (y) float64 13kB 7.31e+06 ... -7.31e+06
Dimensions without coordinates: phony_dim_2
Data variables:
    EASE2_global_projection           (phony_dim_2) |S1 1B ...
    cell_column                       (y, x) float64 50MB ...
    cell_lat                          (y, x) float32 25MB ...
    cell_lon                          (y, x) float32 25MB ...
    cell_row                          (y, x) float64 50MB ...
    time                              (phony_dim_2) datetime64[ns] 8B ...
    precipitation_total_surface_flux  (y, x) float32 25MB ...
Attributes:
    Comment:      HDF-5
    Contact:      http://gmao.gsfc.nasa.gov/
    Conventions:  CF
    Filename:     /discover/nobackup/projects/gmao/smap/SMAP_L4/L4_SM/Vv7032/...
    History:      File written by ldas2daac.x
    Institution:  NASA Global Modeling and Assimilation Office
    References:   see SMAP L4_SM Product Specification Documentation
    Source:       v17.11.1
    Title:        SMAP L4_SM Geophysical (GPH) Data Granule

Note that the x and y coordinates are projected-coordinates for the EASE-Grid v2.0 Global CRS. In your example they are geographic coordinates.

What I think is happening when you use either the rasterio backend or rioxarray is that the HDF5 GDAL driver (which is what rasterio is using under-the-hood) is not able to parse the coordinate information.

engine = "rasterio"
ds_rasterio = xr.open_dataset(filepath, engine=engine)

This returns a bunch of warnings about not being able to determine georeferencing

[/home/apbarret/mambaforge/envs/nsidc-tutorials/lib/python3.9/site-packages/rioxarray/_io.py:1132](http://localhost:8890/home/apbarret/mambaforge/envs/nsidc-tutorials/lib/python3.9/site-packages/rioxarray/_io.py#line=1131): NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  warnings.warn(str(rio_warning.message), type(rio_warning.message))  # type: ignore

Printing x and y shows that the data are not georeferenced because the grid cell column and row indices - image coordinates - are returned not the projected coordinates.

print(ds_rasterio.x, rasterio.y)
<xarray.DataArray 'x' (x: 3856)> Size: 31kB
array([5.0000e-01, 1.5000e+00, 2.5000e+00, ..., 3.8535e+03, 3.8545e+03,
       3.8555e+03])
Coordinates:
  * x                        (x) float64 31kB 0.5 1.5 ... 3.854e+03 3.856e+03
    EASE2_global_projection  int64 8B ... <xarray.DataArray 'y' (y: 1624)> Size: 13kB
array([5.0000e-01, 1.5000e+00, 2.5000e+00, ..., 1.6215e+03, 1.6225e+03,
       1.6235e+03])
Coordinates:
  * y                        (y) float64 13kB 0.5 1.5 ... 1.622e+03 1.624e+03
    EASE2_global_projection  int64 8B ...

I suspect that rasterio is treating the data as an image, which by convention have the origin of image coordinates at the upper-left corner of the upper-left pixel. So pixel[0.5, 0.5] is the upper-left-most pixel center. However, when you plot the data with the xarray.DataArray.plot method, the coordinate sytem that pcolormesh (the default plotting method) uses is a normal cartesian coordinate system with the origin at the lower left, so the image appears upside down.

Plotting cell_lat shows that this is the case, southern hemisphere grid-cells are plotted at the top of the image.

ds_rasterio.cell_lat.plot()

smap_h5_cell_lat

Using ds_rasterio.cell_lat.squeeze().plot.imshow(origin="upper") to plot cell lat, which sets the origin to be the upper-left cell, gives the expected result: postive latitudes at the top of the image.

If you want to take advantage of the accessors and methods provided by rioxarray then you need to import rioxarray and set decode_coords="all". _Note: this throws a warning because there is no coordinate information in the Geophysical_Data group and xarray will only look in the group specified. IMO I don't think nested groups are necessary for these kinds of high-level products. Merging the root and precipitation datasets retains the georeferencing from the first call to open_dataset._

ds_rioxarray  = xr.merge([
    xr.open_dataset(filepath, decode_coords="all"),
    xr.open_dataset(filepath, group='Geophysical_Data', decode_coords="all")['precipitation_total_surface_flux'],
])

print(ds_rioxarray)
<xarray.Dataset> Size: 175MB
Dimensions:                           (phony_dim_2: 1, y: 1624, x: 3856)
Coordinates:
    EASE2_global_projection           (phony_dim_2) |S1 1B ...
  * x                                 (x) float64 31kB -1.736e+07 ... 1.736e+07
  * y                                 (y) float64 13kB 7.31e+06 ... -7.31e+06
Dimensions without coordinates: phony_dim_2
Data variables:
    cell_column                       (y, x) float64 50MB ...
    cell_lat                          (y, x) float32 25MB ...
    cell_lon                          (y, x) float32 25MB ...
    cell_row                          (y, x) float64 50MB ...
    time                              (phony_dim_2) datetime64[ns] 8B ...
    precipitation_total_surface_flux  (y, x) float32 25MB ...
Attributes:
    Source:       v17.11.1
    Institution:  NASA Global Modeling and Assimilation Office
    History:      File written by ldas2daac.x
    Comment:      HDF-5
    Filename:     /discover/nobackup/projects/gmao/smap/SMAP_L4/L4_SM/Vv7032/...
    Title:        SMAP L4_SM Geophysical (GPH) Data Granule
    Conventions:  CF
    References:   see SMAP L4_SM Product Specification Documentation
    Contact:      http://gmao.gsfc.nasa.gov/
[/tmp/ipykernel_230171/3642432009.py:3](http://localhost:8890/tmp/ipykernel_230171/3642432009.py#line=2): UserWarning: Variable(s) referenced in grid_mapping not in variables: ['EASE2_global_projection']
  xr.open_dataset(filepath, group='Geophysical_Data', decode_coords="all")['precipitation_total_surface_flux'],
ds_rioxarray.precipitation_total_surface_flux.plot(vmax=0.001)

smap_h5_precipitation