corteva / rioxarray

geospatial xarray extension powered by rasterio
https://corteva.github.io/rioxarray
Other
526 stars 83 forks source link

open_rasterio fails when NetCDF subdataset contains more than three dimensions #174

Open atedstone opened 4 years ago

atedstone commented 4 years ago

Code Sample

Example file: ftp://ftp.climato.be/fettweis/MARv3.11/Greenland/ERA_1958-2019-10km/daily_10km/MARv3.11-20km-daily-ERA-10km-2016.nc

import rioxarray as rxr
ds = rxr.open_rasterio('MARv3.11.2-6km-daily-ERA5-2016.nc')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-a16497ea4e6b> in <module>
----> 1 mar = rxr.open_rasterio('MARv3.11.2-6km-ERA5/MARv3.11.2-6km-daily-ERA5-2016.nc', mask_and_scale=False, parse_coordinates=False)

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/rioxarray/_io.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock, masked, mask_and_scale, variable, group, default_name, **open_kwargs)
    689             lock=lock,
    690             masked=masked,
--> 691             mask_and_scale=mask_and_scale,
    692         )
    693

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/rioxarray/_io.py in _load_subdatasets(riods, group, variable, parse_coordinates, chunks, cache, lock, masked, mask_and_scale)
    507             masked=masked,
    508             mask_and_scale=mask_and_scale,
--> 509             default_name=subdataset.split(":")[-1].lstrip("/").replace("/", "_"),
    510         )
    511         if shape not in dim_groups:

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/rioxarray/_io.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock, masked, mask_and_scale, variable, group, default_name, **open_kwargs)
    755         data = indexing.MemoryCachedArray(data)
    756
--> 757     result = DataArray(
    758         data=data, dims=(coord_name, "y", "x"), coords=coords, attrs=attrs, name=da_name
    759     )

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/xarray/core/dataarray.py in __init__(self, data, coords, dims, name, attrs, indexes, fastpath)
    342             data = _check_data_shape(data, coords, dims)
    343             data = as_compatible_data(data)
--> 344             coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
    345             variable = Variable(dims, data, attrs, fastpath=True)
    346             indexes = dict(

~/miniconda3/envs/geospatial/lib/python3.7/site-packages/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
    145                 "coordinate %s has dimensions %s, but these "
    146                 "are not a subset of the DataArray "
--> 147                 "dimensions %s" % (k, v.dims, dims)
    148             )
    149

ValueError: coordinate SECTOR1_1 has dimensions ('SECTOR1_1',), but these are not a subset of the DataArray dimensions ('TIME', 'y', 'x')

Problem description

open_rasterio fails to load subdatasets (DataArrays?) from NetCDFs with more than 1 additional dimension to X, Y. Expressed alternatively, the maximum number of dimensions is 3 (e.g. TIME, y, x). This failure halts all further loading of the NetCDF file, in effect rendering the NetCDF unopenable by open_rasterio.

Expected Output

With pure xarray:

data = xr.open_dataset('MARv3.11.2-6km-daily-ERA5-2016.nc')

<xarray.Dataset>
Dimensions:    (ATMLAY3_3: 1, SECTOR: 2, SECTOR1_1: 1, TIME: 366, X12_251: 240, Y20_465: 446)
Coordinates:
  * X12_251    (X12_251) float32 -756.00006 -750.00006 ... 672.00006 678.00006
  * Y20_465    (Y20_465) float32 -1188.0001 -1182.0001 ... 1476.0001 1482.0001
  * SECTOR     (SECTOR) float32 1.0 2.0
  * TIME       (TIME) datetime64[ns] 2016-01-01T12:00:00 ... 2016-12-31T12:00:00
  * SECTOR1_1  (SECTOR1_1) float32 1.0
  * ATMLAY3_3  (ATMLAY3_3) float32 0.99974793
Data variables:
    LON        (Y20_465, X12_251) float32 ...
    LAT        (Y20_465, X12_251) float32 ...
    SH         (Y20_465, X12_251) float32 ...
    SRF        (Y20_465, X12_251) float32 ...
    SLO        (Y20_465, X12_251) float32 ...
    CZ         (Y20_465, X12_251) float32 ...
    SAL        (Y20_465, X12_251) float32 ...
    VEG        (SECTOR, Y20_465, X12_251) float32 ...
    FRV        (SECTOR, Y20_465, X12_251) float32 ...
    AREA       (Y20_465, X12_251) float64 ...
    MSK        (Y20_465, X12_251) float32 ...
    DATE       (TIME) float32 ...
    cnst       float64 ...
    MM         (TIME) float32 ...
    DD         (TIME) float32 ...
    SMB        (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
    RU         (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
    SF         (TIME, Y20_465, X12_251) float32 ...
    RF         (TIME, Y20_465, X12_251) float32 ...
    ME         (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
    TT         (TIME, ATMLAY3_3, Y20_465, X12_251) float32 ...
    QQ         (TIME, ATMLAY3_3, Y20_465, X12_251) float32 ...
    UV         (TIME, ATMLAY3_3, Y20_465, X12_251) float32 ...
    SP         (TIME, Y20_465, X12_251) float32 ...
    SWD        (TIME, Y20_465, X12_251) float32 ...
    LWD        (TIME, Y20_465, X12_251) float32 ...
    SHF        (TIME, Y20_465, X12_251) float32 ...
    LHF        (TIME, Y20_465, X12_251) float32 ...
    AL2        (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
    ST2        (TIME, SECTOR1_1, Y20_465, X12_251) float32 ...
Attributes:
    history:      FERRET V7.43 (optimized)  9-Jun-20FERRET V7.43 (optimized) ...
    Conventions:  CF-1.6

Environment Information

rioxarray (0.1.0) deps:
  rasterio: 1.1.7
    xarray: 0.16.1
      GDAL: 3.1.3

Other python deps:
     scipy: 1.5.2
    pyproj: 2.6.1.post1

System:
    python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)  [GCC 7.3.0]
executable: /home/tedstona/miniconda3/envs/geospatial/bin/python
   machine: Linux-3.10.0-1127.el7.x86_64-x86_64-with-centos-7.8.2003-Core

Installation method

conda

Conda environment information (if you installed with conda):


Environment (conda list):

``` gdal 3.1.3 py37h518339e_0 conda-forge libgdal 3.1.3 h670eac6_0 conda-forge rasterio 1.1.7 py37ha3d844c_0 conda-forge rioxarray 0.1.0 py_0 conda-forge xarray 0.16.1 py_0 conda-forge ```


Details about conda and system ( conda info ):

``` $ conda info active environment : geospatial active env location : /home/tedstona/miniconda3/envs/geospatial shell level : 2 user config file : /home/tedstona/.condarc populated config files : /home/tedstona/.condarc conda version : 4.8.3 conda-build version : not installed python version : 3.7.3.final.0 virtual packages : __glibc=2.17 base environment : /home/tedstona/miniconda3 (writable) channel URLs : https://conda.anaconda.org/conda-forge/linux-64 https://conda.anaconda.org/conda-forge/noarch https://repo.anaconda.com/pkgs/main/linux-64 https://repo.anaconda.com/pkgs/main/noarch https://repo.anaconda.com/pkgs/r/linux-64 https://repo.anaconda.com/pkgs/r/noarch package cache : /home/tedstona/miniconda3/pkgs /home/tedstona/.conda/pkgs envs directories : /home/tedstona/miniconda3/envs /home/tedstona/.conda/envs platform : linux-64 user-agent : conda/4.8.3 requests/2.23.0 CPython/3.7.3 Linux/3.10.0-1127.el7.x86_64 centos/7.8.2003 glibc/2.17 UID:GID : 23002:23000 netrc file : None offline mode : False ```
snowman2 commented 4 years ago

This is currently not supported, but not against rioxarray adding support for it. The reason it currently does not exist is that it adds a bit of complexity when you have more than 3 dimensions: https://gdal.org/drivers/raster/netcdf.html#dimension

The dimensions inside the NetCDF file use the following rules: (Z,Y,X). If there are more than 3 dimensions, the driver will merge them into bands. For example if you have an 4 dimension arrays of the type (P, T, Y, X). The driver will multiply the last 2 dimensions (P*T). The driver will display the bands in the following order. It will first increment T and then P. Metadata will be displayed on each band with its corresponding T and P values.

Until support is added, I recommend opening your file with xarray.open_dataset. If this is something you would like to tackle, feel free to take a stab at it.

snowman2 commented 4 years ago

I think this would be simplified if rasterio added support for multidimensional arrays: https://gdal.org/drivers/raster/netcdf.html#multidimensional-api-support, however support for this hasn't been added yet and it doesn't appear to be on the roadmap.

atedstone commented 4 years ago

Thanks for your informative replies. I'll think about next steps - I'd forgotten the nuances of GDAL in only supporting a subset of netCDF functionality.

This driver is intended only for importing remote sensing and geospatial datasets in form of raster images. If you want explore all data contained in NetCDF file you should use another tools. https://gdal.org/drivers/raster/netcdf.html

The background here is that I want to use rioxarray's clip() functionality on this and other netCDF files. But perhaps rioxarray is not the best way to achieve this.

snowman2 commented 4 years ago

The background here is that I want to use rioxarray's clip() functionality on this and other netCDF files. But perhaps rioxarray is not the best way to achieve this.

There is some good news and some bad news here.

snowman2 commented 4 years ago

Looks like your x and y coordinates are in some kind of projected coordinate system:

>>> xds.X10_153
<xarray.DataArray 'X10_153' (X10_153: 144)>
array([-760., -750., -740., -730., -720., -710., -700., -690., -680., -670.,
       -660., -650., -640., -630., -620., -610., -600., -590., -580., -570.,
       -560., -550., -540., -530., -520., -510., -500., -490., -480., -470.,
       -460., -450., -440., -430., -420., -410., -400., -390., -380., -370.,
       -360., -350., -340., -330., -320., -310., -300., -290., -280., -270.,
       -260., -250., -240., -230., -220., -210., -200., -190., -180., -170.,
       -160., -150., -140., -130., -120., -110., -100.,  -90.,  -80.,  -70.,
        -60.,  -50.,  -40.,  -30.,  -20.,  -10.,    0.,   10.,   20.,   30.,
         40.,   50.,   60.,   70.,   80.,   90.,  100.,  110.,  120.,  130.,
        140.,  150.,  160.,  170.,  180.,  190.,  200.,  210.,  220.,  230.,
        240.,  250.,  260.,  270.,  280.,  290.,  300.,  310.,  320.,  330.,
        340.,  350.,  360.,  370.,  380.,  390.,  400.,  410.,  420.,  430.,
        440.,  450.,  460.,  470.,  480.,  490.,  500.,  510.,  520.,  530.,
        540.,  550.,  560.,  570.,  580.,  590.,  600.,  610.,  620.,  630.,
        640.,  650.,  660.,  670.], dtype=float32)
Coordinates:
  * X10_153  (X10_153) float32 -760.0 -750.0 -740.0 -730.0 ... 650.0 660.0 670.0
Attributes:
    units:          km
    long_name:      x
    point_spacing:  even
    axis:           X

In this scenario, if you know what the CRS is of the projected data, you could use rio.clip or rio.clip_box if you assign the CRS to the dataset using rio.write_crs first.

atedstone commented 4 years ago

Excellent. After some more digging, I've also seen your Stack Exchange answer which shows the use of write_crs in more detail. I can see about adding a page to the rasterio documentation which makes it clear that existing xarray Datasets can have a CRS added post-hoc.

atedstone commented 4 years ago

The model outputs that I've linked to in this issue do indeed have some peculiar properties, amongst which are insufficient attributes for most software to understand their georeferencing; units in kms; and X and Y names set based on clipping of the model domain in post-processing. So it's probably unreasonable to expect such files to ever be read in automatically by rioxarray.

snowman2 commented 2 years ago

Related: https://github.com/rasterio/rasterio/issues/1759

mdsumner commented 10 months ago

I think this is related, but in this case these "4D arrays" are degenerate in z and t, they are really 2D grids with a time stamp. There's a different error for the remote file vs local:

## remote 
import rioxarray
dsn = "NETCDF:/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202202/oisst-avhrr-v02r01.20220218.nc:sst"

rioxarray.open_rasterio(dsn)
KeyError: 'NETCDF_DIM_time_VALUES'

## local
import rioxarray
dsn = "NETCDF:oisst-avhrr-v02r01.20220218.nc:sst"
rioxarray.open_rasterio(dsn)

ValueError: coordinate zlev has dimensions ('zlev',), but these are not a subset of the DataArray dimensions ('time', 'y', 'x')

I'm certainly interested to have this case work in rioxarray and will explore it. Would it be acceptable to allow these? I've been down this path before (in commercial and open software) where the GDAL band metadata is used to re-store the dimensionality, but it's a mess and I don't think it's worth it given the xarray facilities and GDAL multidim that exist now.

I'm not interested in the multidimensional possibilities, I just want that obvious case where sequential bands or sequential files are lined up as "time", or perhaps "z". (If rioxarray won't work with them I'll probably make a copy as COGs and use that).

To show that GDAL sees these degenerate dims:

dataset = "/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202202/oisst-avhrr-v02r01.20220218.nc"
from osgeo import gdal
gdal.Open(dataset).GetMetadata("SUBDATASETS")

gdal.Open(dataset).GetMetadata("SUBDATASETS")["SUBDATASET_1_DESC"]
'[1x1x720x1440] sst (16-bit integer)'

rasterio works with them fine


from rasterio.windows import Window
import rasterio
dsn = "NETCDF:oisst-avhrr-v02r01.20220218.nc:sst"
ds = rasterio.open(dsn)
ds.read(window = Window(0, 0, 2, 5))

dsn = "NETCDF:/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202202/oisst-avhrr-v02r01.20220218.nc:sst"
ds = rasterio.open(dsn)
ds.read(window = Window(0, 0, 2, 5))
mdsumner commented 10 months ago

sure enough, so at line 1234 of rioxarray/rioxarray/_io.py I see

(1, 720, 1440)
dict_keys(['time', 'zlev', 'x', 'y'])
<xarray.IndexVariable 'zlev' (zlev: 1)>
array([0])
<xarray.IndexVariable 'time' (time: 1)>
array([16119])```

so, I think in this case it could unproblematically drop the zlev, but I can't see yet how to identify that it's the dim being left out. Will keep pursuing, I'm almost entirely unexperienced with python debugging but happy with progress so far 🙏

mdsumner commented 10 months ago

candidate fix here, at least for my issue: https://github.com/mdsumner/rioxarray/tree/fix-degen-dims

it works for the file locally, but there's still that error for the remote dsn which I'll continue with.

if reasonable I'll PR it - I think the linting is right, it will take me a while longer to get tests going. Thanks!

note that since GDAL 3.7 we can add this config to get the crs assumed under reasonable situations:

import rioxarray
import rasterio
dsn = "NETCDF:oisst-avhrr-v02r01.20220218.nc:sst"
with rasterio.Env(GDAL_NETCDF_ASSUME_LONGLAT="YES"):
    x = rioxarray.open_rasterio(dsn)

x.spatial_ref
<xarray.DataArray 'spatial_ref' ()>
array(0)
Coordinates:
    spatial_ref  int64 0
Attributes:
    crs_wkt:                      GEOGCS["WGS 84 (CRS84)",DATUM["WGS_1984",SP...
...

@atedstone is it possible to get your file listed above still? I'd like to explore

atedstone commented 10 months ago

@mdsumner The file I linked to above is no longer available, but this one is broadly equivalent:

ftp://ftp.climato.be/fettweis/MARv3.14/Greenland/ERA5-15km/MARv3.14.0-15km-daily-ERA5-2023.nc

snowman2 commented 10 months ago

I think this is related, but in this case these "4D arrays" are degenerate in z and t, they are really 2D grids with a time stamp.

This specific use case should be simple to support as it removed the complexity mentioned here: https://github.com/corteva/rioxarray/issues/174#issuecomment-713666696

However, it would need to throw errors if the extra dimensions have a size greater than 1 until the reading capability is updated to support that.

veenstrajelmer commented 7 months ago

I also encountered this issue. Since I have a very small dataset with 1D coordinates (but 4 dimensions), this simple reproducible example might help.

import rioxarray
import xarray as xr

file_nc = "cmems_so_2022-11-02.nc"
file_nc_reduced = file_nc.replace(".nc","_reduced.nc")

# create a dataset with the depth dimension reduced (only time/latitude/longitude remain)
ds = xr.open_dataset(file_nc)
ds_reduced = ds.max(dim="depth")
ds_reduced.to_netcdf(file_nc_reduced)

# this reduced dataset works
raster_red = rioxarray.open_rasterio(file_nc_reduced)

# the original dataset raises "ValueError: conflicting sizes for dimension 'time': length 50 on the data but length 1 on coordinate 'time'"
raster_full = rioxarray.open_rasterio(file_nc)

What might at least be useful is to improve the ValueError, since to me it was not clear that there was one dimension too much. It looked more like a mismatch in dimension orders. After trying transposing and swap_dims, I got the idea to reduce the depth dimension (with max, which is arbitrary in this case).

The ds looks like this:

<xarray.Dataset>
Dimensions:    (depth: 50, latitude: 11, longitude: 10, time: 1)
Coordinates:
  * depth      (depth) float32 0.494 1.541 2.646 ... 5.275e+03 5.728e+03
  * latitude   (latitude) float32 -1.583 -1.5 -1.417 ... -0.9167 -0.8333 -0.75
  * longitude  (longitude) float32 116.4 116.5 116.6 116.7 ... 117.0 117.1 117.2
  * time       (time) datetime64[ns] 2022-11-02
Data variables:
    so         (time, depth, latitude, longitude) float32 nan nan ... nan nan

The ds_reduced looks like this:

<xarray.Dataset>
Dimensions:    (latitude: 11, longitude: 10, time: 1)
Coordinates:
  * latitude   (latitude) float32 -1.583 -1.5 -1.417 ... -0.9167 -0.8333 -0.75
  * longitude  (longitude) float32 116.4 116.5 116.6 116.7 ... 117.0 117.1 117.2
  * time       (time) datetime64[ns] 2022-11-02
Data variables:
    so         (time, latitude, longitude) float32 nan nan 32.86 ... nan nan nan

Example dataset (extension changed to enable uploading): cmems_so_2022-11-02.nc.txt