corteva / rioxarray

geospatial xarray extension powered by rasterio
https://corteva.github.io/rioxarray

rio.reproject_match adds unnecessary/unused _FillValue attribute. #570

Closed bertcoerver closed 1 year ago

bertcoerver commented 2 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
import xarray as xr
import rioxarray

# Create some dataset to match to.
example_ds = xr.Dataset(
    None,
    coords={
        "time": pd.date_range("2022-02-01", "2022-02-11", periods=10),
        "y": np.linspace(30, 32, 100),
        "x": np.linspace(70, 75, 210),
    },
)
example_ds = example_ds.rio.write_crs(4326)

# Create a dataset to match with `example_ds`.
data = np.random.random((10, 50, 60))
data[:, 20:30, 20:30] = np.nan
data_ds = xr.Dataset(
    {"my_var": (["time", "y", "x"], data)},
    coords={
        "time": pd.date_range("2022-02-01", "2022-02-11", periods=10),
        "y": np.linspace(29.8, 32.2, 50),
        "x": np.linspace(69.8, 75.2, 60),
    },
)
data_ds = data_ds.rio.write_crs(4326)

# Do `rio.reproject_match`.
out = data_ds.rio.reproject_match(example_ds)

print("`my_var` _FillValue attribute: ", out["my_var"]._FillValue)
print("Max value in da:               ", out["my_var"].max().values)
print("There are NaNs:                ", out["my_var"].isnull().sum().values, "\n")
print(out)

`my_var` _FillValue attribute:  1.7976931348623157e+308
Max value in da:                0.9999459127985011
There are NaNs:                 9120 

<xarray.Dataset>
Dimensions:      (x: 210, y: 100, time: 10)
Coordinates:
  * x            (x) float64 70.0 70.02 70.05 70.07 ... 74.93 74.95 74.98 75.0
  * y            (y) float64 30.0 30.02 30.04 30.06 ... 31.94 31.96 31.98 32.0
  * time         (time) datetime64[ns] 2022-02-01 ... 2022-02-11
    spatial_ref  int64 0
Data variables:
    my_var       (time, y, x) float64 0.03746 0.03746 0.3895 ... 0.4467 0.4467

Problem description

When matching one dataset to another with rio.reproject_match, every variable in the output dataset gets a _FillValue attribute that is (1) very large, which causes problems when saving (OverflowError: Python int too large to convert to C long), (2) unnecessary, since missing values in each variable's array are already np.nan, and (3) unused, since the value never appears in the array.

I can fix it by removing the attribute manually, but it would be more convenient if it weren't added at all. I hadn't noticed this before, so perhaps it's related to a change in the latest version (0.12.0)?
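
For reference, the manual workaround mentioned above could look roughly like the sketch below. The small `out` dataset is a made-up stand-in for the result of `rio.reproject_match`, so that the snippet runs without rioxarray; only the loop at the end is the actual workaround.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the `out` dataset from the sample above: the data
# holds NaNs, yet the variable carries a float64-max `_FillValue` attribute.
out = xr.Dataset(
    {
        "my_var": (
            ["y", "x"],
            np.array([[0.1, np.nan], [0.3, 0.4]]),
            {"_FillValue": np.finfo(np.float64).max},
        )
    }
)

# Drop the attribute from every data variable. The `None` default keeps the
# loop safe for variables that never had the attribute.
for name in out.data_vars:
    out[name].attrs.pop("_FillValue", None)

print(out["my_var"].attrs)
```

The data itself is untouched; only the attribute is removed, so the NaNs stay where they were.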

Environment Information

rioxarray (0.12.0) deps:
  rasterio: 1.3.2
    xarray: 2022.6.0
      GDAL: 3.5.1
      GEOS: 3.11.0
      PROJ: 9.0.1
 PROJ DATA: /Users/hmcoerver/opt/miniconda3/envs/pywapor/share/proj
 GDAL DATA: /Users/hmcoerver/opt/miniconda3/envs/pywapor/share/gdal

Other python deps:
     scipy: 1.9.1
    pyproj: 3.3.1

System:
    python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
executable: /Users/hmcoerver/opt/miniconda3/envs/pywapor/bin/python
   machine: macOS-12.5.1-arm64-arm-64bit

Installation method

Conda environment information (if you installed with conda):


Environment (conda list):

```
gdal       3.5.1     py310hc67b115_5  conda-forge
libgdal    3.5.1     he1a18a7_5       conda-forge
rasterio   1.3.2     py310ha36aacf_0  conda-forge
rioxarray  0.12.0    pyhd8ed1ab_0     conda-forge
xarray     2022.6.0  pyhd8ed1ab_1     conda-forge
```


Details about conda and system ( conda info ):

```
     active environment : pywapor
    active env location : /Users/hmcoerver/opt/miniconda3/envs/pywapor
            shell level : 2
       user config file : /Users/hmcoerver/.condarc
 populated config files :
          conda version : 4.13.0
    conda-build version : 3.21.9
         python version : 3.9.12.final.0
       virtual packages : __osx=12.5.1=0
                          __unix=0=0
                          __archspec=1=arm64
       base environment : /Users/hmcoerver/opt/miniconda3 (writable)
      conda av data dir : /Users/hmcoerver/opt/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/osx-arm64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-arm64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/hmcoerver/opt/miniconda3/pkgs
                          /Users/hmcoerver/.conda/pkgs
       envs directories : /Users/hmcoerver/opt/miniconda3/envs
                          /Users/hmcoerver/.conda/envs
               platform : osx-arm64
             user-agent : conda/4.13.0 requests/2.28.1 CPython/3.9.12 Darwin/21.6.0 OSX/12.5.1
                UID:GID : 501:20
             netrc file : None
           offline mode : False
```

snowman2 commented 2 years ago

The value for the nodata is what is used to fill in the regions without data. If you don't provide one, rioxarray will use the limits of the datatype you are using. If you want to use nan for your fill value, that is valid. For more details, see: https://corteva.github.io/rioxarray/stable/getting_started/nodata_management.html
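
To make the "limits of the datatype" remark concrete: the value reported in the issue, 1.7976931348623157e+308, is the largest finite float64. The NumPy-only sketch below shows that, and checks that an array built like the one in the sample never contains it. It does not call rioxarray, so it says nothing about how rioxarray picks the value internally.

```python
import numpy as np

# Largest finite float64; matches the `_FillValue` reported in the issue.
float64_max = np.finfo(np.float64).max
print(float64_max)  # 1.7976931348623157e+308

# Same construction as `data` in the sample: random values in [0, 1) with a
# block of NaNs marking the missing region.
data = np.random.random((10, 50, 60))
data[:, 20:30, 20:30] = np.nan

# The missing cells are NaN, and no cell equals the float64 maximum.
print(int(np.isnan(data).sum()))
print(bool(np.any(data == float64_max)))
```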

snowman2 commented 2 years ago

data_ds.rio.write_nodata(numpy.nan, inplace=True)

bertcoerver commented 2 years ago

> The value for the nodata is what is used to fill in the regions without data. If you don't provide one, rioxarray will use the limits of the datatype you are using. If you want to use nan for your fill value, that is valid. For more details, see: https://corteva.github.io/rioxarray/stable/getting_started/nodata_management.html

Yes, but the regions without data in my dataset are filled with np.nan, not with that value (1.7976931348623157e+308), so the attribute is incorrect, right?

That attribute causes problems when saving to disk with a _FillValue provided in the encoding. As I understand it, when xarray opens data it replaces pixels equal to _FillValue with np.nan and removes the attribute, unless mask_and_scale=False is passed to xr.open_dataset. In that case the data actually contains the _FillValue (such as 1.7976931348623157e+308), and the attribute is present to signal that those values indicate missing data.
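
The decoding behaviour described here can be tried in memory. The sketch below uses `xr.decode_cf` as a stand-in for `xr.open_dataset` (an assumption made to avoid writing a file; both accept a `mask_and_scale` argument), and the fill value of -9999.0 is made up for illustration.

```python
import numpy as np
import xarray as xr

# "Raw" dataset as it might look on disk: the fill value is physically present
# in the data and declared through the `_FillValue` attribute.
raw = xr.Dataset(
    {"my_var": (["x"], np.array([1.0, -9999.0, 3.0]), {"_FillValue": -9999.0})}
)

# With masking on, the fill value becomes NaN and `_FillValue` moves from
# `attrs` to `encoding`.
decoded = xr.decode_cf(raw, mask_and_scale=True)
print(decoded["my_var"].values)
print(decoded["my_var"].attrs)
print(decoded["my_var"].encoding.get("_FillValue"))

# With masking off, the data and the attribute are left as they were.
undecoded = xr.decode_cf(raw, mask_and_scale=False)
print(undecoded["my_var"].values)
print(undecoded["my_var"].attrs)
```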

snowman2 commented 2 years ago

If you do:

data_ds.rio.write_nodata(numpy.nan, inplace=True)

Then rioxarray will use numpy.nan as your fill value. If you don't tell it to use numpy.nan and no other fill value is set in the encoded nodata, rioxarray will choose a nodata value based on the dtype.

More details here: https://corteva.github.io/rioxarray/stable/getting_started/nodata_management.html

bertcoerver commented 1 year ago

ok, thanks for the help.