Deltares / imod-python

🐍🧰 Make massive MODFLOW models
https://deltares.github.io/imod-python/
MIT License

zonal_aggregate_raster and zonal_aggregate_polygon may give unexpected/unusable results #654

Open Manangka opened 1 year ago

Manangka commented 1 year ago

In GitLab by @Huite on Nov 18, 2023, 17:38

I got this report from Jacco.

If the resolution is small enough in these methods, it will result in duplicate entries in the output:

imod.prepare.zonal_aggregate_raster(peilvlakken_shp, 'Id', da_1, 1.0, 'mean')
   Id       area  aggregated
0   1  1066795.0         1.0
1   2   279899.0         1.0
2   3  1162011.0         1.0
3   4  1491911.0         1.0
4   5   258044.0         1.0
5   6   419897.0         1.0
6   7    89131.0         1.0
7  11   593048.0         1.0

imod.prepare.zonal_aggregate_raster(peilvlakken_shp, 'Id', da_1, 0.5, 'mean')
   Id        area  aggregated
0   2    94655.00         1.0
1  11   294333.25         1.0
0   1  1066068.50         1.0
1   2   183242.75         1.0
2   3  1162725.75         1.0
3   4  1489183.75         1.0
4   5   257467.75         1.0
5   6   420808.75         1.0
6   7    89104.25         1.0
7  11   214502.00         1.0
0  11    59612.75         1.0
0  11    23599.50         1.0

The reason is relatively simple: internally, the data is chunked and processed out-of-core using dask.

    # Split the raster extent into chunks and evaluate each chunk lazily.
    raster_chunks, _, _ = _create_chunks(raster, resolution, chunksize)
    collection = [
        dask.delayed(_zonal_aggregate_raster)(path, column, resolution, chunk, method)
        for chunk in raster_chunks
    ]
    result = dask.compute(collection)[0]
    # The per-chunk dataframes are concatenated as-is, so a zone that spans
    # multiple chunks produces duplicate rows.
    return pd.concat(result)

Obviously, if a polygon spans more than one chunk, its ID will appear more than once in the result, which is then simply concatenated.

This isn't too big a deal for methods such as sum or mean, because all the information needed to do another reduction afterwards is still available. But for something like mode it won't work, since mode requires ALL values at once.
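The distinction can be shown in a few lines (a standalone illustration, not code from imod): a decomposable statistic like sum can be recombined from per-chunk results, while the per-chunk modes alone cannot reliably reproduce the global mode.

```python
from collections import Counter

import numpy as np

# Cell values of a single polygon, split across two hypothetical chunks.
chunk_a = np.array([1, 1, 2])
chunk_b = np.array([2, 2, 3])
values = np.concatenate([chunk_a, chunk_b])

def mode(arr):
    """Most common value (first encountered wins ties)."""
    return Counter(arr.tolist()).most_common(1)[0][0]

# Sum is decomposable: per-chunk sums can simply be summed again.
assert chunk_a.sum() + chunk_b.sum() == values.sum()

# Mode is not: the global mode is 2, but the per-chunk modes are 1 and 2,
# and no reduction of just those two numbers is guaranteed to recover 2.
assert mode(values) == 2
assert (mode(chunk_a), mode(chunk_b)) == (1, 2)
```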

Manangka commented 1 year ago

In GitLab by @Huite on Nov 18, 2023, 17:47

The reason to use dask here is that, when the spatial extent is large and the resolution is small, this generates quite large (rasterio/GDAL) arrays in memory. So doing everything at once is not a suitable solution.

Using dask dataframes may be a solution, but this code currently fails for me:

import dask.dataframe as dd
import pandas as pd

s = pd.Series([-1, 0, 0, 0, 1, 1])
print(s.median())  # 0.0
print(dd.from_pandas(s, 2).quantile(0.5).compute())

with:

AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. 
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
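As an aside (my illustration, not from the issue): `quantile` needs dask's dataframe machinery in the first place because quantiles are not decomposable either; a naive median of per-chunk medians gives a different answer than the true median.

```python
import numpy as np

s = np.array([-1, 0, 0, 0, 1, 1])

# The true median over all values.
assert np.median(s) == 0.0

# Median of per-chunk medians after a two-way split: a different answer.
chunk_medians = [np.median(s[:3]), np.median(s[3:])]  # [0.0, 1.0]
assert np.median(chunk_medians) == 0.5
```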
Manangka commented 1 year ago

In GitLab by @Huite on Nov 18, 2023, 17:52

Given the complexity/frailty here and the actual scope (I expect almost everybody to use only sum, mean, etc.), I think we should support chunking only for methods whose per-chunk results can be recombined (sum, mean, etc.) and special-case them. For non-decomposable methods such as mode, we try it in a single chunk.
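A minimal sketch of that dispatch (all names are hypothetical; `process_chunk` stands in for the delayed `_zonal_aggregate_raster` call, and the mean recombination assumes per-chunk means can be weighted by the area they cover):

```python
import pandas as pd

# Hypothetical set of methods whose per-chunk results can be re-reduced.
DECOMPOSABLE = {"sum", "mean", "min", "max"}

def zonal_aggregate_sketch(process_chunk, chunks, full_extent, method, column):
    """Chunk only for decomposable methods; otherwise use a single pass."""
    if method not in DECOMPOSABLE:
        # Non-decomposable (e.g. mode) needs all values at once.
        return process_chunk(full_extent)
    parts = pd.concat([process_chunk(chunk) for chunk in chunks])
    if method == "mean":
        # Recombine per-chunk means, weighted by the area each row covers.
        parts = parts.assign(weighted=parts["aggregated"] * parts["area"])
        out = parts.groupby(column, as_index=False)[["area", "weighted"]].sum()
        out["aggregated"] = out["weighted"] / out["area"]
        return out[[column, "area", "aggregated"]]
    # sum/min/max: areas always sum; values re-reduce with the same method.
    return parts.groupby(column, as_index=False).agg(
        {"area": "sum", "aggregated": method}
    )
```

Note that the area-weighted mean only reproduces the single-pass answer exactly when per-cell areas are uniform, which is exactly why each method needs its own special case.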

If the dask dataframe methods mature further, we could move to those in due time.