Manangka opened this issue 1 year ago
In GitLab by @Huite on Nov 18, 2023, 17:47
The reason to use dask here is that if the spatial extent is large and the resolution is small, this will generate quite large (rasterio/GDAL) arrays in memory, so doing everything at once is not a suitable solution.
Using dask dataframes may be a solution, but this code currently fails for me:
```python
import pandas as pd
import dask.dataframe as dd

s = pd.Series([-1, 0, 0, 0, 1, 1])
print(s.median())  # 0.0

# The same reduction through a dask dataframe raises instead:
print(dd.from_pandas(s, npartitions=2).quantile(0.5).compute())
```
with:
```
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```
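(The error suggests an older dask release being run against NumPy >= 1.24, where the deprecated `np.object` alias was removed; upgrading dask would likely let the snippet run.) For comparison, a minimal sketch of the single-chunk fallback discussed below, assuming the values fit in memory and a dask/NumPy combination that imports cleanly:

```python
import pandas as pd
import dask.dataframe as dd

s = pd.Series([-1, 0, 0, 0, 1, 1])
ds = dd.from_pandas(s, npartitions=2)

# Fallback for non-commutative reductions: materialize all values into a
# single in-memory pandas object, then reduce there with plain pandas.
print(ds.compute().median())  # 0.0, matching the plain pandas result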
In GitLab by @Huite on Nov 18, 2023, 17:52
Given the complexity/fragility here and the actual scope (I expected almost everybody to use only sum, mean, etc.), I think we should support chunking only for commutative methods and special-case them. For non-commutative methods, we try it in a single chunk.
Once the dask dataframe methods are more mature, we could move to those in due time.
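A rough sketch of what that special-casing could look like, in plain pandas; the names `CHUNKWISE_SAFE` and `reduce_per_feature` and the `id`/`value` columns are hypothetical illustrations, not existing API:

```python
import pandas as pd

# Hypothetical sketch: reductions whose per-chunk results can be combined
# again afterwards are processed chunk by chunk; anything else falls back
# to a single chunk so that all values are available at once.
# mean and count would need a slightly different combine step
# (sum the partial sums and counts), so they are left out here.
CHUNKWISE_SAFE = {"sum", "min", "max"}


def reduce_per_feature(chunks, method):
    """Reduce per-chunk tables with columns 'id' and 'value' by feature id."""
    if method in CHUNKWISE_SAFE:
        # Reduce each chunk separately, then combine the partial results.
        partials = [c.groupby("id")["value"].agg(method) for c in chunks]
        return pd.concat(partials).groupby(level=0).agg(method)
    # Non-commutative methods (median, quantile, ...): gather all values
    # into a single chunk first, then reduce in one go.
    return pd.concat(chunks).groupby("id")["value"].agg(method)
```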
In GitLab by @Huite on Nov 18, 2023, 17:38
I got this report from Jacco.
If the resolution is small enough, these methods will produce duplicate entries in the output:
The reason is relatively simple: internally, the data is chunked and processed out-of-core using dask. Obviously, if a polygon is present in more than one chunk, its ID will appear more than once in the result, which is then just concatenated. This isn't too big a deal for methods such as `sum` or `mean`, because all the information is available to do another reduction afterwards. But for something like `mode`, it won't work, since that requires ALL values.
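To make that distinction concrete, a small illustration with plain pandas and made-up numbers (the `id`/`value` layout is hypothetical, not the actual internal format):

```python
import pandas as pd

# A polygon with id=1 straddles two chunks; each chunk is reduced separately.
chunk_a = pd.DataFrame({"id": [1, 1, 2], "value": [2, 2, 5]})
chunk_b = pd.DataFrame({"id": [1, 1, 1], "value": [3, 3, 3]})

# sum: the per-chunk partial sums can simply be summed again afterwards.
partial_sums = pd.concat(
    [c.groupby("id")["value"].sum() for c in (chunk_a, chunk_b)]
)
print(partial_sums.groupby(level=0).sum())  # id=1 -> 13, id=2 -> 5: correct

# mean works the same way, as long as the partial sums and counts are both
# carried along to the second reduction.

# mode does not: the per-chunk modes for id=1 are 2 (chunk_a) and 3 (chunk_b),
# but the true mode over all five values is 3. The per-chunk modes alone do
# not contain enough information to recover it; all values are needed at once.
```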