Different results for numpy and dask arrays

sfinkens commented 3 years ago

Describe the bug The textory.statistics.variogram method yields different results for numpy and dask arrays.

To Reproduce

In [1]: from textory.statistics import variogram                  
In [2]: import dask.array as da

In [8]: arr = da.arange(100).reshape((10,10))

In [9]: variogram(arr, lag=4).compute()
Out[9]: 401.475

In [10]: variogram(arr.compute(), lag=4)
Out[10]: 251.49

Expected behavior The results to be identical :)

Environment Info:

OS: Linux
Textory Version: 0.2.7b0
Dask Version: 2021.08.1

BENR0 commented 3 years ago

@sfinkens thanks for reporting this. To a certain degree I know about this bug. It is a little sloppy and I wanted to document some limitations of how some calculations are implemented more thoroughly for quite some time.

That said, in this case the culprit is how border values are treated in map_overlap in the Dask case. Due to the fact that you use arange in the example the difference between the two calculations becomes very obvious. With random values the difference gets smaller. Never the less this is a bug.

I have been thinking about how to improve the accuracy of the calculations with regard to how borders are treated every now and then but I haven't come up with good ideas yet (which do make this significantly slower). I have a workaround in mind at least for the functions in the statistics module but haven't implemented and tested it yet. How quick would you need a solution?

Another question I have: How would you feel about an additional dependency like Numba from a user point of view?

sfinkens commented 3 years ago

Thanks for the feedback @BENR0 ! There's no hurry, we're still in a testing phase and evaluating the best solution for our application. That is computing block-wise variograms like so:

data_array.coarsen(lon=boxsize, lat=boxsize).reduce(textory.statistics.variogram, lag=lag)

We're already using numba in that project, so this would be fine.

BENR0 / textory

Different results for numpy and dask arrays #11