cogeotiff / rio-tiler

User friendly Rasterio plugin to read raster datasets.
https://cogeotiff.github.io/rio-tiler/
BSD 3-Clause "New" or "Revised" License

Incorrect array statistics with (coverage) weights #680

Closed: j08lue closed this issue 5 months ago

j08lue commented 6 months ago

Version 6.2.0 introduced the ability to calculate array statistics with coverage weights.

However, we need to review the way the weighted stats are calculated in the get_array_statistics function.

https://github.com/cogeotiff/rio-tiler/blob/73175d0607ef28bffcb6f03ff8dfd2d6abfbc191/rio_tiler/utils.py#L36
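For reference, the failing expectation in item 1 below comes from a setup roughly like this (my reconstruction, assuming the coverage keyword added in 6.2.0):

```python
import numpy as np
from rio_tiler.utils import get_array_statistics

# 1 band, 2x2 pixels, nothing masked
data = np.ma.MaskedArray(
    data=np.array([[[1, 2], [3, 4]]], dtype="uint8"),
    mask=np.zeros((1, 2, 2), dtype="bool"),
)
# fraction of each pixel covered by the query geometry
coverage = np.array([[0.5, 0.0], [1.0, 0.25]])

stats = get_array_statistics(data, coverage=coverage)
print(stats[0]["mean"])  # currently 1.125, see below
```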

  1. mean - we currently take the mean of the pixel values scaled by the weights. It should be the sum of the scaled values divided by the sum of the weights - or use numpy.average(..., weights=coverage).
    assert stats[0]["mean"] == 1.125  # (1 * 0.5 + 2 * 0.0 + 3 * 1.0 + 4 * 0.25) / 4

    should be divided by 1.75 (the sum of the weights), i.e. 4.5 / 1.75 ≈ 2.571

  2. std, percentiles, median - we need weighted versions of these as well. They are straightforward to compute with NumPy: weighted standard deviation, weighted percentiles, and the weighted median as the 0.5 weighted percentile (see the sketch after this list).
  3. count - what is that for weighted data? Count of all finite values where weights are non-zero? Or the sum of the weights?
  4. min, max, minority, majority, unique and also histogram - we should probably only take values into account where weights are non-zero.
  5. sum is correct (sum of weighted values)
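A minimal sketch of the weighted formulas I mean, using plain NumPy on the example above (weighted_quantile is just an illustrative helper, not something in rio-tiler):

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Interpolated weighted quantile, q in [0, 1] (weighted median is q=0.5)."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    # cumulative weights, centred on each sample and normalised to [0, 1]
    cw = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)
    return np.interp(q, cw, values)

values = np.array([1.0, 2.0, 3.0, 4.0])
coverage = np.array([0.5, 0.0, 1.0, 0.25])

w_mean = np.average(values, weights=coverage)        # 4.5 / 1.75 ≈ 2.571
w_std = np.sqrt(np.average((values - w_mean) ** 2, weights=coverage))
w_median = weighted_quantile(values, coverage, 0.5)
w_count = coverage.sum()                             # 1.75, if count = sum of weights
w_sum = np.sum(values * coverage)                    # 4.5, unchanged
```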
vincentsarago commented 6 months ago

Thanks @j08lue 🙏

count - what is that for weighted data? Count of all finite values where weights are non-zero? Or the sum of the weights?

I would sum the weights, I think - that seems to be the current behaviour: https://github.com/cogeotiff/rio-tiler/blob/main/rio_tiler/utils.py#L139

min, max, minority, majority, unique and also histogram - we should probably only take values into account where weights are non-zero.

so in https://github.com/cogeotiff/rio-tiler/blob/73175d0607ef28bffcb6f03ff8dfd2d6abfbc191/rio_tiler/utils.py#L134-L163 we should use array instead of data

j08lue commented 6 months ago

I would sum the weights

Yes, that must be the right way.

we should use array instead of data

Hmm, no, I think we need to get rid of multiplying coverage with data for all stats and only do that for mean.

What I meant was to make sure we expand the mask of data by the points where coverage equals zero. Since we compress that array (take out masked elements), we would then also exclude pixels with no coverage.

https://github.com/cogeotiff/rio-tiler/blob/73175d0607ef28bffcb6f03ff8dfd2d6abfbc191/rio_tiler/utils.py#L97

Ideally, data.mask and coverage == 0 should be identical anyway, if data.mask is the rasterized query polygon and coverage is a refined version of that mask around the edges. But we need to be extra sure. Something like

data.mask[:, coverage == 0] = True
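Something along these lines (a rough sketch, not the actual utils.py code):

```python
import numpy as np

# 1 band, 2x2 pixels; treat zero-coverage pixels as masked
data = np.ma.MaskedArray(
    data=np.array([[[1.0, 2.0], [3.0, 4.0]]]),
    mask=np.zeros((1, 2, 2), dtype="bool"),
)
coverage = np.array([[0.5, 0.0], [1.0, 0.25]])

data.mask[:, coverage == 0] = True          # expand the mask where coverage is zero
valid = data[0].compressed()                # -> [1., 3., 4.]; masked pixels drop out
valid_coverage = coverage[~data.mask[0]]    # -> [0.5, 1., 0.25]; matching weights
```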