OCHA-DAP / ds-raster-stats

Pipelines for computing raster statistics from COG datasets

FloodScan - NA/NULL values in stats (mean) - admin 2 #20

Open zackarno opened 3 days ago

zackarno commented 3 days ago

There appear to be NA/NULL values in the zonal stats.

Here is the SQL query to see them:

SELECT "iso3", "pcode", "valid_date", "adm_level", "band", "mean"
FROM "floodscan"
WHERE ("adm_level" = 2.0) AND ("band" = 'SFED') AND (("mean" IS NULL))

They are all from MOZ and NGA, with a lot of occurrences across the dates:

iso3 num_occurences
MOZ 9813
NGA 127582

But only 14 pcodes in total have this issue on mean. I assume it's the same for the other stats except count and sum.

iso3 num_pcodes
MOZ 1
NGA 13
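
For reference, per-country counts like the two tables above could come from a single grouped query. The following is only a sketch: it runs against an in-memory SQLite table with made-up rows (the pcodes are placeholders, not real ones) and assumes the same column names as the query above. The real database engine and data will differ.

```python
import sqlite3

# Toy in-memory stand-in for the "floodscan" table; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE floodscan (iso3 TEXT, pcode TEXT, valid_date TEXT, "
    "adm_level INTEGER, band TEXT, mean REAL)"
)
rows = [
    ("MOZ", "MZ0001", "2024-01-01", 2, "SFED", None),
    ("MOZ", "MZ0001", "2024-01-02", 2, "SFED", None),
    ("MOZ", "MZ0002", "2024-01-01", 2, "SFED", 0.1),
    ("NGA", "NG0001", "2024-01-01", 2, "SFED", None),
    ("NGA", "NG0002", "2024-01-01", 2, "SFED", None),
]
conn.executemany("INSERT INTO floodscan VALUES (?, ?, ?, ?, ?, ?)", rows)

# Count NULL-mean rows and the distinct pcodes behind them, per country.
summary_sql = """
SELECT iso3,
       COUNT(*) AS num_occurences,
       COUNT(DISTINCT pcode) AS num_pcodes
FROM floodscan
WHERE adm_level = 2 AND band = 'SFED' AND mean IS NULL
GROUP BY iso3
ORDER BY iso3
"""
summary = conn.execute(summary_sql).fetchall()
print(summary)  # [('MOZ', 2, 1), ('NGA', 2, 2)]
```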

hannahker commented 3 days ago

@zackarno I'll take a look into this! On first look though, it's not unexpected to see some NULL values (especially at the adm2 level). This happens in cases where the admin polygon is too small to have any pixel centroids contained within it.
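
To illustrate the centroid effect described here, below is a minimal pure-Python sketch. It is not the pipeline's code: the toy 2x2 grid, the axis-aligned box standing in for an admin polygon, and the function name `centroid_mean` are all assumptions made for the example.

```python
import math

def centroid_mean(raster, pixel_size, bbox):
    """Return (mean, count) over pixels whose centroid falls inside bbox.

    raster: list of rows; pixel (row r, col c) covers
            [c, c+1] x [r, r+1] scaled by pixel_size (toy grid, hypothetical).
    bbox:   (xmin, ymin, xmax, ymax), an axis-aligned stand-in for a polygon.
    """
    xmin, ymin, xmax, ymax = bbox
    selected = []
    for r, row in enumerate(raster):
        for c, value in enumerate(row):
            cx = (c + 0.5) * pixel_size  # pixel centroid x
            cy = (r + 0.5) * pixel_size  # pixel centroid y
            if xmin <= cx <= xmax and ymin <= cy <= ymax:
                selected.append(value)
    count = len(selected)
    # With no centroid inside the box the mean is undefined, so return NaN.
    mean = sum(selected) / count if count else math.nan
    return mean, count

raster = [[0.0, 0.2],
          [0.4, 0.6]]

# A box covering the top row contains two centroids.
big_mean, big_count = centroid_mean(raster, 1.0, (0.0, 0.0, 2.0, 1.0))

# A box smaller than one pixel that misses every centroid.
small_mean, small_count = centroid_mean(raster, 1.0, (0.6, 0.6, 0.9, 0.9))
print(small_mean, small_count)  # nan 0
```

In this sketch the count is still a valid number (0) while the mean is undefined, which would be consistent with only some of the stats showing NULL.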

zackarno commented 3 days ago

yea that makes sense. So we'd either need to adjust the raster stats method or develop guidance on how we should deal with this in downstream analysis.

For the FloodScan use case we are publishing datasets at the admin 2 level, so it's not ideal to have to exclude admins from the datasets.

hannahker commented 2 days ago

> So we'd either need to adjust the raster stats method or develop guidance on how we should deal with this in downstream analysis.

Yeah @zackarno I think our options would be to:

  1. Switch to a weighted calculation method (à la exactextract)
  2. Publish with the disclaimer that some admins will have NA values. IMO, we'd say something like:

"Note that some administrative boundaries may not have summary statistics available. This happens when administrative polygons are sufficiently small relative to the size of the input raster dataset. In this case, we'd recommend performing your analysis across a larger spatial scale. For example, if you find values missing for a particular Admin 2 boundary, you may want to instead consider performing your analysis at the Admin 1 level."

I think we should do 1 eventually, but not prioritize it at the moment, and go with 2 for now.
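
As a rough picture of what option 1 means, here is a pure-Python sketch of a coverage-weighted mean, where each pixel is weighted by the fraction of its area that the polygon covers. It follows the general idea behind tools like exactextract but does not use or reproduce that library's API. The axis-aligned box, the toy grid, and the function name `coverage_weighted_mean` are assumptions for illustration only.

```python
import math

def coverage_weighted_mean(raster, pixel_size, bbox):
    """Mean of pixel values weighted by the fraction of each pixel inside bbox.

    bbox is (xmin, ymin, xmax, ymax); an axis-aligned box keeps the
    overlap arithmetic simple (real admin polygons need true intersections).
    """
    xmin, ymin, xmax, ymax = bbox
    weighted_sum = 0.0
    total_weight = 0.0
    for r, row in enumerate(raster):
        for c, value in enumerate(row):
            px0, py0 = c * pixel_size, r * pixel_size
            # Overlap lengths between the pixel and the box along each axis.
            overlap_x = max(0.0, min(xmax, px0 + pixel_size) - max(xmin, px0))
            overlap_y = max(0.0, min(ymax, py0 + pixel_size) - max(ymin, py0))
            weight = (overlap_x * overlap_y) / pixel_size**2
            weighted_sum += value * weight
            total_weight += weight
    # Only a box with zero overlapping area leaves the mean undefined.
    return weighted_sum / total_weight if total_weight > 0 else math.nan

raster = [[0.1, 0.2],
          [0.4, 0.6]]

# A box smaller than one pixel that contains no pixel centroid.
tiny = coverage_weighted_mean(raster, 1.0, (0.6, 0.6, 0.9, 0.9))

# A box straddling the two top pixels, half of each.
straddle = coverage_weighted_mean(raster, 1.0, (0.5, 0.0, 1.5, 1.0))
```

Here the tiny box takes the value of the single pixel it sits in, so a polygon that contains no centroid still gets a mean.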

zackarno commented 2 days ago

Yeah, perhaps exactextract can be used in a future iteration/version. I think what you wrote sounds pretty good, but let's leave this open until a decision is made.

There are some additional complexities coming to mind. One is that we will need both NA and Inf values in the outputs, for different reasons. For example, if all values in the historical record are 0, or there is 0 variance, we need something like NA, but we will also have an RP threshold above which values will be Inf. I'm still trying to think of the best way to do all this, given that we want users with Excel-only skills to be able to work with the data easily, and with this column specifically in a quantitative way (i.e. we can't mix in strings etc.).
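
One way to picture the NA/Inf distinction while keeping the column purely numeric is sketched below. The RP formula (record length divided by the number of exceedances), the function name `return_period`, and the `rp_cap` parameter are all assumptions for illustration, not the method used for the published data. How NaN and Inf should be serialized for spreadsheet users is left open here.

```python
import math

def return_period(value, history, rp_cap):
    """Toy empirical return period that always returns a float.

    - NaN: the historical record has zero variance (e.g. all zeros),
           so no meaningful RP exists.
    - Inf: the estimated RP exceeds rp_cap, or the value was never
           reached in the record.
    """
    if len(set(history)) <= 1:
        return math.nan
    exceedances = sum(1 for h in history if h >= value)
    if exceedances == 0:
        return math.inf
    rp = len(history) / exceedances
    return math.inf if rp > rp_cap else rp

history = [0.0, 0.1, 0.2, 0.4, 0.8]  # made-up record

# Every entry is a float, so no strings are mixed into the column.
column = [
    return_period(0.2, history, rp_cap=4),    # finite RP
    return_period(0.8, history, rp_cap=4),    # above the cap: Inf
    return_period(0.0, [0.0] * 5, rp_cap=4),  # zero variance: NaN
]
```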