US-GHG-Center / veda-config-ghg

Veda config for GHG
https://ghg-demo.netlify.app

Validate API results for grid cell area weighted zonal statistics #251

Closed j08lue closed 3 months ago

j08lue commented 7 months ago

Context

The VEDA data services APIs include a service to calculate zonal statistics (e.g. average, median, standard deviation, min, max) over (subsets of) a raster / gridded dataset.

Some of these datasets span large geographic areas and use projections in which data points correspond to varying areas / volumes on the ground. A common case is global datasets of climate variables on a regular lat/lon grid.

Some of the statistics we calculate are sensitive to the represented area - e.g. simply averaging data points despite their varying grid cell / pixel area gives a result that over-represents the data points with small area, which for lat/lon grids are those at higher latitudes. For example, for a field of global sea surface temperatures that are lower towards the poles, the plain, unweighted average would be lower than the accurate, area-weighted one.
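To illustrate the bias on a regular lat/lon grid, here is a minimal, self-contained sketch (not part of the API code; the synthetic data and numbers are made up) comparing a plain mean with a cos(latitude)-weighted mean for a field that decreases towards the poles:

```python
import numpy as np
import xarray as xr

# Synthetic "sea surface temperature" on a 1-degree global grid,
# warm at the equator and colder towards the poles.
lat = np.arange(-89.5, 90, 1.0)
lon = np.arange(-179.5, 180, 1.0)
sst = 30 * np.cos(np.deg2rad(lat))[:, None] * np.ones((lat.size, lon.size))
da = xr.DataArray(sst, coords={"lat": lat, "lon": lon}, dims=("lat", "lon"))

# Plain mean treats every grid cell equally and over-represents the poles.
plain_mean = da.mean().item()

# On a regular lat/lon grid, cell area is proportional to cos(latitude),
# so cos(lat) can serve as the area weight.
weights = np.cos(np.deg2rad(da.lat))
weighted_mean = da.weighted(weights).mean().item()

print(f"plain mean:    {plain_mean:.2f}")     # lower - poles over-represented
print(f"weighted mean: {weighted_mean:.2f}")  # the area-accurate value
```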

Method

To mitigate this inaccuracy, we implemented a method that reprojects the source data to an equal-area projection before calculating the statistics. For a lat/lon grid, we could also have calculated the grid cell / pixel area weights directly, since the formula is simple. However, since the API handles data in an arbitrary source projection, we chose the reprojection approach, which should work for any source projection.

In the API code, what is going on is basically this: https://github.com/NASA-IMPACT/veda-config-ghg/issues/209#issuecomment-1790482261. I am happy to give more hints on where to find the exact code, if need be.
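For illustration, a rough sketch of that reprojection approach with rioxarray (this is not the exact API code, which builds on rio-tiler; the file name and the choice of EPSG:6933 as the equal-area target CRS are assumptions):

```python
import rioxarray  # registers the .rio accessor
from rasterio.enums import Resampling

# Hypothetical single-band raster; any source CRS works.
da = rioxarray.open_rasterio("example.tif", masked=True).squeeze("band", drop=True)

# Reproject to an equal-area CRS so every pixel represents the same area.
da_eq = da.rio.reproject("EPSG:6933", resampling=Resampling.bilinear)

# After reprojection, plain per-pixel statistics are area-representative.
print(
    {
        "mean": float(da_eq.mean()),
        "std": float(da_eq.std()),
        "min": float(da_eq.min()),
        "max": float(da_eq.max()),
    }
)
```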

The ask

We need to test the robustness of the API implementation and build confidence that it is ready for a production release:

  1. What are the standard cases we expect (e.g. continental-scale averages over a global lat/lon dataset)? Do we get accurate results for these?
  2. What are edge cases where the method could fail, and how does it perform there?

Approach

Whatever works - we have a live API for testing that has access to all the datasets in the GHG Center and VEDA STAC catalogs.

The results should be documented in an executable Jupyter Notebook, for future reference.

A notebook that runs a validation for a standard case is already in the VEDA docs and can be downloaded from the docs repo. You can launch this notebook directly in the GHG Center JupyterLab with this link: https://hub.ghg.center/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FNASA-IMPACT%2Fveda-docs%2F&urlpath=lab%2Ftree%2F%2Fnotebooks%2Ftutorials%2Fzonal-statistics-validation.ipynb&branch=main

Acceptance criteria

abarciauskas-bgse commented 6 months ago

I started working on this here: https://github.com/NASA-IMPACT/veda-docs/tree/ab/zonal-stats-validation

I would like to add a few datasets to test from VEDA original, perhaps from the boreal. As there are already some differences, it may be necessary to dig deeper into the internals of each method. I am also interested in a comparison with exactextract.

abarciauskas-bgse commented 6 months ago

I started to look at datasets in VEDA which are not in 4326. The notebook as-is requires the 4326 projection. I started looking into how to clip the dataset in its native projection (such as 3857, by reprojecting the geojson, which is in 4326, to the same projection as the original dataset) and into using rad2deg as a way to get the weights of the grid cells for different conformal projections. I'm not sure of the results (see the branch above, notebook veda-docs/notebooks/tutorials/veda-zonal-statistics-validation.ipynb) but will take another look tomorrow. Any pointers are welcome.
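For reference, a minimal sketch of that clipping step with rioxarray and geopandas (the file names are placeholders; this reprojects the 4326 geojson to the dataset's CRS before clipping):

```python
import geopandas as gpd
import rioxarray

# Hypothetical inputs: a raster in e.g. EPSG:3857 and an AOI geojson in EPSG:4326.
da = rioxarray.open_rasterio("dataset_3857.tif", masked=True).squeeze("band", drop=True)
aoi = gpd.read_file("aoi_4326.geojson")

# Reproject the AOI to the dataset's native CRS, then clip in that CRS.
aoi_native = aoi.to_crs(da.rio.crs)
clipped = da.rio.clip(aoi_native.geometry, crs=aoi_native.crs)

# Plain (unweighted) mean over the clipped pixels.
print(float(clipped.mean()))
```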

abarciauskas-bgse commented 6 months ago

I have been comparing the results of the rioxarray clip method with titiler, rio_tiler, and the methods @j08lue built, just calculating a simple mean, in this notebook: https://github.com/NASA-IMPACT/veda-docs/blob/ab/zonal-stats-validation/notebooks/tutorials/different-stats-methods.ipynb

To assess some "edge cases", I included 2 datasets whose CRS is not 4326 (3857 and 3395). I have also included a notebook to inspect the CRS of datasets in VEDA STAC: https://github.com/NASA-IMPACT/veda-docs/blob/ab/zonal-stats-validation/notebooks/tutorials/get-epsgs.ipynb. At the bottom, you see the counts: most datasets are in 4326, and 17 are in 3395, the projection used for all of the datasets associated with Houston. I have a few questions about the results, so I would appreciate it if @j08lue and @vincentsarago took a look at the notebook; then we could meet to assess whether any of it is useful.
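A rough sketch of how such a CRS inventory could be produced with pystac-client (not taken from the notebook; the STAC endpoint URL is a placeholder and the use of the proj:epsg item property is an assumption):

```python
from collections import Counter

from pystac_client import Client

STAC_URL = "https://example.com/api/stac"  # placeholder - use the actual VEDA STAC API URL

client = Client.open(STAC_URL)
epsg_counts = Counter()

for collection in client.get_collections():
    # Look at the first item of each collection and assume the projection
    # extension property "proj:epsg" is populated on the items.
    first = next(collection.get_items(), None)
    if first is not None:
        epsg_counts[first.properties.get("proj:epsg")] += 1

print(epsg_counts)
```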

I still agree that finding another expert in raster calculations would be helpful to assess what other testing or documentation might be useful.

abarciauskas-bgse commented 5 months ago

@j08lue @vincentsarago thanks for meeting today so late! As I mentioned on the call, I am going to pause on investigating some of the open questions I have in those notebooks, but it sounds like @vincentsarago is willing to investigate why the coverage array is all 1s when both the dataset and the geojson bounding box are in the 4326 projection and the geojson bounding box only partially covers some of the data grid cells.

@vincentsarago I will send you the file that goes along with that notebook for the lis-global-da-totalprecip dataset via slack.

j08lue commented 5 months ago

Status update:

  1. We added an option in rio-tiler and downstream packages to disable resampling of the pixels touched by the AOI when no reprojection is required (dst_crs == src_crs). This is now nicely documented in the rio-tiler docs: https://cogeotiff.github.io/rio-tiler/advanced/statistics/#area-weighted-statistics (see also the sketch after this list).
  2. We are now well on top of what exactly happens in different cases (reprojection vs no reprojection, rectangular vs rotated geometry, etc.) and can benchmark and document these cases in a new version of the validation notebook.
  3. Passing the new option through to titiler-pgstac and veda-backend is underway.
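For reference, a minimal sketch of the coverage-weighted statistics pattern, roughly following the linked rio-tiler docs (the file name, the AOI, and the band key "b1" are placeholders):

```python
from rio_tiler.io import Reader

# Placeholder AOI (GeoJSON feature in EPSG:4326).
feature = {
    "type": "Feature",
    "properties": {},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-10, 40], [10, 40], [10, 50], [-10, 50], [-10, 40]]],
    },
}

with Reader("example-cog.tif") as src:
    # Read the data covered by the feature in the dataset's own CRS
    # (no reprojection, so no resampling of the touched pixels).
    img = src.feature(feature, dst_crs=src.crs, max_size=None)

    # Fraction (0..1) of each pixel covered by the geometry.
    coverage = img.get_coverage_array(feature)

    # Statistics weighted by the per-pixel coverage fraction.
    stats = img.statistics(coverage=coverage)
    print(stats["b1"].mean)
```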

j08lue commented 4 months ago

Another status update:

  1. @moradology suggested an end-to-end zonal statistics test https://github.com/NASA-IMPACT/veda-backend/pull/284
  2. Working on a simpler test case https://gist.github.com/j08lue/436162ac8eb90241a98b2702416356ac, I discovered that the weighted statistics calculations in rio-tiler need revision: https://github.com/cogeotiff/rio-tiler/issues/680
  3. Word of caution: the pixel-geometry intersection fraction (aka coverage) calculation in rio-tiler can be quite inaccurate for low cover_scale and diagonal intersections, as this test shows: https://github.com/cogeotiff/rio-tiler/pull/679. This only affects the "fringes", though, where the geometry partially overlaps pixels, and once the stats calculation is fixed, it will probably still be more accurate than simple "all touched" stats (where all pixels touched by the geometry are counted equally).

j08lue commented 3 months ago

Statistics calculations were fixed in rio-tiler and are now benchmarked against exactextract:

We successfully completed a comparison between a known method to calculate area-weighted statistics (for datasets in geographic coordinates) and the latest rio-tiler, showing that rio-tiler produces results that agree well enough with the known method:

[image: benchmark comparison plot]

Here is a gist with a notebook doing the comparison: https://gist.github.com/j08lue/eb65d3d816878e9bcc53376e42c0bae3#file-zonal-statistics-validation-south-america-total-fluxes-ipynb
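Not reproduced here, but for orientation, a coverage-weighted zonal mean with exactextract's Python bindings looks roughly like this (file names are placeholders and the exact operations used in the gist may differ):

```python
import geopandas as gpd
from exactextract import exact_extract

# Placeholder inputs: a raster of flux densities and an AOI polygon layer.
aoi = gpd.read_file("south-america.geojson")

# exactextract weights each pixel by the exact fraction of its area that is
# covered by the polygon when computing the statistics.
result = exact_extract("fluxes.tif", aoi, ["mean", "count"])
print(result)
```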

There is a little gotcha in my use of rio-tiler - I use the area calculated from geographic coordinates to convert the averages rio-tiler produces (averages are all we get from it) to total fluxes, so the results are comparable and in familiar units. I also tried it the other way around, changing the geographic-coordinates formula to calculate averages instead - the result was the same: rio-tiler's result is slightly lower (by about 0.1%).
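To make that conversion explicit, here is a small sketch of the arithmetic (the grid, values, and units are illustrative only): per-cell areas are computed from the geographic coordinates, the area-weighted mean flux density is multiplied by the total AOI area, and the result is a total flux.

```python
import numpy as np

EARTH_RADIUS = 6_371_000.0  # m, spherical approximation

def cell_areas(lat, dlat, dlon):
    """Approximate areas (m^2) of regular lat/lon grid cells centred at lat."""
    return (
        EARTH_RADIUS**2
        * np.deg2rad(dlat)
        * np.deg2rad(dlon)
        * np.cos(np.deg2rad(lat))
    )

# Illustrative values: a 1-degree column of flux densities (kg m-2 s-1).
lat = np.arange(-9.5, 10.0, 1.0)
flux_density = np.full(lat.shape, 1e-8)
areas = cell_areas(lat, dlat=1.0, dlon=1.0)

# rio-tiler reports an (area-weighted) mean flux density; multiplying it by
# the total AOI area converts it to a total flux in kg s-1.
weighted_mean = np.average(flux_density, weights=areas)
total_flux = weighted_mean * areas.sum()
print(total_flux)
```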

I’ll move forward and implement this method in the GHG Center Exploration & Analysis interface and then we can repeat the test there.

What remains to be done is to repeat the same benchmark test as above in the UI, to validate the end-user result.