US-GHG-Center / veda-config-ghg

Veda config for GHG
https://ghg-demo.netlify.app

Compare different zonal statistics method to quantify the effect of area weights and cross-validate the methods #209

Closed j08lue closed 8 months ago

j08lue commented 9 months ago

To complete our goal of providing grid-cell area weighted zonal statistics, we want to make sure that our zonal statistics calculations are accurate.

For that, we need to quantify

  1. What difference grid cell area weighting makes (see the sketch below)
  2. Whether the new implementation in TiTiler is accurate
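
For point 1, here is a minimal sketch of the kind of comparison we want to quantify, assuming a dataset on a regular lat/lon grid (file and variable names below are placeholders):

import numpy as np
import xarray as xr

# hypothetical dataset and variable names, on a regular lat/lon grid with ascending latitudes
ds = xr.open_dataset("casagfed_npp.nc")
da = ds["npp"].sel(lat=slice(0, 90))  # e.g. Northern Hemisphere

# plain (unweighted) zonal average
plain_mean = da.mean(dim=("lat", "lon"))

# on a regular lat/lon grid, cell area is proportional to cos(latitude),
# so cos(lat) serves as the grid cell area weight
weights = np.cos(np.deg2rad(da["lat"]))
weighted_mean = da.weighted(weights).mean(dim=("lat", "lon"))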
j08lue commented 9 months ago

Here is a notebook comparing different methods: https://gist.github.com/j08lue/ac09ed69e5dd58f4058466ec3b108b99

It runs without extra AWS S3 configuration in https://hub.ghg.center/ - please download it into the hub and run it there if you want to try things out or fix them.

Once the validation is completed, we will add this notebook to the GHG Center docs.

Main takeaways

The zonal average calculated with the current TiTiler (v0.2.3) implementation is identical to a plain (unweighted) zonal average computed with Xarray / rioxarray. So far so good.

image

Grid cell area weighting (here implemented by calculating cell areas on a spherical Earth) makes a difference of, on average, 35% for CASA-GFED3 averaged over the Northern Hemisphere. The difference is negligible, though, for an area spanning ~2000 km north-south near the equator (Dem Rep Congo bbox, not shown here).

image
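
For reference, on a spherical Earth the area of a grid cell spanning longitudes λ1 to λ2 and latitudes φ1 to φ2 (angles in radians) is A = R² (λ2 − λ1) (sin φ2 − sin φ1); this is the quantity behind such spherical cell-area weights.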

Next steps

We need to repeat this calculation with the new TiTiler implementation, either running locally or deployed.

If you have any comments on this comparison or want to further iterate on it, please go ahead - contributions welcome!

vincentsarago commented 8 months ago

@j08lue

I've replaced the titiler endpoint with a direct rio-tiler call

from rio_tiler.io import Reader

def generate_stats(item, geojson, asset_name):
    with Reader(item.assets[asset_name].href) as src:
        feature = geojson["features"][0]

        data = src.feature(feature, dst_crs=src.crs)

        # Get the coverage % array
        coverage_array = data.get_coverage_array(
            feature,
        )

        stats = data.statistics(
            coverage=coverage_array,
        )       

    return {
        **stats["b1"].model_dump(exclude_none=True),
        "start_datetime": item.properties["start_datetime"],
    }

but I'm getting the same results as with titiler.

Screenshot 2023-11-02 at 11 26 12 AM

Looking at the data and the AOI, it seems they are aligned

Screenshot 2023-11-02 at 11 26 56 AM
url = "s3://ghgc-data-store/casagfed-carbonflux-monthgrid-v3/GEOSCarb_CASAGFED3v3_NPP_Flux_Monthly_x720_y360_201712.tif"
with Reader(url) as src:
    print(src.bounds)
    print(src.dataset.transform)

(-180.0, -90.0, 180.0, 90.0)
| 0.50, 0.00,-180.00|
| 0.00,-0.50, 90.00|
| 0.00, 0.00, 1.00|

Because the AOI is perfectly aligned with the pixels, there is no reason for the coverage weights to be anything other than 1, which explains why I don't see a difference between weighted and unweighted statistics.

vincentsarago commented 8 months ago

😓 Looking at the notebook, I think there is a misunderstanding about what we are trying to achieve here!

In TiTiler/rio-tiler we added a method that takes the sub-pixel shape intersection into account in order to weight the statistics by pixel coverage.
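
To illustrate what that coverage weighting does, here is a toy example (array, bounds, and polygon are made up): pixels that are only partially covered by the geometry get a fractional weight.

import numpy
from rio_tiler.constants import WGS84_CRS
from rio_tiler.models import ImageData

# a 2x2 image covering 0..2 degrees in x and y (each pixel is 1x1 degree)
img = ImageData(
    numpy.ones((1, 2, 2), dtype="float32"),
    crs=WGS84_CRS,
    bounds=(0.0, 0.0, 2.0, 2.0),
)

# polygon covering the left pixel column fully and half of the right column
aoi = {
    "type": "Polygon",
    "coordinates": [[[0, 0], [1.5, 0], [1.5, 2], [0, 2], [0, 0]]],
}

print(img.get_coverage_array(aoi))
# approximately [[1.0, 0.5], [1.0, 0.5]]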

In the notebook you're using the area created from an external dataset (area.nc) to apply weights.

with xr.open_dataset("area.nc") as ds:
    area_cdo = ds["cell_area"].values
...

        data = xds_clip["band_data"].isel(band=0).to_masked_array()
        weights = xds_clip["area"].to_masked_array()
        weights.mask = data.mask
        timeseries["average_weighted"].append((data * weights).sum() / weights.sum())

Those are two different things and applying weights from another dataset is not possible via titiler/rio-tiler directly!

j08lue commented 8 months ago

Thanks for checking! The external weights (CDO-generated area.nc) were just for comparison / validation. We very much expect the API to take care of the weighting (or reprojection?).

j08lue commented 8 months ago

Had a clarifying call. The conclusion was that I'll try out the rio-tiler snippet above and make use of the dst_crs parameter to reproject to an equal-area projection before taking the statistics. If it works, we can expose that parameter in TiTiler, where it is currently not exposed.
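
Concretely, that plan amounts to roughly the following (a sketch based on the generate_stats function above; +proj=cea is just one candidate equal-area projection to validate):

from rasterio.crs import CRS
from rio_tiler.io import Reader

# one candidate equal-area projection (Equal Area Cylindrical)
EQUAL_AREA_CRS = CRS.from_proj4("+proj=cea")

def generate_stats_equal_area(item, geojson, asset_name):
    with Reader(item.assets[asset_name].href) as src:
        geom = geojson["features"][0]["geometry"]

        # warp the cutout to the equal-area projection before computing statistics
        data = src.feature(geom, dst_crs=EQUAL_AREA_CRS)

        coverage_array = data.get_coverage_array(geom)

        stats = data.statistics(coverage=coverage_array)

    return {
        **stats["b1"].model_dump(exclude_none=True),
        "start_datetime": item.properties["start_datetime"],
    }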

Furthermore, we discussed that methods specific to climate data (mostly NetCDF or Zarr on lat/lon grids) might be better served by a package that is optimized for that kind of data, possibly titiler-xarray or a future evolution of an Xarray-serving service.

j08lue commented 8 months ago

I did some testing with dst_crs and a few different equal-area projections - not successful yet, unfortunately:

image

Updated notebook: https://gist.github.com/j08lue/ac09ed69e5dd58f4058466ec3b108b99

The very different behavior of the equal-area projections makes me think that the method of reprojecting the image and geometry to some other projection is perhaps not robust enough. Not a surprise - there probably is no single projection that works for all regions and sizes.

Since this method is most important for climate datasets (mostly on regular geographic coordinate grids), maybe we should focus on implementing it for that specific case only.

j08lue commented 8 months ago

OK, it turns out the rio-tiler based function works well when I pick a good projection. Equal Area Cylindrical ("+proj=cea") was the only equal-area projection I found in PROJ or EPSG that gave the expected results.

I take as my truth / benchmark the lat/lon grid-cell-area weighted average (with weights from CDO or an equivalent function). For a dataset with strong variation over latitude, like global net primary production (NPP), the methods compare like this:

image

Here, average_titiler is the result from the plain /cog/statistics endpoint; it underestimates NPP because the low production at high latitudes is incorrectly weighted the same as the higher production at lower latitudes.

Both the rio-tiler method above and a rasterio function I coded up that uses WarpedVRT in a few lines give the expected results.
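
The exact rasterio function lives in the notebook; a minimal sketch of that WarpedVRT approach (the function below is illustrative, not the notebook code) could look like this:

import rasterio
from rasterio.crs import CRS
from rasterio.mask import mask as rio_mask
from rasterio.vrt import WarpedVRT
from rasterio.warp import transform_geom


def equal_area_zonal_mean(url, geometry, proj="+proj=cea"):
    """Mean of all valid pixels inside `geometry` (GeoJSON geometry, EPSG:4326),
    computed on a grid warped to an equal-area projection."""
    dst_crs = CRS.from_proj4(proj)
    with rasterio.open(url) as src:
        # warp on the fly to the equal-area CRS (default nearest-neighbour resampling)
        with WarpedVRT(src, crs=dst_crs) as vrt:
            # reproject the AOI into the same CRS and clip to it
            geom = transform_geom("EPSG:4326", dst_crs, geometry)
            data, _ = rio_mask(vrt, [geom], crop=True, filled=False)
            # in an equal-area projection every pixel covers the same area,
            # so a plain mean over valid pixels is the area-weighted average
            return float(data[0].mean())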

tl;dr

@vincentsarago, I need to use this dst_crs=rasterio.crs.CRS.from_proj4("+proj=cea") in the VEDA TiTiler /cog/statistics in one way or the other, so the application can calculate accurate statistics.

How do we do this best? Expose dst_crs with some heuristics to find out whether the user passed a PROJ, WKT, or EPSG code? 🤷 Or adding an area_weighted=True flag? Anything is fine. I saw that nearest neighbor is hard-coded as the method for reprojecting to dst_crs. That may not be ideal, but it was fine for my tests with large areas so far.
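
On the heuristics: rasterio's CRS.from_user_input already accepts all three forms, so the endpoint probably does not need much custom parsing logic - a quick sketch:

from rasterio.crs import CRS

# PROJ strings, "authority:code" strings, and WKT are all accepted
CRS.from_user_input("+proj=cea")
CRS.from_user_input("EPSG:6933")  # EASE-Grid 2.0 Global, also an equal-area CRS
CRS.from_user_input(CRS.from_epsg(6933).to_wkt())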

I'll do some more testing with other source image reference systems than geographic coordinates and smaller geometries, to be on the safe side, but let's move forward making a plan for the implementation in VEDA and (most importantly) GHG Center backend, please.

vincentsarago commented 8 months ago

@j08lue I might have an idea!

We could have a custom algorithm that is applied to the image before the statistics are calculated. This would only be available for the /statistics [POST] endpoint.

Basically you would tell the tiler to apply algorithm=area_weight (naming TBD). The area_weight algorithm would take most of the code from your notebook and would have to be registered in the algorithm dependency within the application, just like the colormap we just added (see the registration sketch after the algorithm code below).

This will be possible with next version of titiler https://github.com/developmentseed/titiler/pull/726 (but compatible with the next veda-backend release)

Algorithm sketch:


import numpy
from rio_tiler.models import ImageData
from rio_tiler.constants import WGS84_CRS

from titiler.core.algorithm.base import BaseAlgorithm


def surfaceAreaGrid(lat_specs, lon_specs):
    """
    Construct a 2D array (nlat, nlon) where each element is the surface area of the corresponding grid cell in square meters, assuming a spherical Earth.
    lat_specs is a tuple, (starting latitude, ending latitude, latitude divisions)
    lon_specs is a tuple, (starting longitude, ending longitude, longitude divisions)

    Example: surfaceAreaGrid((-30, 60, 45), (20, 80, 120)) should return a (45, 120) array with the surface areas of the
    (2 degree lat x 0.5 degree lon) cells in square meters.
    """
    lat_beg, lat_end, lat_div = lat_specs
    lon_beg, lon_end, lon_div = lon_specs

    R_e = 6371000.0  # radius of Earth in meters
    dLon = (numpy.pi / 180.0) * (lon_end - lon_beg) / lon_div

    # The area between two latitudes is R^2 * dLon * (sin(lat_top) - sin(lat_bottom)),
    # so evaluate R^2 * dLon * sin(lat) at the cell edges and difference along latitude.
    dS = numpy.zeros((lat_div + 1, lon_div), numpy.float64)
    Lats = (numpy.pi / 180.0) * numpy.linspace(lat_beg, lat_end, lat_div + 1)
    for i, lat in enumerate(Lats):
        dS[i] = R_e * R_e * dLon * numpy.sin(lat)

    return numpy.diff(dS, axis=0)


class WeightedArea(BaseAlgorithm):
    """Apply area weights to the image before statistics are calculated."""

    # metadata
    input_nbands: int = 1
    output_nbands: int = 1
    output_dtype: str = "float32"

    def __call__(self, img: ImageData) -> ImageData:
        """Apply weights."""
        assert img.crs == WGS84_CRS, "'WeightedArea' algorithm only works with datasets in EPSG:4326 projection"

        # Compute the per-pixel area weights from img.bounds and shape.
        # surfaceAreaGrid returns rows from south to north; flip so that row 0
        # is the northernmost row, matching the image array.
        weights = numpy.flipud(
            surfaceAreaGrid(
                (img.bounds[1], img.bounds[3], img.height),
                (img.bounds[0], img.bounds[2], img.width),
            )
        )

        # Scale pixels by their normalized cell area (the mask of the masked
        # array is preserved), so that a plain mean computed downstream
        # approximates the area-weighted mean (exact when no pixels are masked).
        arr = img.array.astype("float32") * (weights / weights.mean())

        return ImageData(
            arr,
            assets=img.assets,
            crs=img.crs,
            bounds=img.bounds,
            band_names=img.band_names,
        )
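
And roughly how such an algorithm could be registered and wired into the application (a sketch following the custom-algorithm pattern from the titiler docs; the factory wiring below is illustrative):

from titiler.core.algorithm import Algorithms
from titiler.core.algorithm import algorithms as default_algorithms
from titiler.core.factory import TilerFactory

# register the custom algorithm under the name used in the query parameter,
# e.g. /cog/statistics [POST] with algorithm=area_weight
algorithms: Algorithms = default_algorithms.register({"area_weight": WeightedArea})

cog = TilerFactory(process_dependency=algorithms.dependency)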

Next

j08lue commented 8 months ago

Oooh, I see dst_crs is already available on POST to /cog/statistics on the updated TiTiler endpoint https://test-raster.delta-backend.com/docs#/Cloud%20Optimized%20GeoTIFF/geojson_statistics_cog_statistics_post

I'll test that right away.

vincentsarago commented 8 months ago

@j08lue you jinxed me! We commented two solutions at the same time!

I need to use this dst_crs=rasterio.crs.CRS.from_proj4("+proj=cea") in the VEDA TiTiler /cog/statistics in one way or the other, so the application can calculate accurate statistics.

You should be able to pass dst_crs=+proj=cea (you'll need to URI-encode the proj string to %2Bproj%3Dcea).
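
For example, with the standard library (the requests example further down achieves the same thing automatically by passing dst_crs via params):

from urllib.parse import quote

quote("+proj=cea", safe="")  # -> '%2Bproj%3Dcea'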

How do we do this best? Expose dst_crs with some heuristics to find out whether the user passed a PROJ, WKT, or EPSG code?

dst_crs should be exposed in the next release of veda-backend:

in titiler-pgstac (0.8.0): https://github.com/stac-utils/titiler-pgstac/blob/0.8.0/titiler/pgstac/factory.py#L1053
in titiler (0.15.3): https://github.com/developmentseed/titiler/blob/0.15.3/src/titiler/core/titiler/core/factory.py#L446

(Note to self: I need to update the veda-backend requirements for titiler)

Or adding an area_weighted=True flag?

adding a new flag is a bit more complex!

I saw that nearest neighbor is hard-coded as the method for reprojecting to dst_crs. That may not be ideal, but it was fine for my tests with large areas so far.

starting with titiler 0.15.2, there are now 2 resampling options:

ref: https://github.com/developmentseed/titiler/blob/0.15.3/CHANGES.md#0152-2023-10-23

I'll do some more testing with other source image reference systems than geographic coordinates and smaller geometries, to be on the safe side, but let's move forward making a plan for the implementation in VEDA and (most importantly) GHG Center backend, please.

If we are happy with using dst_crs=+proj=cea, then it should already be available in the next release of veda-backend 🙏

j08lue commented 8 months ago

This is awesome news, @vincentsarago!

I'll do some more testing with our main use case - datasets with WGS84 coordinates. Scientific accuracy is key. If I find that the dst_crs method is not reliable enough, we may need to consider the custom algorithm option you suggested. But for now I expect that we are very close to having all we need.

j08lue commented 8 months ago

Hmm, getting a 500 error with a POST request to

https://test-raster.delta-backend.com/cog/statistics?url=s3%3A%2F%2Fveda-data-store-staging%2Fno2-monthly%2FOMI_trno2_0.10x0.10_202109_Col3_V4.nc.tif&dst_crs=%2Bproj%3Dcea

while the request without dst_crs works. Simple example reproducing the issue:

import requests

RASTER_API_URL = "https://test-raster.delta-backend.com"

ITEM_ASSET_URL = "s3://veda-data-store-staging/no2-monthly/OMI_trno2_0.10x0.10_202109_Col3_V4.nc.tif"

AOI = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [-115, 82],
                        [-115, -82],
                        [-43, -82],
                        [-43, 82],
                        [-115, 82],
                    ]
                ],
            },
        }
    ],
}

result = requests.post(
    f"{RASTER_API_URL}/cog/statistics",
    params={
        "url": ITEM_ASSET_URL,
        "dst_crs": "+proj=cea"
    },
    json=AOI,
)
result.raise_for_status()
result.json()

Do you have access to the function logs to see what is going wrong, @vincentsarago?

vincentsarago commented 8 months ago

Note:

https://test-raster.delta-backend.com/cog/statistics?url=s3%3A%2F%2Fveda-data-store-staging%2Fno2-monthly%2FOMI_trno2_0.10x0.10_202109_Col3_V4.nc.tif&dst_crs=%2Bproj%3Dcea

dst_crs is not available for the /statistics [GET] endpoint, only for the [POST] GeoJSON endpoint.

There might be a bug in rio-tiler or titiler

[ERROR] 2023-11-10T10:51:37.886Z    49d3191d-81ad-4546-801f-0ad3e7855b5d    An error occurred running the application.
Traceback (most recent call last):
  File "/tmp/pip-target-f75zho4z/lib/python/mangum/protocols/http.py", line 58, in run
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/applications.py", line 1106, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/applications.py", line 122, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/errors.py", line 184, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/errors.py", line 162, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/titiler/core/middleware.py", line 63, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette_cramjam/middleware.py", line 112, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette_cramjam/middleware.py", line 142, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/cors.py", line 83, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/exceptions.py", line 79, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/exceptions.py", line 68, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/middleware/asyncexitstack.py", line 20, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/middleware/asyncexitstack.py", line 17, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 718, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 276, in handle
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 66, in app
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/routing.py", line 274, in app
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/routing.py", line 193, in run_endpoint_function
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/concurrency.py", line 41, in run_in_threadpool
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/to_thread.py", line 33, in run_sync
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/_backends/_asyncio.py", line 807, in run
  File "/tmp/pip-target-f75zho4z/lib/python/titiler/core/factory.py", line 482, in geojson_statistics
  File "/tmp/pip-target-f75zho4z/lib/python/rio_tiler/models.py", line 826, in get_coverage_array
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/env.py", line 398, in wrapper
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/env.py", line 630, in wrapper
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/warp.py", line 104, in transform_geom
  File "rasterio/_warp.pyx", line 114, in rasterio._warp._transform_geom
  File "rasterio/_warp.pyx", line 71, in rasterio._warp._transform_single_geom
  File "rasterio/_features.pyx", line 693, in rasterio._features.OGRGeomBuilder.build
ValueError: Unsupported geometry type Feature

☝️ this is a log from titiler.xyz (devseed aws)

vincentsarago commented 8 months ago

Fix: https://github.com/cogeotiff/rio-tiler/pull/653

Sadly, this will require a new deployment to pick up the newest rio-tiler version.

j08lue commented 8 months ago

I can confirm that with this bug fix (testing against dev.ghg.center/api/raster), the method to reproject the data before extracting statistics gives the same average as calculating weights by hand with the geographic coordinates formula.

image

I will prepare the notebooks for our (GHG Center and VEDA) docs, so we have somewhere to point to for others to validate our methods.