Closed: j08lue closed this issue 8 months ago.
Here is a notebook comparing different methods: https://gist.github.com/j08lue/ac09ed69e5dd58f4058466ec3b108b99
It runs without extra AWS S3 configuration in https://hub.ghg.center/ - please download it into the hub and run it there, if you want to try out or fix things.
Once the validation is completed, we will add this notebook to the GHG Center docs.
The zonal average calculated with the current TiTiler (v0.2.3) implementation is identical to a plain (unweighted) zonal average computed with Xarray / rioxarray. So far so good.
Grid cell area weighting (here implemented by calculating cell areas on a spherical Earth) makes a difference (of 35% on average) for CASA-GFED3 averaged over the Northern Hemisphere. The difference is negligible, though, for an area spanning ~2000 km north-south near the equator (Dem. Rep. Congo bbox, not shown here).
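For reference, the weighting used for the comparison boils down to the standard spherical-Earth cell-area formula. A minimal stand-alone sketch (the function name and the synthetic, cosine-shaped field are mine, not from the notebook) of how such weights change a zonal mean:

```python
import numpy as np

def cell_area_weights(lat_centers, dlat, dlon, radius=6371000.0):
    """Area of lat/lon grid cells on a spherical Earth, in m^2.

    lat_centers: 1D array of cell-center latitudes (degrees)
    dlat, dlon: cell sizes in degrees
    """
    lat1 = np.radians(lat_centers - dlat / 2)
    lat2 = np.radians(lat_centers + dlat / 2)
    # Exact spherical cell area: R^2 * dlon * (sin(lat2) - sin(lat1))
    return radius**2 * np.radians(dlon) * (np.sin(lat2) - np.sin(lat1))

# Hypothetical Northern Hemisphere field on a 0.5-degree grid that
# decreases toward the pole, roughly mimicking NPP's latitude dependence.
lats = np.arange(0.25, 90, 0.5)
values = np.cos(np.radians(lats))  # high near equator, low near pole

weights = cell_area_weights(lats, 0.5, 0.5)
unweighted = values.mean()
weighted = (values * weights).sum() / weights.sum()

# The unweighted mean overcounts the small high-latitude cells, so the
# area-weighted mean comes out larger for this field.
print(unweighted, weighted)
```

For a field that instead *increases* toward the pole, the bias flips sign, which is why the effect depends so strongly on the dataset and region.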
We need to repeat this calculation with the new TiTiler implementation, either running locally or deployed.
If you have any comments on this comparison or want to further iterate on it, please go ahead - contributions welcome!
@j08lue
I've replaced the titiler endpoint with a direct rio-tiler call:
```python
from rio_tiler.io import Reader


def generate_stats(item, geojson, asset_name):
    with Reader(item.assets[asset_name].href) as src:
        feature = geojson["features"][0]
        data = src.feature(feature, dst_crs=src.crs)
        # Get the coverage % array
        coverage_array = data.get_coverage_array(feature)
        stats = data.statistics(coverage=coverage_array)
        return {
            **stats["b1"].model_dump(exclude_none=True),
            "start_datetime": item.properties["start_datetime"],
        }
```
but I'm getting the same results as with titiler. Looking at the data and the AOI, it seems they are aligned:
```python
url = "s3://ghgc-data-store/casagfed-carbonflux-monthgrid-v3/GEOSCarb_CASAGFED3v3_NPP_Flux_Monthly_x720_y360_201712.tif"
with Reader(url) as src:
    print(src.bounds)
    print(src.dataset.transform)
```

```
(-180.0, -90.0, 180.0, 90.0)
| 0.50, 0.00,-180.00|
| 0.00,-0.50, 90.00|
| 0.00, 0.00, 1.00|
```
Because the AOI is perfectly aligned with the pixels, there is no reason for the weights to be anything other than 1, which explains why I don't see a difference between weighted and unweighted statistics.
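To illustrate why aligned edges give all-ones coverage: the coverage array holds, per pixel, the fraction of that pixel lying inside the geometry. The toy supersampling estimator below is my own illustration of the concept, not rio-tiler's implementation:

```python
import numpy as np

def coverage_fraction(pixel_bounds, aoi_bounds, n=100):
    """Fraction of a pixel covered by an axis-aligned AOI rectangle,
    estimated by supersampling the pixel on an n x n grid of points.

    pixel_bounds / aoi_bounds: (xmin, ymin, xmax, ymax)
    """
    px0, py0, px1, py1 = pixel_bounds
    ax0, ay0, ax1, ay1 = aoi_bounds
    # sample at sub-cell centers
    xs = np.linspace(px0, px1, n, endpoint=False) + (px1 - px0) / (2 * n)
    ys = np.linspace(py0, py1, n, endpoint=False) + (py1 - py0) / (2 * n)
    X, Y = np.meshgrid(xs, ys)
    inside = (X >= ax0) & (X <= ax1) & (Y >= ay0) & (Y <= ay1)
    return inside.mean()

pixel = (0.0, 0.0, 0.5, 0.5)           # one 0.5-degree pixel
aligned_aoi = (0.0, 0.0, 10.0, 10.0)   # AOI edge sits on the pixel edge
shifted_aoi = (0.25, 0.0, 10.0, 10.0)  # AOI edge cuts the pixel in half

print(coverage_fraction(pixel, aligned_aoi))  # 1.0
print(coverage_fraction(pixel, shifted_aoi))  # ~0.5
```

With a 0.5-degree grid and an AOI snapped to whole or half degrees, every pixel is either fully in or fully out, so coverage weighting changes nothing.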
😓 Looking at the notebook, I think there is a misunderstanding about what we are trying to achieve here! In TiTiler/rio-tiler we added a method that takes sub-pixel shape intersection into account in order to weight the statistics by pixel coverage. In the notebook you're using cell areas created from an external dataset (area.nc) to apply weights:
```python
with xr.open_dataset("area.nc") as ds:
    area_cdo = ds["cell_area"].values

# ...

data = xds_clip["band_data"].isel(band=0).to_masked_array()
weights = xds_clip["area"].to_masked_array()
weights.mask = data.mask
timeseries["average_weighted"].append((data * weights).sum() / weights.sum())
```
Those are two different things, and applying weights from another dataset is not possible via titiler/rio-tiler directly!
Thanks for checking! The external weights (CDO-generated area.nc) were just for comparison/validation. We very much expect the API to take care of the weighting (or reprojection?).
Had a clarifying call. The conclusion was that I'll try out the rio-tiler snippet above and make use of the dst_crs parameter to reproject to an equal-area projection before taking the statistics. If it works, we can expose that parameter in TiTiler, where it currently is not exposed.
Furthermore, we discussed that climate data (mostly NetCDF or Zarr in lat/lon projection) specific methods might be better served by a package that is optimized for that kind of data, possibly titiler-xarray or a future evolution of a Xarray-serving service.
I did some testing with dst_crs and a few different equal-area projections; not successful yet, unfortunately:
Updated notebook: https://gist.github.com/j08lue/ac09ed69e5dd58f4058466ec3b108b99
The very different behavior of different equal-area projections makes me think that the method of reprojecting the image and geometry to some different projection is perhaps not safe enough. Not a surprise - there probably does not exist one projection that works for all regions and sizes.
Since this method is most important for climate datasets (mostly on regular geographic coordinate grids), maybe we should focus on the approach to implement this for that specific case only.
Ok, turns out the rio-tiler-based function works well when I pick a good projection. Equal Area Cylindrical ("+proj=cea") was the only equal-area projection I found in PROJ or EPSG that gave the expected results.
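One way to see why +proj=cea behaves well here: on a sphere, the projection is simply x = R·λ, y = R·sin(φ), so planar cell areas equal the true spherical cell areas exactly, and equal pixel counts in the reprojected image mean equal surface area. A quick self-contained check (spherical Earth assumed; PROJ's actual implementation also handles the ellipsoidal case):

```python
import numpy as np

R = 6371000.0  # spherical Earth radius in meters

def cea_forward(lon_deg, lat_deg):
    """Lambert cylindrical equal-area ('+proj=cea') forward projection
    on a sphere: x = R * lon_radians, y = R * sin(lat)."""
    return R * np.radians(lon_deg), R * np.sin(np.radians(lat_deg))

def projected_cell_area(lon0, lat0, dlon, dlat):
    """Planar area of a lat/lon cell after the cea projection."""
    x0, y0 = cea_forward(lon0, lat0)
    x1, y1 = cea_forward(lon0 + dlon, lat0 + dlat)
    return (x1 - x0) * (y1 - y0)

def spherical_cell_area(lon0, lat0, dlon, dlat):
    """True area of the same cell on the sphere."""
    return R**2 * np.radians(dlon) * (
        np.sin(np.radians(lat0 + dlat)) - np.sin(np.radians(lat0))
    )

# A 0.5-degree cell at the equator vs one at 60N: the projected planar
# area matches the true spherical area in both cases.
for lat in (0.0, 60.0):
    print(projected_cell_area(10.0, lat, 0.5, 0.5),
          spherical_cell_area(10.0, lat, 0.5, 0.5))
```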
I take the lat/lon grid-cell-area weighted average (with weights from CDO or an equivalent function) as my ground truth / benchmark. For a dataset with great variation over latitude, like global net primary production (NPP), the methods compare like this:
Here, average_titiler is the plain /cog/statistics endpoint; it underestimates NPP because low production at high latitudes is falsely weighted the same as higher production at lower latitudes.
Both the rio-tiler method above and a rasterio function I coded up that uses WarpedVRT in a few lines give the expected results.
@vincentsarago, I need to use dst_crs=rasterio.crs.CRS.from_proj4("+proj=cea") in the VEDA TiTiler /cog/statistics in one way or another, so the application can calculate accurate statistics.
How do we do this best? Expose dst_crs with some heuristics to figure out whether the user passed a PROJ string, WKT, or EPSG code? 🤷 Or add an area_weighted=True flag? Anything is fine. I saw that nearest neighbor is hard-coded as the resampling method for reprojecting to dst_crs. That may not be ideal, but it was fine for my tests with large areas so far.
I'll do some more testing with other source image reference systems than geographic coordinates and smaller geometries, to be on the safe side, but let's move forward making a plan for the implementation in VEDA and (most importantly) GHG Center backend, please.
@j08lue I might have an idea!
We could have a custom algorithm applied to the image before calculating the statistics. This will only be available for the /statistics [POST] endpoint. Basically, you would tell the tiler to apply algorithm=area_weight (naming TBD). The area_weight algorithm will take most of the code from your notebook and will have to be registered in the algorithm dependency within the application (just like the colormap we just added).
This will be possible with next version of titiler https://github.com/developmentseed/titiler/pull/726 (but compatible with the next veda-backend release)
Algorithm sketch:
```python
import numpy

from rio_tiler.models import ImageData
from rio_tiler.constants import WGS84_CRS
from titiler.core.algorithm.base import BaseAlgorithm


def surfaceAreaGrid(lat_specs, lon_specs):
    """Construct a 2D (nlat, nlon) array where each element is the surface
    area of the corresponding grid cell in square meters.

    lat_specs is a tuple: (starting latitude, ending latitude, latitude divisions)
    lon_specs is a tuple: (starting longitude, ending longitude, longitude divisions)

    Example: surfaceAreaGrid((-30, 60, 45), (20, 80, 120)) returns a (45, 120)
    array with the surface areas of the (2 degree lat x 0.5 degree lon) cells
    in square meters.
    """
    lat_beg, lat_end, lat_div = lat_specs
    lon_beg, lon_end, lon_div = lon_specs

    R_e = 6371000.0  # radius of Earth in meters
    dLon = (numpy.pi / 180.0) * (lon_end - lon_beg) / lon_div

    # R^2 * dLon * sin(lat) evaluated at the latitude edges; the difference
    # between consecutive edges is the cell area for that latitude band.
    dS = numpy.zeros((lat_div + 1, lon_div), numpy.float64)
    lats = (numpy.pi / 180.0) * numpy.linspace(lat_beg, lat_end, lat_div + 1)
    for i, lat in enumerate(lats):
        dS[i] = R_e * R_e * dLon * numpy.sin(lat)

    return numpy.diff(dS, axis=0)


class WeightedArea(BaseAlgorithm):
    """Apply Area Weight."""

    # metadata
    input_nbands: int = 1
    output_nbands: int = 1

    def __call__(self, img: ImageData) -> ImageData:
        """apply weight."""
        assert img.crs == WGS84_CRS, "'WeightedArea' algorithm only works with datasets in EPSG:4326 projection"

        # Compute the weight array from img.bounds and shape
        weights = surfaceAreaGrid(
            (img.bounds[1], img.bounds[3], img.height),
            (img.bounds[0], img.bounds[2], img.width),
        )
        # the grid runs south-to-north; image rows run north-to-south
        weights = numpy.flipud(weights)

        # Apply the area weight to valid pixels only (how the weighted
        # values feed into the statistics is still to be worked out)
        arr = numpy.where(img.mask, img.array * weights, img.array)

        return ImageData(
            arr,
            assets=img.assets,
            crs=img.crs,
            bounds=img.bounds,
            band_names=img.band_names,
        )
```
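As a sanity check on the area formula in the sketch: summing the cell areas over the whole globe must give the sphere's surface area, 4πR². A vectorized restatement (mine, for verification only, not part of the proposal):

```python
import numpy as np

def surface_area_grid(lat_specs, lon_specs, radius=6371000.0):
    """(nlat, nlon) array of spherical grid-cell areas in m^2;
    re-stated here so it can be checked stand-alone."""
    lat_beg, lat_end, lat_div = lat_specs
    lon_beg, lon_end, lon_div = lon_specs
    dlon = np.radians(lon_end - lon_beg) / lon_div
    edges = np.radians(np.linspace(lat_beg, lat_end, lat_div + 1))
    # Band area between consecutive latitude edges: R^2 * dlon * d(sin(lat))
    band = radius**2 * dlon * np.diff(np.sin(edges))
    return band[:, None] * np.ones((1, lon_div))

# 0.5-degree global grid: the cell areas must sum to 4 * pi * R^2
areas = surface_area_grid((-90, 90, 360), (-180, 180, 720))
total = areas.sum()
print(total, 4 * np.pi * 6371000.0**2)
```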
Oooh, I see dst_crs is already available on POST to /cog/statistics on the updated TiTiler endpoint https://test-raster.delta-backend.com/docs#/Cloud%20Optimized%20GeoTIFF/geojson_statistics_cog_statistics_post — I'll test that right away.
@j08lue you jinxed me! We commented two solutions at the same time!
> I need to use this dst_crs=rasterio.crs.CRS.from_proj4("+proj=cea") in the VEDA TiTiler /cog/statistics in one way or the other, so the application can calculate accurate statistics.
You should be able to pass dst_crs=+proj=cea (you'll need to URI-encode the proj string to %2Bproj%3Dcea).
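Since + and = are reserved characters in URL query strings, the PROJ string has to be percent-encoded; Python's standard library does this for you:

```python
from urllib.parse import quote, urlencode

# "+" and "=" are special in query strings, so the PROJ string must be
# percent-encoded before it goes into the URL.
dst_crs = "+proj=cea"
print(quote(dst_crs, safe=""))  # %2Bproj%3Dcea

# Query-string helpers (urlencode, or requests' params= argument)
# apply the encoding automatically:
print(urlencode({"dst_crs": dst_crs}))  # dst_crs=%2Bproj%3Dcea
```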
How do we do this best? Expose dst_crs with some heuristics to find out whether the user passed a PROJ, WKT, or EPSG code?
dst_crs should be exposed in the next release of veda-backend:

- in titiler-pgstac: https://github.com/stac-utils/titiler-pgstac/blob/0.8.0/titiler/pgstac/factory.py#L1053 (0.8.0)
- in titiler: https://github.com/developmentseed/titiler/blob/0.15.3/src/titiler/core/titiler/core/factory.py#L446 (0.15.3)

(Note to self: I need to update the veda-backend requirements for titiler)
> Or adding an area_weighted=True flag?
adding a new flag is a bit more complex!
> I saw that nearest neighbor is hard-coded as the method for reprojecting to dst_crs. That may not be ideal, but it was fine for my tests with large areas so far.
Starting with titiler 0.15.2, there are now 2 resampling options (both defaulting to nearest). Ref: https://github.com/developmentseed/titiler/blob/0.15.3/CHANGES.md#0152-2023-10-23
> I'll do some more testing with other source image reference systems than geographic coordinates and smaller geometries, to be on the safe side, but let's move forward making a plan for the implementation in VEDA and (most importantly) GHG Center backend, please.
If we are happy with using dst_crs=+proj=cea, then it should already be available in the next release of veda-backend 🙏
This is awesome news, @vincentsarago!
I'll do some more testing with our main use case, datasets with WGS84 coordinates. Scientific accuracy is key. If I find that the dst_crs method is not reliable enough, we may need to consider the custom algorithm option you suggested. But for now I expect that we are very close to having all we need.
Hmm, I'm getting a 500 error with a POST request to
https://test-raster.delta-backend.com/cog/statistics?url=s3%3A%2F%2Fveda-data-store-staging%2Fno2-monthly%2FOMI_trno2_0.10x0.10_202109_Col3_V4.nc.tif&dst_crs=%2Bproj%3Dcea
while the request without dst_crs works. Simple example reproducing the issue:
```python
import requests

RASTER_API_URL = "https://test-raster.delta-backend.com"
ITEM_ASSET_URL = "s3://veda-data-store-staging/no2-monthly/OMI_trno2_0.10x0.10_202109_Col3_V4.nc.tif"

AOI = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [[-115, 82], [-115, -82], [-43, -82], [-43, 82], [-115, 82]]
                ],
            },
        }
    ],
}

result = requests.post(
    f"{RASTER_API_URL}/cog/statistics",
    params={"url": ITEM_ASSET_URL, "dst_crs": "+proj=cea"},
    json=AOI,
)
result.raise_for_status()
result.json()
```
Do you have access to the function logs to see what is going wrong, @vincentsarago?
Note: dst_crs is not available for /statistics [GET], only for the [POST] geojson endpoint.
There might be a bug in rio-tiler or titiler
```
[ERROR] 2023-11-10T10:51:37.886Z 49d3191d-81ad-4546-801f-0ad3e7855b5d An error occurred running the application.
Traceback (most recent call last):
  File "/tmp/pip-target-f75zho4z/lib/python/mangum/protocols/http.py", line 58, in run
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/applications.py", line 1106, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/applications.py", line 122, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/errors.py", line 184, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/errors.py", line 162, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/titiler/core/middleware.py", line 63, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette_cramjam/middleware.py", line 112, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette_cramjam/middleware.py", line 142, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/cors.py", line 83, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/exceptions.py", line 79, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/exceptions.py", line 68, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/middleware/asyncexitstack.py", line 20, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/middleware/asyncexitstack.py", line 17, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 718, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 276, in handle
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 66, in app
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/routing.py", line 274, in app
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/routing.py", line 193, in run_endpoint_function
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/concurrency.py", line 41, in run_in_threadpool
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/to_thread.py", line 33, in run_sync
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/_backends/_asyncio.py", line 877, in run_sync
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/_backends/_asyncio.py", line 807, in run
  File "/tmp/pip-target-f75zho4z/lib/python/titiler/core/factory.py", line 482, in geojson_statistics
  File "/tmp/pip-target-f75zho4z/lib/python/rio_tiler/models.py", line 826, in get_coverage_array
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/env.py", line 398, in wrapper
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/env.py", line 630, in wrapper
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/warp.py", line 104, in transform_geom
  File "rasterio/_warp.pyx", line 114, in rasterio._warp._transform_geom
  File "rasterio/_warp.pyx", line 71, in rasterio._warp._transform_single_geom
  File "rasterio/_features.pyx", line 693, in rasterio._features.OGRGeomBuilder.build
ValueError: Unsupported geometry type Feature
```
☝️ this is a log from titiler.xyz (devseed aws)
Fix: https://github.com/cogeotiff/rio-tiler/pull/653. Sadly this will require a new deployment to catch the newest rio-tiler version.
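For context, the traceback ends in rasterio's OGR geometry builder receiving a whole GeoJSON Feature where a bare geometry was expected. A hypothetical unwrapping helper (illustrative only; the real fix lives in the rio-tiler PR above) shows the distinction:

```python
def as_geometry(geojson: dict) -> dict:
    """Unwrap GeoJSON Feature/FeatureCollection objects down to the bare
    geometry dict, which is what OGR geometry builders accept.

    Hypothetical helper illustrating the failure mode seen in the log.
    """
    if geojson.get("type") == "FeatureCollection":
        geojson = geojson["features"][0]
    if geojson.get("type") == "Feature":
        geojson = geojson["geometry"]
    return geojson

feature = {
    "type": "Feature",
    "properties": {},
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
}
print(as_geometry(feature)["type"])  # Point
```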
I can confirm that with this bug fix (testing against dev.ghg.center/api/raster), the method of reprojecting the data before extracting statistics gives the same average as calculating weights by hand with the geographic-coordinates formula.
I will prepare the notebooks for our (GHG Center and VEDA) docs, so we have somewhere to point to for others to validate our methods.
To complete our goal of providing grid-cell area weighted zonal statistics, we want to make sure that our zonal statistics calculations are accurate. For that, we need to quantify