OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org
Other
4.92k stars 2.56k forks source link

GTI stac-geoparquet driver support for multiple band assets #11321

Open rbavery opened 22 hours ago

rbavery commented 22 hours ago

Feature description

I'd like to be able to load STAC items that have bands referenced as distinct geotiffs. The current GTI docs indicate it is limited to using a location field that points to one href per STAC Item.

An example I am working with is here: https://github.com/stac-utils/stac-rs/issues/528#issuecomment-2489865938

install

conda create -n test python=3.11 rasterio==1.4.2 gdal==3.10 libgdal==3.10 rioxarray libgdal-arrow-parquet 
conda activate test
# you may have to run `rustup default 1.80` to pip install stacrs
pip install stacrs

setting up the query

from datetime import datetime, timedelta
import stacrs
def convert_to_iso_datetime_range(simple_date_range):
    start_date_str, end_date_str = simple_date_range.split('/')
    start_date = datetime.strptime(start_date_str, "%Y-%m-%d")
    end_date = datetime.strptime(end_date_str, "%Y-%m-%d")
    start_date_iso = start_date.strftime("%Y-%m-%dT%H:%M:%S.%fZ")[:-3] + "Z"
    end_date_iso = (end_date - timedelta(seconds=1)).strftime("%Y-%m-%dT%H:%M:%S.%fZ")[:-3] + "Z"
    return f"{start_date_iso}/{end_date_iso}"

max_cloud_cover = 15
max_nodata_percent = 15
query_args = {
            "eo:cloud_cover": {"lt": max_cloud_cover},
            "s2:nodata_pixel_percentage": {"lt": max_nodata_percent},
        }
simple_date_range = "2023-01-01/2024-01-01"
iso_datetime_range = convert_to_iso_datetime_range(simple_date_range)
collection_id = "sentinel-2-c1-l2a"
catalog = "https://earth-search.aws.element84.com/v1"
lon, lat = -111.760826, 34.871002

# stack args
assets=["rededge1", "rededge2", "rededge3", "nir", "swir16", "swir22"]
rescale=False
xy_coords="center"
stack_id = "grid:code"
stack_dim = "time"

stacrs.search_to(
    "items.parquet",
    catalog,
    collections=collection_id,
    intersects={"type": "Point", "coordinates": [lon, lat]},
    sortby="-properties.datetime",
    max_items=20,
    datetime=iso_datetime_range,
    query=query_args
)

workaround to access the blue band and set it's href as the location field to confirm GTI works

import geopandas as gpd
df = gpd.read_parquet("items.parquet")
df['location'] = df.assets.apply(lambda x : x['blue']['href'])
df.to_parquet("items_with_location.parquet")
ds = xarray.open_dataset("GTI:/home/rave/dask_on_ray_poc/items_with_location.parquet", engine = "rasterio")
ds.values

Ideally the GTI driver would interpret the assets section of the stac-geoparquet and load all the bands, if the assets only referenced bands. But this isn't the case

df.assets[0].keys()
dict_keys(['swir16', 'tileinfo_metadata', 'scl', 'thumbnail', 'nir08', 'swir22', 'product_metadata', 'preview', 'nir09', 'nir', 'rededge2', 'rededge3', 'rededge1', 'aot', 'red', 'wvp', 'coastal', 'blue', 'cloud', 'visual', 'snow', 'granule_metadata', 'green'])

I'm wondering if GTI could look for a different column that points to a list of asset keys that refer to bands that also specifies band order for assembling a multi band mosaic. then it could handle fetching the hrefs from the stac-geoparquet assets section.

Additional context

This would be a nice quality of life feature for various users, top of mind for me is the TorchGeo project which is exploring how to easily make stac queries, persist them, and create mosaics of large raster datasets for model training.

I can help by finding someone to PR if I'm not able to do this work myself and if maintainers don't want to take this on but think it would still be valuable.

rouault commented 21 hours ago

Attaching the related items.parquet (as it takes quite a bit to have the setup to generate it): items.parquet.zip

This ticket intersects the previous one of #11317 , so with #11319 , now it is possible to specify the LOCATION_FIELD open option to get one particular asset/band:

$ GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR gdalinfo GTI:items.parquet -oo LOCATION_FIELD=assets.rededge3.href
Driver: GTI/GDAL Raster Tile Index
Files: none associated
Size is 6018, 4962
Coordinate System is:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        MEMBER["World Geodetic System 1984 (G2296)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Origin = (-112.099469246340178,35.243022132412960)
Pixel Size = (0.000200522510948,-0.000200522510948)
Corner Coordinates:
Upper Left  (-112.0994692,  35.2430221) (112d 5'58.09"W, 35d14'34.88"N)
Lower Left  (-112.0994692,  34.2480294) (112d 5'58.09"W, 34d14'52.91"N)
Upper Right (-110.8927248,  35.2430221) (110d53'33.81"W, 35d14'34.88"N)
Lower Right (-110.8927248,  34.2480294) (110d53'33.81"W, 34d14'52.91"N)
Center      (-111.4960970,  34.7455258) (111d29'45.95"W, 34d44'43.89"N)
Band 1 Block=256x256 Type=UInt16, ColorInterp=Gray
  NoData Value=0

So with that and classic VRT, an admitedly not so ergonomic solution would be to create a VRT per band, and assemble all those VRT, like:

GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR gdal_translate  GTI:items.parquet -oo LOCATION_FIELD=assets.rededge1.href rededge1.vrt
GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR gdal_translate  GTI:items.parquet -oo LOCATION_FIELD=assets.rededge3.href rededge3.vrt
GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR gdalbuildvrt -separate out.vrt rededge1.vrt rededge3.vrt

It would be indeed better to be able to specify maybe -oo LOCATION_FIELD=assets.*.href to ask to create automatically a multi-band GTI from the assets.

That said that would be tricky here because Sentinel 2 bands don't have all the same resolution, whereas in a GDAL dataset all bands must have the same resolution. So probably that internally the GTI driver would need to do the above workaround and align on the 10m resolution, or do like then Sentinel2 driver and expose one subdataset for the 10m bands, one for the 20 m ones and one for the 60m ones

rbavery commented 4 hours ago

So probably that internally the GTI driver would need to do the above workaround and align on the 10m resolution, or do like then Sentinel2 driver and expose one subdataset for the 10m bands, one for the 20 m ones and one for the 60m ones

Thanks for the work on this! The second option sounds interesting. I'd like it if there was some limited set of sensible choices for resolution for a given STAC collection that GTI could derive.

My initial expectation was that resolution would be a user specified arg but that's only because I've used the opendatacube odc.stac.load function before

https://odc-stac.readthedocs.io/en/latest/_api/odc.stac.load.html#odc.stac.load

for example

ds = odc.stac.load(
    stac_items_list,
    resolution=20
)
mdsumner commented 1 hour ago

odc exposes a limited set of target output grid params, and heuristics to pick a UTM zone if not specified - GTI's virtualness is offering a slot to add this explicitly, but uses similar heuristics to make a choice. It's a kind of slippery user/ux area where you want maximum flexibility (specifying one thing, or the entire grid details, or anything in between including no-user input).

There's all kinds of things these heuristics+rules need, like pixel alignment, and resolution-choice (good for when you know), dimension-choice (good for when you want a region for a quick look at n-pixels) - and there's no single framework to guide the user well here, and I don't this this is well understood (entirely off topic but just wanted to add a bit here).