Add more sample data to eoAPI

j08lue commented 3 weeks ago

For testing the coverages API etc, we could use some more sample data in our eoAPI dev catalog.

A good example would be multi-spectral data, ideally Sentinel-2 L2A.

Since the EOEPCA+ Kubernetes cluster is deployed in CreoDIAS, perhaps we could even load assets directly from the Sentinel-2 collection in CreoDIAS' S3?

They have a STAC catalog: https://pgstac.demo.cloudferro.com/collections/sentinel-2-l2a/items

It keeps the cloud-loadable asset hrefs under "alternate assets", which TiTiler-PgSTAC currently does not support.

https://github.com/stac-utils/titiler-pgstac/discussions/181

But we could probably work around that - we will probably want to copy the STAC items to our catalog anyway.

Either way, a small subset of Sentinel-2 L2A scenes would be great to include. It can be regionally and temporally limited, perhaps even just 2x2 MGRS tiles for a year or so.

Acceptance criteria

- [ ] EOEPCA eoAPI STAC has a collection of Sentinel-2 L2A scenes that we can use to test / demo TiTiler and Stacture with

j08lue commented 1 week ago

The CDSE Sentinel-2 L2A data are in JPEG2000. Performance / efficiency of reading overviews from those is not great, as @vincentsarago documented here.

What alternatives do we have? Looking at the CloudFerro STAC https://radiantearth.github.io/stac-browser/#/external/https://pgstac.demo.cloudferro.com - the only collection with COGs seems to be Sentinel-1 Ground Range Detected (GRD).

It uses the alternate assets extension for S3 links, too.

j08lue commented 1 week ago

We may need to add alternate assets support to eoapi-k8s:

https://github.com/developmentseed/eoapi-k8s/issues/137

As discussed here:

https://github.com/stac-utils/titiler-pgstac/discussions/181

jonas-eberle commented 1 week ago

> Ok, the CDSE Sentinel-2 L2A data are in JPEG2000. Performance / efficiency of reading overviews from those is not great, as @vincentsarago documented here.

This blog post is quite old. Are the results still valid? The visualization services from Sinergise and Copernicus Dataspace Ecosystem use the JPEG2000 format.

> What alternatives do we have? Looking at the CloudFerro STAC https://radiantearth.github.io/stac-browser/#/external/https://pgstac.demo.cloudferro.com - the only collection with COGs seems to be Sentinel-1 Ground Range Detected (GRD).

Sentinel-1 is not an option as you need to conduct some pre-processing steps.

vincentsarago commented 1 week ago

> This blog post is quite old. Are the results still valid? The visualization services from Sinergise and Copernicus Dataspace Ecosystem use the JPEG2000 format.

If I remember correctly, Sinergise uses a proprietary driver to read JPEG2000. The GDAL maintainers have made some improvements to GDAL and the OpenJPEG driver, but reading JPEG2000 is still not as efficient as reading COGs.

> Sentinel-1 is not an option as you need to conduct some pre-processing steps.

What kind of pre-processing? You can visualize Sentinel-1 GRD scenes, which are stored as COGs.

j08lue commented 1 week ago

The GRD data CloudFerro hosts is already terrain-corrected, at least. The thumbnails they reference look ok?

But perhaps we could still use the JPEG2000 Sentinel-2 L2A for demo purposes and see how it goes in terms of speed and GET requests to CloudFerro S3.

j08lue commented 1 week ago

Note on CreoDIAS access from TiTiler deployed in EOEPCA k8s: fundamentally, we should have access. Might need to generate some kind of credentials (@rconway knows).

MathewNWSH commented 1 week ago

> Note on CreoDIAS access from TiTiler deployed in EOEPCA k8s: fundamentally, we should have access. Might need to generate some kind of credentials (@rconway knows).

* s3 key gen on CDSE (free with predefined monthly quota limits): https://documentation.dataspace.copernicus.eu/APIs/S3.html
* s3 key extraction on Creodias (data transfer is not limited even on the smallest machines): https://creodias.docs.cloudferro.com/en/latest/eodata/How-to-get-credentials-used-for-accessing-EODATA-on-a-cloud-VM-on-Creodias.html

hope this will help
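
For reference, a minimal sketch of the GDAL configuration options that reading from the CDSE S3 endpoint with these keys would likely need (for example in the TiTiler container environment). The endpoint host is taken from the CDSE documentation linked above; the helper name `cdse_gdal_env` and the `CDSE_S3_*` variable names are hypothetical, and the Creodias-internal endpoint may need different values.

```python
import os


def cdse_gdal_env(access_key: str, secret_key: str) -> dict:
    """Return GDAL config options for a custom S3 endpoint (assumed values)."""
    return {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        # Host only, without scheme; GDAL uses https when AWS_HTTPS=YES.
        "AWS_S3_ENDPOINT": "eodata.dataspace.copernicus.eu",
        "AWS_HTTPS": "YES",
        # Path-style requests (endpoint/bucket/key) instead of bucket.endpoint.
        "AWS_VIRTUAL_HOSTING": "FALSE",
    }


if __name__ == "__main__":
    # CDSE_S3_ACCESS_KEY / CDSE_S3_SECRET_KEY are hypothetical variable names.
    env = cdse_gdal_env(
        os.environ.get("CDSE_S3_ACCESS_KEY", "<access-key>"),
        os.environ.get("CDSE_S3_SECRET_KEY", "<secret-key>"),
    )
    for key, value in env.items():
        # Mask the secret when printing.
        shown = "***" if "SECRET" in key else value
        print(f"{key}={shown}")
```

These are standard GDAL `/vsis3/` options, so they should work equally as environment variables or inside a `rasterio.Env(**env)` block.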

j08lue commented 2 days ago

Status - we decided to

1. select a nice subset of the Sentinel-2 L2A collection in CDSA STAC
2. download the STAC metadata (collection + items) for that subset
3. move alternate assets s3 links to main href and (cleanup) remove alternate assets references
4. load the data into eoAPI EOEPCA+ with assistance from @ranchodeluxe
5. find out whether our deployed app in EOEPCA k8s has direct access to the data in CDSE/CloudFerro - we can test that up-front, too, if we can get a /cog endpoint on eoAPI? https://eoapi.develop.eoepca.org/raster/api.html

j08lue commented 2 days ago

Regarding which subset, anything easy enough to handle is fine.

* 2024 to date, all of Europe?
* past 2 years, whatever country/region
* 2023 Iceland, perhaps filtered by cloud cover <10% or so? 🤷 🇮🇸 🌋 🧊 https://www.copernicus.eu/en/media/image-day-gallery/new-eruptive-phase-icelands-fagradalsfjall-volcano

Btw, the Planetary Computer Explorer can generate nice code snippets for querying.

Python snippet

```python
from pystac_client import Client

# Search against the Planetary Computer STAC API
catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)

# Define your area of interest
aoi = {
    "type": "Polygon",
    "coordinates": [
        [
            [-25.912381925618888, 63.225703118874776],
            [-12.293771949697515, 63.225703118874776],
            [-12.293771949697515, 66.68888112373057],
            [-25.912381925618888, 66.68888112373057],
            [-25.912381925618888, 63.225703118874776]
        ]
    ]
}

# Define your temporal range
daterange = {"interval": ["2023-01-01T00:00:00Z", "2023-12-30T23:59:59Z"]}

# Define your search with CQL2 syntax
search = catalog.search(filter_lang="cql2-json", filter={
    "op": "and",
    "args": [
        {"op": "s_intersects", "args": [{"property": "geometry"}, aoi]},
        {"op": "anyinteracts", "args": [{"property": "datetime"}, daterange]},
        {"op": "=", "args": [{"property": "collection"}, "sentinel-2-l2a"]},
        {"op": "<=", "args": [{"property": "eo:cloud_cover"}, 20]}
    ]
})

# Grab the first item from the search results
# (items() replaces the deprecated get_items() in current pystac-client)
first_item = next(search.items())
```

j08lue commented 2 days ago

Btw, the collection metadata also needs a bit of cleanup after removing the alternate assets: item_assets and auth:schemes and perhaps stac_extensions.
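
A rough sketch of what that collection cleanup could look like on the plain JSON dict. The function name `clean_collection` and the `keep_auth_schemes` parameter are hypothetical, and which auth schemes to keep is an assumption that depends on how the item assets end up being rewritten.

```python
import copy
from typing import Iterable, Optional


def clean_collection(
    collection: dict, keep_auth_schemes: Optional[Iterable[str]] = None
) -> dict:
    """Return a copy of a STAC collection without alternate-assets references."""
    out = copy.deepcopy(collection)

    # Drop the alternate-assets extension URL, keep all other extensions.
    if "stac_extensions" in out:
        out["stac_extensions"] = [
            url for url in out["stac_extensions"] if "alternate-assets" not in url
        ]

    # Remove the "alternate" entry from every asset template.
    for template in out.get("item_assets", {}).values():
        template.pop("alternate", None)

    # Optionally keep only the auth schemes that the rewritten hrefs still use.
    if keep_auth_schemes is not None and "auth:schemes" in out:
        keep = set(keep_auth_schemes)
        out["auth:schemes"] = {
            name: scheme
            for name, scheme in out["auth:schemes"].items()
            if name in keep
        }
    return out
```

Calling `clean_collection(collection, keep_auth_schemes={"s3"})` would keep only a scheme named `s3`, assuming that is the name used in the source collection.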

ciaransweet commented 2 days ago

So once a subset is decided:

1. Grab the collection + items
2. For every asset in an item, replace its href with the alternate:s3 href, swap its auth for the alternates
   * Not sure where "description": "S3 storage provided by CloudFerro Cloud and OpenTelekom Cloud (OTC). Use endpoint URL https://eodata.dataspace.copernicus.eu.", would live on the item? It appears to be a property on the alternate entry but I couldn't see where it would live on the asset after?
3. Do the collection changes @j08lue just posted
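
The per-item step could be sketched roughly like this. It assumes each asset carries its alternates as `asset["alternate"]["s3"]` with an `href` and optionally `auth:refs`, which is how the thread describes the items; the function name `promote_alternate` is hypothetical.

```python
import copy


def promote_alternate(item: dict, key: str = "s3") -> dict:
    """Return a copy of a STAC item whose asset hrefs point at one alternate."""
    out = copy.deepcopy(item)
    for asset in out.get("assets", {}).values():
        # Remove the alternates in any case (cleanup step).
        alternates = asset.pop("alternate", {})
        alt = alternates.get(key)
        if alt is None:
            continue  # No such alternate: keep the original href and auth.

        asset["href"] = alt["href"]
        # Swap the auth reference for the alternate's; if the alternate has
        # none, the old reference no longer applies to the new href.
        if "auth:refs" in alt:
            asset["auth:refs"] = alt["auth:refs"]
        else:
            asset.pop("auth:refs", None)
    return out
```

This drops the alternate's description along with the rest of the alternate entry, so the open question above about where it should live is not resolved here.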

MathewNWSH commented 2 days ago

> Regarding which subset, anything easy enough to handle is fine.
>
> * 2024 to date, all of Europe?
> * past 3 years, whatever country/region - Iceland? 🤷 🇮🇸 🌋 🧊

Since pgstac.demo.cloudferro.com is still in development and the S2L2A collection consists of only a few items, here is a JSON file containing 9,352 products that intersect Poland's geometry, with content_start_date >= '2023-01-01 00:00:00' and content_start_date < '2024-01-01 00:00:00'. One product is missing: /eodata/Sentinel-2/MSI/L2A/2023/03/29/S2B_MSIL2A_20230329T100629_N0509_R022_T33UWV_20230329T130657.SAFE

file size -> 656.5 MB

Hope it will help: https://s3.fra1-2.cloudferro.com/swift/v1/poland-stac/poland-data.json

j08lue commented 2 days ago

Ah, was not aware, thank you!

Poland 2023 works, let's see whether we need to subset further if 9k is too much for a quick fix. 🇵🇱
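
If we do need to subset further, one cheap option is to filter by MGRS tile, which is encoded in the product ID (e.g. `T33UWV`). A minimal sketch, assuming each item's `id` follows the standard Sentinel-2 product naming; the helper names are hypothetical.

```python
import re
from typing import Iterable, List, Optional

# Sentinel-2 product IDs carry the MGRS tile as "_T" plus 5 characters,
# e.g. "..._R022_T33UWV_20230329T130657".
TILE_PATTERN = re.compile(r"_T(\d{2}[A-Z]{3})_")


def mgrs_tile(item_id: str) -> Optional[str]:
    """Extract the MGRS tile (without the leading "T") from a product ID."""
    match = TILE_PATTERN.search(item_id)
    return match.group(1) if match else None


def filter_by_tiles(items: Iterable[dict], tiles: Iterable[str]) -> List[dict]:
    """Keep only the items whose ID refers to one of the given MGRS tiles."""
    wanted = set(tiles)
    return [item for item in items if mgrs_tile(item["id"]) in wanted]
```

For example, `filter_by_tiles(items, ["33UWV", "34UCC"])` would keep only scenes from those two tiles.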

ciaransweet commented 2 days ago

@MathewNWSH are you able to export this as a list of items in JSON?

I can't load it with json.load() right now..

jonas-eberle commented 2 days ago

@ciaransweet

Each line consists of a STAC JSON item. The following works for me to print the first STAC item: `head -n1 poland-data.json | jq . | less`

ciaransweet commented 2 days ago

> @ciaransweet
>
> Each line consists of a STAC JSON item. The following works for me to print the first STAC item: `head -n1 poland-data.json | jq . | less`

Sure, it would be nice if it was wrapped into a json array to be a bit more 'valid' to read in :D

I'll process line by line for now.

jonas-eberle commented 2 days ago

Understood. The format (ndjson) is used by pypgstac as well. I guess this is why @MathewNWSH has made this format available.
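
For anyone else reading the file with plain Python: since it is newline-delimited JSON, a small generator is enough. This is a sketch; the function name `iter_ndjson` is hypothetical.

```python
import json
from typing import IO, Iterator


def iter_ndjson(stream: IO[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # Tolerate blank lines, e.g. a trailing newline.
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err


# Usage (file name as shared in this thread):
# with open("poland-data.json", encoding="utf-8") as f:
#     for item in iter_ndjson(f):
#         print(item["id"])
```

Because it yields items one at a time, the 656.5 MB file never has to be held in memory as a whole.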

ciaransweet commented 2 days ago

> Understood. The format (ndjson) is used by pypgstac as well. I guess this is why @MathewNWSH has made this format available.

Cool thanks, good to know!

MathewNWSH commented 2 days ago

> @MathewNWSH are you able to export this as a list of items in JSON?
>
> I can't load it with json.load() right now..

Yup, I've just loaded it into the pgstac instance using: `pypgstac load items https://s3.fra1-2.cloudferro.com/swift/v1/poland-stac/poland-data.json`

If you prefer, you can get it as a list of items using https://pgstac.demo.cloudferro.com/collections/sentinel-2-l2a/items?limit=1000, then move to the next page using https://pgstac.demo.cloudferro.com/collections/sentinel-2-l2a/items?limit=1000&token=next:sentinel-2-l2a:S2A_MSIL2A_20231123T095321_N0509_R079_T34UCC_20231123T141252, and so on.
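
Paging through the endpoint can be automated by following the `next` link that a STAC API includes in each item page. A minimal sketch using only the standard library; the function names are hypothetical, and the `fetch` parameter exists only so the paging logic can be tried without network access.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional


def fetch_json(url: str) -> dict:
    """GET a URL and parse the response body as JSON."""
    # Generous timeout, since a 1000-item page can take a while to build.
    with urllib.request.urlopen(url, timeout=300) as response:
        return json.load(response)


def iter_items(url: str, fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield all items, following rel="next" links until there are none."""
    next_url: Optional[str] = url
    while next_url:
        page = fetch(next_url)
        yield from page.get("features", [])
        next_url = next(
            (
                link["href"]
                for link in page.get("links", [])
                if link.get("rel") == "next"
            ),
            None,
        )


# Usage:
# start = "https://pgstac.demo.cloudferro.com/collections/sentinel-2-l2a/items?limit=1000"
# for item in iter_items(start):
#     print(item["id"])
```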

MathewNWSH commented 2 days ago

for 1000 items of s2l2a it takes 1.67 min to reply ;/ quite a lot but on the other hand this collection is the richest in metadata among all sentinels / processing levels

ciaransweet commented 2 days ago

No worries :D We should have specified we weren't expecting it for pypgstac and just pystac, but I can work with it knowing it's line delimited, thanks!

jonas-eberle commented 2 days ago

> for 1000 items of s2l2a it takes 1.67 min to reply ;/

That is quite slow. The same query for sentinel-2-l2a took 4 seconds (without caching) on my pgstac-based STAC API with a total of 11 million scenes within this collection.

MathewNWSH commented 2 days ago

> > for 1000 items of s2l2a it takes 1.67 min to reply ;/
>
> That is quite slow. The same query for sentinel-2-l2a took 4 seconds (without caching) on my pgstac-based STAC API with a total of 11 million scenes within this collection.

The demo is based on https://github.com/stac-utils/stac-fastapi-pgstac/blob/main/docker-compose.yml, deployed via `docker compose up`. Soon we will move to a bare-metal server (I'm waiting for the Postgres 17 release).

this is the sample item of S2L2a, as you can see it's quite huge for a single item: S2B_MSIL2A_20240110T100309_N0510_R122_T33UVR_20240110T113053.json

I was blaming the compose deployment and the huge size of an item for the performance problems.

Is there any option to exchange experience from using stac-fastApi-pgstac with you? Or general recommendations while working with pgstac/stac-fastAPI-pgstac?

jonas-eberle commented 2 days ago

> this is the sample item of S2L2a, as you can see it's quite huge for a single item

Yes, this is quite huge. Our STAC item (generated with the stactools-sentinel2 package) is smaller, because we deleted the downsampled assets: https://stac.terrabyte.lrz.de/public/api/collections/sentinel-2-c1-l2a/items/S2B_MSIL2A_20240110T100309_N0510_R122_T33UVR_20240110T113053

> Is there any option to exchange experience from using stac-fastApi-pgstac with you? Or general recommendations while working with pgstac/stac-fastAPI-pgstac?

Sure. We also have an issue about performance improvements and best-practice guidelines within EOEPCA (https://github.com/EOEPCA/resource-discovery/issues/23). I just commented there.

@MathewNWSH Please feel free to contact me via mail as well: jonas.eberle@dlr.de

EOEPCA / data-access

Add more sample data to eoAPI #75
