Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0

load_stac: stac_items #501

Open jdries opened 2 months ago

jdries commented 2 months ago

Proposed Process ID: load_stac
Proposed Parameter Name: stac_items
Optional: yes, default: None

Context

load_stac is very popular for loading user-defined data, but it requires the STAC JSON to be available via an HTTP URL. In many cases, such a URL is not available, and the user thus needs to rely on a third-party service (e.g. GitHub) to upload the STAC JSON. I also see a use case for systems that require signed URLs for data access, where the user first needs to sign the URLs using a secret key.

Description

stac_items, if provided, is an array of valid STAC Item objects. The backend will load all assets in the provided items.

Data Type

array of objects
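
To make the proposal concrete, here is a minimal sketch of what a process graph node could look like if stac_items were accepted. The stac_items argument is the parameter proposed in this issue and is not part of the published load_stac process; the Item content, the example.com URL and the node name load1 are made-up placeholders.

```python
import json

# A deliberately small STAC Item; all values are illustrative placeholders.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "ndvi-example",
    "geometry": None,
    "properties": {"datetime": "2023-06-01T00:00:00Z"},
    "links": [],
    "assets": {
        "ndvi": {
            # Absolute href: an inline Item has no URL of its own to resolve against.
            "href": "https://example.com/data/ndvi.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}

# Hypothetical: "stac_items" is the proposed parameter, not an existing one.
process_graph = {
    "load1": {
        "process_id": "load_stac",
        "arguments": {"stac_items": [item]},
        "result": True,
    }
}

print(json.dumps(process_graph, indent=2))
```

The whole Item travels inside the process graph, so the graph is self-contained but grows with every Item added.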

Additional changes

The other parameters would no longer need to be present if stac_items is provided directly. An alternative option is of course to turn this into a separate 'load_stac_items' process with a single parameter.

clausmichele commented 2 months ago

@jdries so, if I understand it correctly, you would like to pass the STAC Items directly as JSON/text in the process graph instead of a URL? It could be a good idea!

From what I understand, if we integrate it in load_stac, a user can provide:

m-mohr commented 1 month ago

This can quickly become problematic. Many STAC Items don't have absolute URLs, and then you can't load the data if the self URL isn't set. Usually you can use the Item URL if no self URL is given, but the Item URL is not available here as a fallback. Also, the JSON size can explode quickly if people start to pass thousands of Items.

Generally, I think I'd prefer a separate process if at all.
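
The relative-URL concern above can be illustrated with a small sketch. The helper resolve_asset_hrefs and the example.com URLs are hypothetical and only mimic the usual fallback order (self link first, then the URL the Item was fetched from); this is not how any particular back-end is known to implement it.

```python
from urllib.parse import urljoin, urlparse


def resolve_asset_hrefs(item, item_url=None):
    """Return {asset_name: absolute_href} for a STAC Item given as a dict.

    Relative hrefs are resolved against the Item's "self" link if present,
    otherwise against item_url (the URL the Item was fetched from).
    """
    self_links = [
        link["href"] for link in item.get("links", []) if link.get("rel") == "self"
    ]
    base = self_links[0] if self_links else item_url

    resolved = {}
    for name, asset in item.get("assets", {}).items():
        href = asset["href"]
        if urlparse(href).scheme:
            # Already absolute, nothing to resolve.
            resolved[name] = href
        elif base:
            resolved[name] = urljoin(base, href)
        else:
            raise ValueError(
                f"asset {name!r} has relative href {href!r} and no base URL"
            )
    return resolved


# An Item without a self link and with a relative asset href.
inline_item = {"links": [], "assets": {"ndvi": {"href": "./ndvi.tif"}}}

# Fetched from a URL: the Item URL serves as the base.
print(resolve_asset_hrefs(inline_item, item_url="https://example.com/stac/item.json"))

# Passed inline in a process graph: no base is available.
try:
    resolve_asset_hrefs(inline_item)
except ValueError as err:
    print("cannot resolve:", err)
```

With an inline Item, only the last branch remains for relative hrefs, which is why absolute URLs would effectively become a requirement.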

jdries commented 1 month ago

It's indeed limited to cases where you use absolute URLs and don't send thousands of items. The use case is really a user who wants to point to a small number of files that are online somewhere but don't have a corresponding STAC Item online. In general, not all of our users have a STAC API or HTTP service at hand where they can quickly upload some items. The process graph also becomes more self-contained if it just includes the STAC metadata.

In fact, our new load_stac sample somewhat illustrates it: https://github.com/Open-EO/openeo-community-examples/blob/main/python/LoadStac/load-stac-item-example.ipynb. At a given point, it says 'make sure you upload your item', and that step is the tricky part.

m-mohr commented 1 month ago

The place that would allow users to do that is the openEO /files endpoints. That was the original intention that users could upload any related files such as GeoJSON, STAC, etc. there. Due to the lack of implementation we didn't push this through the processes either, but maybe we should to encourage it.

The other thing with the STAC example you linked to: Creating a STAC Item for this purpose seems "overkill". You could easily just capture all information you need in a simpler format, I believe, i.e. just a list of assets:

{
    "ndvi": {
        "href": tiff_url,
        "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        "eo:bands": [ # REQUIRED: define the bands in the eo extension for openEO to be able to load it
            {
                "name": "NDVI-band",
            }
        ],
        "proj:epsg": src.crs.to_epsg(),
        "proj:shape": src.shape, # Caveat: this is [height, width] and not [width, height] if you want to set them yourself
        "proj:bbox": proj_bounds,
    }
}

I assume you don't need the geometry and the projected bbox is enough, but not sure.

Do we have a consensus across providers on what STAC Items need to contain to be readable (and which optional fields allow for more efficient loading)?

And then I'm wondering, why not just: load_url(tiff_url, "GTiff", {bands: ["NDVI-band"], ...})?
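
For comparison, the load_url alternative could be written as a process graph node roughly as follows. url, format and options are the parameters of the load_url process; the bands option is taken from the call above, and the example.com URL and the node name load1 are placeholders. Whether a back-end accepts GTiff here or honours a bands option is an assumption, not something established in this thread.

```python
import json

tiff_url = "https://example.com/data/ndvi.tif"  # placeholder URL

load_url_graph = {
    "load1": {
        "process_id": "load_url",
        "arguments": {
            "url": tiff_url,
            "format": "GTiff",
            # Assumption: a "bands" option as suggested in the comment above;
            # any further options from the original call are left out.
            "options": {"bands": ["NDVI-band"]},
        },
        "result": True,
    }
}

print(json.dumps(load_url_graph, indent=2))
```

Compared with an inline STAC Item, this carries only the file URL, the format and a few options, at the cost of leaving projection and extent to be read from the file itself.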