Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0

load_stac: band order remark #488

Open soxofaan opened 9 months ago

soxofaan commented 9 months ago

https://github.com/Open-EO/openeo-processes/blob/ad8a2f3551fe9d69c1689e310bf5307e20d86113/proposals/load_stac.json#L4

has this remark:

The bands (and all dimensions that specify nominal dimension labels) are expected to be ordered as specified in the metadata if the bands parameter is set to null.

What does "ordered as specified in the metadata" mean in practice? Also, doesn't this depend heavily on the STAC extensions in play (if any)?

For example, take a STAC Item like this (using the eo STAC extension):

{
  "type": "Feature",
  "assets": {
    "B04": {
      "eo:bands": [{"name": "B04"}],
      ...
    "B05": {
      "eo:bands": [{"name": "B05"}],  
      ...

I assume the spirit of the remark above is to take band order ["B04", "B05"], but this comes from the "assets" mapping, which technically does not imply an order.
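To illustrate why relying on the "assets" mapping is fragile, here is a minimal Python sketch. The two documents are hypothetical, trimmed-down Items that differ only in the order of their asset keys. Python's json module happens to preserve the order in which keys appear, but RFC 8259 describes a JSON object as an unordered collection, so other parsers or a re-serialization by the data provider need not keep it.

```python
import json

# Two hypothetical, trimmed-down Items with the same assets in a different key order.
ITEM_A = '{"assets": {"B04": {"eo:bands": [{"name": "B04"}]}, "B05": {"eo:bands": [{"name": "B05"}]}}}'
ITEM_B = '{"assets": {"B05": {"eo:bands": [{"name": "B05"}]}, "B04": {"eo:bands": [{"name": "B04"}]}}}'


def bands_in_asset_order(item):
    """Collect band names by walking the assets in the order the parser yields them."""
    return [
        band["name"]
        for asset in item["assets"].values()
        for band in asset.get("eo:bands", [])
    ]


item_a = json.loads(ITEM_A)
item_b = json.loads(ITEM_B)

order_a = bands_in_asset_order(item_a)
order_b = bands_in_asset_order(item_b)
same_content = item_a == item_b  # dict comparison ignores key order

print(order_a, order_b, same_content)
```

Both Items compare as equal, yet the derived band order differs, which is the ambiguity described above.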

soxofaan commented 9 months ago

cc @bossie

bossie commented 9 months ago

FYI, my interpretation was that this "metadata" refers to the list of bands in an Item's properties or a Collection's summaries.

m-mohr commented 9 months ago

Yeah, this needs clarification.

Proposal:

  1. If cube:dimensions is present, use the order of values for the corresponding dimension if available.
  2. If eo:bands, raster:bands or soon bands is available in Item Properties or Collection Summaries, use the order from the array.
  3. If eo:bands, raster:bands or soon bands is available in assets:
    1. For a single data asset with bands: Use the order from the array.
    2. For multiple assets with bands: Sort the band names (following the sort process).
  4. If nothing is present, assign zero-based indices as band names (as we do in other processes such as apply_dimension) - the user probably has to experiment in this case as the order of the actual bands is not clear.

For categorical, non-band dimensions (i.e. type other), only steps 1 and 4 apply. The x, y, z and t dimensions should be clear, as they usually have an implicit order.
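The four steps of this proposal could be sketched roughly as follows in Python. This is only an illustration under several assumptions, not an agreed implementation: the function and helper names are hypothetical, band entries are assumed to carry a "name" field, Python's sorted() stands in for the openEO sort process, and the band count for the last step is passed in by the caller because it cannot be derived from missing metadata.

```python
BAND_KEYS = ("eo:bands", "raster:bands", "bands")


def _band_names(container):
    """Names from the first band array in `container` whose entries all have a "name"."""
    for key in BAND_KEYS:
        entries = container.get(key)
        if entries and all("name" in entry for entry in entries):
            return [entry["name"] for entry in entries]
    return None


def resolve_band_order(stac_object, band_count=0):
    """Hypothetical resolver for an Item or Collection given as a plain dict."""
    props = stac_object.get("properties", {})

    # Step 1: cube:dimensions, if a band dimension lists its values.
    dimensions = props.get("cube:dimensions") or stac_object.get("cube:dimensions", {})
    for dimension in dimensions.values():
        if dimension.get("type") == "bands" and dimension.get("values"):
            return list(dimension["values"])

    # Step 2: band arrays in Item properties or Collection summaries.
    names = _band_names(props) or _band_names(stac_object.get("summaries", {}))
    if names:
        return names

    # Step 3: band arrays on the assets.
    per_asset = []
    for asset in stac_object.get("assets", {}).values():
        asset_names = _band_names(asset)
        if asset_names:
            per_asset.append(asset_names)
    if len(per_asset) == 1:
        return per_asset[0]  # single data asset: keep the array order
    if len(per_asset) > 1:
        # multiple assets: sort the de-duplicated band names
        return sorted({name for asset_names in per_asset for name in asset_names})

    # Step 4: no metadata at all, fall back to zero-based indices.
    return list(range(band_count))
```

For the Item from the first comment, with one band per asset, this sketch ends up in step 3 and returns the sorted names regardless of the key order of "assets".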

m-mohr commented 9 months ago

See also #489 for a related issue.

soxofaan commented 9 months ago

To play devil's advocate: that proposal looks quite complicated (a lot of ifs and branches), which on its own is not very user-friendly. It also means in practice that the effective band order (and even the band names) might suddenly change if the data provider changes or improves their STAC metadata (e.g. adds eo:bands under the Item properties).

I wonder if it wouldn't be better to keep it simpler (including allowing the actual band order and names to be undefined if metadata is missing) and just add a strong recommendation for users to explicitly specify the bands argument in load_stac to avoid surprises.
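As a minimal sketch of what that recommendation means for a user, the process graph node below passes an explicit bands list to load_stac. The URL and the node identifier are hypothetical placeholders, and the band names are taken from the example Item above. The names a back-end actually accepts depend on the metadata it exposes.

```python
import json

# Hypothetical STAC URL, for illustration only.
STAC_URL = "https://example.com/stac/collections/sentinel-2/items/item-1"

process_graph = {
    "loadstac1": {  # arbitrary node identifier
        "process_id": "load_stac",
        "arguments": {
            "url": STAC_URL,
            # Explicit list: the user states the order instead of relying on metadata.
            "bands": ["B04", "B05"],
        },
        "result": True,
    }
}

print(json.dumps(process_graph, indent=2))
```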

m-mohr commented 9 months ago

That would indeed be another, simpler option. The list above only applies if the bands array is not specified anyway.

m-mohr commented 9 months ago

Let's try to simplify and still give the user something to work with:

  1. If bands is provided in load_collection / load_stac: Use the order as provided by the user.
  2. Use the order as provided in the values for the corresponding dimension, if available
  3. Fall back to the order in the file format (the bands arrays mirror what is in the file anyway)

For load_collection it's simpler: only 1 and 2 apply there.
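Read as a plain precedence chain, the simplified rules could look like the following Python sketch. The function name, its parameters and the returned origin labels are hypothetical. How the order in the source data would be obtained is left open here, so it is simply passed in.

```python
def effective_band_order(requested=None, dimension_values=None, source_order=None):
    """Return (band order, origin) using the first candidate that is available.

    requested:        the user's bands argument (None if not set)
    dimension_values: values of the band dimension in the metadata (None if absent)
    source_order:     order found in the source data (None if unknown; always
                      None for load_collection, where only rules 1 and 2 apply)
    """
    candidates = (
        ("user", requested),
        ("cube:dimensions", dimension_values),
        ("source data", source_order),
    )
    for origin, order in candidates:
        if order is not None:
            return list(order), origin
    # Nothing to go on: the order stays undefined.
    return None, "undefined"


print(effective_band_order(requested=["B05", "B04"], dimension_values=["B04", "B05"]))
print(effective_band_order(source_order=["B04", "B05"]))
```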

soxofaan commented 9 months ago

  2. Use the order as provided in the values for the corresponding dimension, if available

Based on #491, I guess by "values" you mean the "values" field of "cube:dimensions" from the datacube STAC extension?

  3. Fall back to the order in the file format (the bands arrays mirror what is in the file anyway)

I'm not completely sure what you mean by "file" or "the file" in this context, but as I mentioned in #491, I think this might not be as trivial as it sounds: there might be multiple "files" in play with inconsistent band sets or band order, and the "file" aspect of the data to load might be an implementation detail of the data provider and subject to change.
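A minimal Python sketch of how such an inconsistency could be detected across several Items is shown below. The Items are hypothetical and the bands are assumed to be listed under "eo:bands" in the Item properties. Real catalogs may keep them elsewhere.

```python
from collections import defaultdict


def group_items_by_band_order(items):
    """Map each distinct band order (as a tuple of names) to the ids of the Items using it."""
    groups = defaultdict(list)
    for item in items:
        bands = item.get("properties", {}).get("eo:bands", [])
        order = tuple(band["name"] for band in bands)
        groups[order].append(item["id"])
    return dict(groups)


# Hypothetical Items: the third one lists the same bands in a different order.
items = [
    {"id": "item-1", "properties": {"eo:bands": [{"name": "B04"}, {"name": "B05"}]}},
    {"id": "item-2", "properties": {"eo:bands": [{"name": "B04"}, {"name": "B05"}]}},
    {"id": "item-3", "properties": {"eo:bands": [{"name": "B05"}, {"name": "B04"}]}},
]

groups = group_items_by_band_order(items)
is_consistent = len(groups) == 1
print(groups, is_consistent)
```

More than one group means no single band order can be read off the Items without an additional rule.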

m-mohr commented 9 months ago

Based on #491, I guess by "values" you mean the "values" field of "cube:dimensions" from the datacube STAC extension?

Yes.

I'm not completely sure what you mean by "file" or "the file" in this context, but as I mentioned in #491, I think this might not be as trivial as it sounds: there might be multiple "files" in play with inconsistent band sets or band order, and the "file" aspect of the data to load might be an implementation detail of the data provider and subject to change.

Source data, whatever that might be. Usually COGs, netCDF, ... For me this is a fallback and, as such, a best-effort way to give at least some clue, e.g. perhaps the way GDAL does it. I assume consistent STAC catalogs here. If this doesn't work, we are back to "undefined" anyway, so I'm not sure whether having the file reference really hurts.

m-mohr commented 1 month ago

dev telco:

  1. cube:dimensions if applicable
  2. band names
  3. band indices (if multiple bands per file)
  4. asset names (if one band per file) - sorted in alphabetical order?