Open-EO / openeo-python-client

Python client API for OpenEO
https://open-eo.github.io/openeo-python-client/
Apache License 2.0

Math processes on the top-level / multiple save_results in process #609

Open m-mohr opened 2 weeks ago

m-mohr commented 2 weeks ago

Hi,

I have the following Python client code:

import openeo
from openeo.processes import power

connection = openeo.connect("https://earthengine.openeo.org").authenticate_basic("group2", "test123")

aoi = {
  "type": "Polygon",
  "coordinates": [
    [[-7.664532, 38.543869], [-7.689248, 38.141037], [-7.159228, 38.151837], [-7.11289, 38.554609], [-7.664532, 38.543869]]
  ]
}

EPSG = 32629

data = connection.load_collection(
  collection_id = "COPERNICUS/S2_SR_HARMONIZED",
  spatial_extent = aoi,
  temporal_extent = ["2019-06-27T00:00:00Z", "2019-07-04T00:00:00Z"],
  bands = ["B1", "B2", "B3", "B4"]
)

# If we get multiple images (should not happen for the given extent), compute the mean
# Divide by 10000 to convert from DN to reflectance values
data = data.mean_time() / 10000

# Assign the individual bands to variables
B1 = data.band("B1")
B2 = data.band("B2")
B3 = data.band("B3")
B4 = data.band("B4")

# Density of cyanobacteria
cyanobacteria = 115530 * power((B3 * B4) / B2, 2.38)
save1 = cyanobacteria.save_result(
  format = "GTIFF",
  options = {
    "name": "cyanobacteria",
    "metadata": {
      "bands": [ { "statistics": { "minimum": 0, "maximum": 100 } } ]
    },
    "epsgCode": EPSG
  }
)

Unfortunately, I get the following error when I run result1 = connection.execute(save1):

Preflight process graph validation raised: [ProcessArgumentInvalid] The argument 'base' in process 'power' (namespace: n/a) is invalid: Schema for result 'reducedimension2' not compatible

Looking at the generated JSON, this is not overly surprising anymore:

[screenshot: the generated process graph JSON, with power and multiply as top-level nodes]

Why does the client put power and multiply at the top level? It works flawlessly for the division by 10000, which ends up inside an apply.

m-mohr commented 2 weeks ago

And if I try to add more save_result nodes:

chlorophyll_a = 4.26 * power(B3 / B1, 3.94)
chlorophyll_a.save_result(
  format = "GTIFF",
  options = {
    "name": "chlorophyll_a",
    "metadata": {
      "bands": [ { "statistics": { "minimum": 0, "maximum": 200 } } ]
    },
    "epsgCode": EPSG
  }
)

# Turbidity
turbidity = 8.93 * (B3 / B1) - 6.39
result = turbidity.save_result(
  format = "GTIFF",
  options = {
    "name": "turbidity",
    "metadata": {
      "bands": [ { "statistics": { "minimum": 0, "maximum": 30 } } ]
    },
    "epsgCode": EPSG
  }
)

job = connection.create_job(title = "OSPD Algal Bloom usecase (Python)", process_graph=result)

So that it gets closer to:

[diagram: the intended workflow with multiple save_result outputs]

It seems not to pick up the additional save_result nodes.

Seems I'm pushing some boundaries ;-)

soxofaan commented 2 weeks ago

Yeah, it's a bit hard to explain what is going on here.

Why does the client put power and multiply at the top level? It works flawlessly for the division by 10000, which ends up inside an apply.

Simply put: the divide is a method call (disguised in syntactic sugar), so it is aware that it is working on a data cube and translates itself to an apply with the division as child process. power is just a function that is not smart enough to know it should use apply with a child process; instead it applies power to the full data cube, which results in this top-level power node.
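For illustration, a rough sketch of the two code paths (the graph comments are simplified, not verbatim client output):

# Operator path: the data cube object overloads __truediv__, so the client
# knows the operand is a cube and wraps the division in an apply child process.
scaled = data / 10000   # graph: apply(data, process: divide(x, 10000))

# Free-function path: openeo.processes.power() only builds a "power" node;
# it does not know that `base` is a whole data cube, so the node ends up
# at the top level of the graph instead of inside an apply.
from openeo.processes import power
bad = power(base=data, p=2.38)   # graph: power(base=<cube result>, p=2.38)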

A workaround is to express the power as follows:

cyanobacteria = 115530 * ((B3 * B4) / B2).apply(lambda x: x.power(2.38))
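An equivalent spelling, using the already-imported openeo.processes.power function inside the apply callback (same child-process mechanism):

cyanobacteria = 115530 * ((B3 * B4) / B2).apply(lambda x: power(x, 2.38))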

But I understand this is indeed not obvious.

I'm not sure yet how to improve the situation here, e.g. make processes like power smart enough to do the right thing, or throw an error that points to a better approach.
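A purely hypothetical sketch of the first option (the isinstance dispatch is illustrative, not existing client behaviour):

from openeo.rest.datacube import DataCube
from openeo.processes import power

def smart_power(base, p):
    # Hypothetical: if the base is a whole data cube, wrap the pixel-wise
    # math in an apply automatically; otherwise build the plain power node.
    if isinstance(base, DataCube):
        return base.apply(lambda x: x.power(p))
    return power(base, p)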

m-mohr commented 2 weeks ago

Thanks, yes, this works. Indeed, it would be more obvious if the power function would somehow react when one of its inputs is a data cube.

I'm not sure how to create the process graph with 4 save_results, but for now I'll just create 3 jobs I guess...
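For reference, a minimal sketch of that separate-jobs workaround (the loop structure, job titles, and start_job call are illustrative):

# Submit one batch job per output instead of a single multi-output graph.
for name, cube, max_stat in [
    ("cyanobacteria", cyanobacteria, 100),
    ("chlorophyll_a", chlorophyll_a, 200),
    ("turbidity", turbidity, 30),
]:
    result = cube.save_result(
        format="GTIFF",
        options={
            "name": name,
            "metadata": {"bands": [{"statistics": {"minimum": 0, "maximum": max_stat}}]},
            "epsgCode": EPSG,
        },
    )
    job = connection.create_job(result, title=f"OSPD Algal Bloom usecase ({name})")
    job.start_job()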

soxofaan commented 1 week ago

I'm not sure how to create the process graph with 4 save_results, but for now I'll just create 3 jobs I guess...

Indeed, that is roadblocked by some old, outdated assumptions, but we're looking into improving that:

(note that we're also working on backend-side support for that)