Open-EO / openeo-earthengine-driver

openEO back-end driver for Google Earth Engine.
Apache License 2.0

Architectural issue with sync/async processing #36

Closed m-mohr closed 7 months ago

m-mohr commented 4 years ago

We have an architectural issue in the GEE driver that I don't have a solution for so far.

The GEE JS API mostly runs synchronously, except for some network-related tasks (getDownloadUrl and getInfo, for example). In particular, you can't run asynchronous things in functions like ee.ImageCollection.map.

On the other hand, we have some openEO GEE driver tasks that run asynchronously (e.g. loading things from database, file system or network for validation or execution).

The problem is that you can't easily mix both worlds. Once something in GEE is synchronous, like ee.ImageCollection.map, you can't run anything asynchronous in it, and the other way round.
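The mismatch can be reproduced without Earth Engine. The following is a minimal sketch in plain JavaScript, where Array.prototype.map stands in for a synchronous GEE function such as ee.ImageCollection.map, and loadFromDatabase is a hypothetical placeholder for an asynchronous driver task (it is not part of the driver):

```javascript
// Hypothetical stand-in for an asynchronous driver task (e.g. a database lookup).
async function loadFromDatabase(label) {
  return `value-for-${label}`;
}

// A synchronous map calls the callback and uses its return value immediately.
// Array.prototype.map is used here only as a stand-in for a GEE map function.
const labels = ['2019', '2020'];
const results = labels.map(label => loadFromDatabase(label));

// The callback cannot wait for the async task, so each entry is a
// Promise object rather than the resolved value.
console.log(results[0] instanceof Promise);
```

With a plain array the promises could still be collected with Promise.all afterwards. A GEE map callback has to return a GEE object directly, so that workaround is not available there.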

Here's an example from the newly introduced aggregate_temporal_frequency:

async execute(node) {
  // Get parameters and set some variables
  var dc = node.getDataCube('data');
  var frequency = node.getArgument('frequency');

  // prepare image collection with aggregation label
  var images = Commons.setAggregationLabels(dc.imageCollection(), frequency);

  // aggregate based on aggregation label

  // Get a unique list of all year/season labels
  var newLabels = ee.List(images.aggregate_array('aggregationLabel')).distinct();

  // Aggregation for each year/season label
  var aggregatedImages = newLabels.map(label => {
    var collection = images.filterMetadata('aggregationLabel', 'equals', label);
    var firstImg = collection.first();
    // ISSUE 1: Here's async code (await) in a map function that expects a sync callback
    var image = await this.reduce(node, collection);
    return image.copyProperties({source: firstImg, properties: firstImg.propertyNames()});
  });

  // Update data cube
  dc.setData(ee.ImageCollection(aggregatedImages));

  var dimensionName = node.getArgument('dimension');
  var dimension = dc.dim(dimensionName);
  if (dimension === null) {
    dimension = dc.dimT();
  }

  // ISSUE 2: Here's async code (getInfo) so execute needs to stay async
  var dimLabels = await Utils.promisify(newLabels, 'getInfo');
  dimension.setValues(dimLabels);

  return dc;
}

So the issue is that the execute functions need to be async for some use cases (see issue 2), which makes the process graph execution async in general. On the other hand, process graph execution needs to be synchronous, as otherwise we can't execute callbacks such as complex reducers in GEE functions such as map (see issue 1).

I don't see a solution as of now, but this is a big issue that will limit the usefulness of the GEE driver a lot. I can't implement aggregate_temporal_frequency at the moment due to the way the GEE JS API works (all sync). Maybe there's a way around it, but I couldn't figure it out from the documentation.

m-mohr commented 4 years ago

I have some ideas:

That will be a major task to implement and nothing we can do anytime soon, I guess.

For now, I made aggregate_temporal_frequency work with simple reducers only, which should handle most use cases and is synchronous.

cc @bgoesswe @claxn

gena commented 9 months ago

Have you tried using evaluate() instead of getInfo()? It is supposed to be async: https://developers.google.com/earth-engine/apidocs/ee-list-evaluate
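For illustration, an evaluate() call could be wrapped in a Promise roughly as follows. This is only a sketch: evaluateAsPromise is a hypothetical helper name, and fakeList is a stand-in object that mimics the evaluate(callback) shape so the snippet runs without Earth Engine. It assumes the callback receives the result first and an error message second.

```javascript
// Hypothetical helper: wraps an evaluate(callback) call in a Promise.
// Assumes the callback has the form (success, failure).
function evaluateAsPromise(computedObject) {
  return new Promise((resolve, reject) => {
    computedObject.evaluate((success, failure) => {
      if (failure) {
        reject(new Error(failure));
      } else {
        resolve(success);
      }
    });
  });
}

// Stand-in for an ee.List so the sketch runs without Earth Engine (assumption).
const fakeList = {
  evaluate(callback) {
    setTimeout(() => callback(['2019', '2020'], undefined), 0);
  }
};

// In an async function this could be written as:
//   const dimLabels = await evaluateAsPromise(newLabels);
const labelsPromise = evaluateAsPromise(fakeList);
```

This only covers calls made from code that is already async, like the getInfo call in the example above. It does not make it possible to await inside a synchronous map callback.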

m-mohr commented 9 months ago

Not yet, thank you for pointing me to it. I'll have a look.

m-mohr commented 7 months ago

Unfortunately, this is not an issue purely related to GEE processes. There's also some internal code that we have to run (e.g. get some values from a database). This is always asynchronous in our case and as such the only thing I can see here is that GEE would somehow allow async callbacks.

m-mohr commented 7 months ago

Solved for, I guess, 80% of the use cases by making the process graph execution synchronous. The remaining 20% need to be mitigated differently, e.g. by optimizing the GEE client code. Let's see how it works in the future. Closing for now.