Open-EO / openeo-geotrellis-extensions

Java/Scala extensions for Geotrellis, for use with OpenEO GeoPySpark backend.
Apache License 2.0

support 'any' process in filter_labels #239

Open jdries opened 10 months ago

jdries commented 10 months ago

In filter_labels, when chaining together many (100+) 'or' processes, we hit Python's maximum recursion depth:

  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_collections.py", line 82, in __setitem__
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1314, in __call__
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1277, in _build_args
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1264, in _get_args
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_collections.py", line 523, in convert
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_collections.py", line 82, in __setitem__
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1314, in __call__
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1277, in _build_args
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1263, in _get_args
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_collections.py", line 490, in can_convert
  File "/usr/lib64/python3.8/abc.py", line 98, in __instancecheck__
    return _abc_instancecheck(cls, instance)
RecursionError: maximum recursion depth exceeded in comparison

The best solution is to use the 'any' process instead, which should then receive an array of 'date_between' comparisons.
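To illustrate why a flat 'any' helps, here is a minimal sketch using plain Python dicts. It is a simplified model, not the structure the backend actually passes through py4j: `date_between_leaf`, `chain_or`, `flat_any` and `nesting_depth` are hypothetical helpers, and the nested-dict layout is an assumption. The point is only that chaining binary 'or' nodes nests one level deeper per condition, while a single 'any' keeps the depth constant.

```python
def date_between_leaf(day):
    """One 'date_between' comparison (the min/max values are placeholders)."""
    return {
        "process_id": "date_between",
        "arguments": {"x": {"from_parameter": "value"}, "min": day, "max": day},
    }


def chain_or(conditions):
    """Fold the conditions into nested binary 'or' nodes, left to right."""
    expr = conditions[0]
    for cond in conditions[1:]:
        expr = {"process_id": "or", "arguments": {"x": expr, "y": cond}}
    return expr


def flat_any(conditions):
    """Combine all conditions in a single 'any' node over an array."""
    return {"process_id": "any", "arguments": {"data": list(conditions)}}


def nesting_depth(node):
    """Count how deeply process nodes are nested (iterative, so it cannot recurse too deep itself)."""
    deepest = 0
    stack = [(node, 0)]
    while stack:
        current, level = stack.pop()
        if isinstance(current, dict):
            if "process_id" in current:
                level += 1
                deepest = max(deepest, level)
            stack.extend((child, level) for child in current.values())
        elif isinstance(current, list):
            stack.extend((child, level) for child in current)
    return deepest


conditions = [date_between_leaf(f"day-{i}") for i in range(150)]
print(nesting_depth(chain_or(conditions)))
print(nesting_depth(flat_any(conditions)))
```

In this model the chained form nests as many process nodes as there are conditions, whereas the 'any' form always has a depth of 2. Any code that walks such a structure recursively needs a stack proportional to that depth.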

We will need to extend our new type checking mechanism in OpenEOProcessScriptBuilder to allow detecting this case.

jdries commented 10 months ago

@EmileSonneveld I committed a first version of this, to get something to the user that needs this. Would be good if you could:

EmileSonneveld commented 10 months ago

This code worked:

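import datetime

from openeo.processes import any, process
# Internal helper; the import path is an assumption and may differ between client versions.
from openeo.rest._datacube import build_child_callback

def build_condition(x):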
    conditions = []
    dates = ["2021-01-02", "2021-01-05", "2021-02-01", "2021-02-04"]
    for date in dates:
        min_date = (datetime.datetime.fromisoformat(date)).isoformat() + "Z"
        max_date = (datetime.datetime.fromisoformat(date) + datetime.timedelta(days=1)).isoformat() + "Z"
        conditions.append(process("date_between", x=x, min=min_date, max=max_date))
    return any(conditions)

condition = build_child_callback(build_condition, parent_parameters=["value"])

datacube = datacube.process(process_id="filter_labels",
                            arguments={"data": datacube, "condition": condition},
                            dimension="t")

A nicer syntax will be available in the next release:

import datetime

from openeo.processes import any, process, date_between
from openeo.util import rfc3339

def filter_labels_condition(x):
    conditions = []
    dates = ["2021-01-02", "2021-01-05", "2021-02-01", "2021-02-04"]
    for date in dates:
        date = rfc3339.parse_date(date)
        min_date = rfc3339.date(date)
        max_date = rfc3339.date(date + datetime.timedelta(days=1))
        conditions.append(date_between(x=x, min=min_date, max=max_date))
    return any(conditions)

datacube = datacube.filter_labels(condition=filter_labels_condition, dimension="t")
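For reference, the condition built above should correspond to a flat child process graph: one 'date_between' node per date, plus a single 'any' node that references them. The sketch below builds such a graph by hand with the standard library only. The function name `any_of_dates_graph` and the node ids (`datebetween1`, `any1`, ...) are made up for illustration, and the exact graph the client generates may differ. The `value` parameter name is taken from the `parent_parameters=["value"]` call above.

```python
import datetime
import json


def any_of_dates_graph(dates, parameter="value"):
    """Build a child process graph: one date_between per date, combined by one 'any'."""
    graph = {}
    refs = []
    for i, day in enumerate(dates, start=1):
        start = datetime.date.fromisoformat(day)
        end = start + datetime.timedelta(days=1)  # one-day window, as in the snippets above
        node_id = f"datebetween{i}"  # hypothetical node id
        graph[node_id] = {
            "process_id": "date_between",
            "arguments": {
                "x": {"from_parameter": parameter},
                "min": start.isoformat(),
                "max": end.isoformat(),
            },
        }
        refs.append({"from_node": node_id})
    # A single 'any' node receives the array of comparison results.
    graph["any1"] = {
        "process_id": "any",
        "arguments": {"data": refs},
        "result": True,
    }
    return {"process_graph": graph}


condition = any_of_dates_graph(["2021-01-02", "2021-01-05", "2021-02-01", "2021-02-04"])
print(json.dumps(condition, indent=2))
```

The number of nodes grows with the number of dates, but no node is nested inside another, which is what avoids the deep 'or' chain.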

date_between still needs to be supported in the web editor.

EmileSonneveld commented 9 months ago

date_between is now also available in the web editor.

EmileSonneveld commented 9 months ago

any can be used in two ways: to reduce the values along a dimension in apply_dimension, or to reduce an array of individual values. The following example uses the process in both ways, but gives an error:

{
  "process_graph": {
    "loadcollection1": {
      "process_id": "load_collection",
      "arguments": {
        "bands": [
          "B04"
        ],
        "id": "SENTINEL2_L2A",
        "spatial_extent": {
          "east": 5.08,
          "north": 51.22,
          "south": 51.215,
          "west": 5.07
        },
        "temporal_extent": [
          "2021-01-01",
          "2021-03-01"
        ]
      }
    },
    "apply1": {
      "process_id": "apply_dimension",
      "arguments": {
        "data": {
          "from_node": "loadcollection1"
        },
        "dimension": "t",
        "process": {
          "process_graph": {
            "any1": {
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "process_id": "any"
            },
            "any2": {
              "arguments": {
                "data": [
                  {
                    "from_node": "constant1"
                  },
                  {
                    "from_node": "constant2"
                  }
                ]
              },
              "process_id": "any"
            },
            "constant1": {
              "arguments": {
                "x": false
              },
              "process_id": "constant"
            },
            "constant2": {
              "arguments": {
                "x": true
              },
              "process_id": "constant"
            },
            "if5": {
              "arguments": {
                "accept": {
                  "from_node": "any1"
                },
                "value": {
                  "from_node": "any2"
                }
              },
              "process_id": "if",
              "result": true
            }
          }
        }
      }
    },
    "saveresult1": {
      "process_id": "save_result",
      "arguments": {
        "data": {
          "from_node": "apply1"
        },
        "format": "GTiff"
      },
      "result": true
    }
  },
  "parameters": []
}
EmileSonneveld commented 9 months ago

With a more fine-tuned process graph, the any node does actually work in both use cases:

process graph:

```json
{
  "process_graph": {
    "loadcollection1": {
      "process_id": "load_collection",
      "arguments": {
        "bands": ["B04"],
        "id": "SENTINEL2_L2A",
        "spatial_extent": {"east": 5.08, "north": 51.22, "south": 51.215, "west": 5.07},
        "temporal_extent": ["2021-01-09", "2021-01-13"]
      }
    },
    "apply1": {
      "process_id": "apply_dimension",
      "arguments": {
        "data": {"from_node": "loadcollection1"},
        "dimension": "t",
        "process": {
          "process_graph": {
            "any2": {"process_id": "any", "arguments": {"data": [1, 0, 1]}},
            "any1": {"process_id": "any", "arguments": {"data": [{"from_node": "gt1"}]}},
            "if5": {
              "process_id": "if",
              "arguments": {"accept": {"from_node": "any1"}, "value": {"from_node": "any2"}}
            },
            "multiply1": {
              "process_id": "multiply",
              "arguments": {"x": {"from_node": "if5"}, "y": 1},
              "result": true
            },
            "gt1": {
              "process_id": "gt",
              "arguments": {"y": 700, "x": {"from_parameter": "data"}}
            }
          }
        }
      },
      "result": true
    }
  },
  "parameters": []
}
```
EmileSonneveld commented 8 months ago

any works fine with floats and booleans for tiles, but fails on booleans in 'constant mode': Method constantArrayElement([class java.lang.Boolean]) does not exist. I'll check for a quick fix. (In the any node, ignore_nodata seems to be ignored.)
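As a point of comparison for the ignore_nodata behaviour, here is a small pure-Python model of what 'any' is expected to return on an array of single values, as I read the openEO process specification. The function name `any_reference` is made up, `None` stands in for no-data, and returning `None` when nothing is left to evaluate (empty or all no-data input) is an assumption. This is not the backend's Scala implementation.

```python
from typing import Optional, Sequence, Union

Value = Union[bool, int, float, None]


def any_reference(data: Sequence[Value], ignore_nodata: bool = True) -> Optional[bool]:
    """Reference model of 'any'; None represents a no-data value."""
    if ignore_nodata:
        values = [v for v in data if v is not None]
    else:
        values = list(data)
    if not values:
        # Assumption: nothing to evaluate gives no-data.
        return None
    if any(v is not None and bool(v) for v in values):
        # At least one truthy value wins, even if no-data is present.
        return True
    if any(v is None for v in values):
        # No truthy value, but an unignored no-data makes the result unknown.
        return None
    return False


print(any_reference([False, None]))
print(any_reference([False, None], ignore_nodata=False))
print(any_reference([1, 0, 1]))
```

If ignore_nodata really is ignored by the backend, the first two calls would be the ones to compare against: they differ only in that flag.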

Some code changed around the eq node. It also gives an error when used with booleans in 'constant mode': java.lang.ClassCastException: class java.lang.Boolean cannot be cast to class scala.collection.Seq. Floats work in 'constant mode'; floats and booleans work in tile mode.

process graph:

```json
{
  "process_graph": {
    "loadcollection1": {
      "process_id": "load_collection",
      "arguments": {
        "bands": ["SCL"],
        "id": "SENTINEL2_L2A",
        "spatial_extent": {"east": -25, "north": 41, "south": 39.5, "west": -26.5},
        "temporal_extent": ["2021-01-01", "2021-01-10"]
      }
    },
    "apply1": {
      "process_id": "apply",
      "arguments": {
        "data": {"from_node": "loadcollection1"},
        "process": {
          "process_graph": {
            "cos1": {"process_id": "cos", "arguments": {"x": {"from_parameter": "x"}}},
            "eq1": {
              "process_id": "eq",
              "arguments": {"x": {"from_parameter": "x"}, "y": 9, "delta": 0}
            },
            "eq2": {"process_id": "eq", "arguments": {"x": 1, "y": 1}},
            "if1": {
              "process_id": "if",
              "arguments": {
                "accept": {"from_node": "eq1"},
                "value": {"from_node": "eq2"},
                "reject": {"from_node": "cos1"}
              }
            },
            "multiply1": {
              "process_id": "multiply",
              "arguments": {"x": {"from_node": "if1"}, "y": 1},
              "result": true
            }
          }
        }
      }
    },
    "saveresult1": {
      "process_id": "save_result",
      "arguments": {"data": {"from_node": "apply1"}, "format": "netcdf"},
      "result": true
    }
  },
  "parameters": []
}
```
EmileSonneveld commented 5 months ago

Maybe related: https://github.com/Open-EO/openeo-geotrellis-extensions/issues/286