Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0

Generic and/or specific process definitions? #6

Closed m-mohr closed 5 years ago

m-mohr commented 6 years ago

This came up in #60 and was suggested by @edzer:

What we should do is work with generic functions like min, max, mean, sum etc. and apply them to dimensions rather than write combination functions. Instead of time_min(obj) we would use map(obj, "time", min). Similar for aggregate: aggregate(obj, predicate, fun).

We need to discuss how we want to have our processes defined. I'll contribute some thoughts, pros/cons and examples later.

Google Earth Engine does both. They have the generic functions (e.g. filter combined with Date and Bounds classes) and many wrappers around them (e.g. filterDate, filterBounds).

m-mohr commented 6 years ago

General thoughts

There are two approaches we can use: specific functions (e.g. filter_bands or min_time) and generic functions (e.g. filter and aggregate with appropriate parameters). Common specific functions could be wrapped around generic functions; Google Earth Engine does this.
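To make the wrapping idea concrete, here is a minimal Python sketch (all names are hypothetical illustrations, not openEO API) of a generic aggregate process with a specific min_time wrapper on top, mirroring how GEE layers filterDate over filter:

```python
# Hypothetical sketch: a generic aggregate process plus a specific
# wrapper. The collection is modeled as a {timestamp: value} dict
# purely for illustration.
def aggregate(collection, dimension, fn):
    """Generic: reduce `collection` along `dimension` with `fn`."""
    if dimension == "time":
        return fn(collection.values())
    raise ValueError(f"unsupported dimension: {dimension}")

def min_time(collection):
    """Specific: a thin wrapper around the generic process."""
    return aggregate(collection, "time", min)

series = {"2017-01-01": 4, "2017-01-02": 2, "2017-01-03": 7}
print(min_time(series))  # -> 2
```

The wrapper costs one line once the generic process exists, which is the main appeal of starting generic.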

Complexity of process discovery / Documentation

The structure for process discovery is much more complex for generic functions than for specific ones. An individual generic function may accept only specific values for several arguments, and those values influence which additional parameters can be used. It gets even harder if you want to reuse commonly used functions like min/max/count.

It's probably easier to document specific functions; on the other hand, with generic functions all aggregation-related material is naturally documented together. You could achieve the same for specific functions with tags or categories.

Number of processes

Of course, the number of processes is much higher for specific functions. The overall scope of functionality is nevertheless the same.

Process graph size

This depends on the implementation chosen for the generic functions. The graph can be as small as with specific functions, see the examples.

Learning curve / Usability

For users, the generic functions seem to be much harder to use and learn. The process graph (see examples) is more complex to read, and the arguments can be confusing. The amount of material to learn is very similar; it's just structured differently.

Extensibility

It might be a bit easier to extend the generic functions, but in the end the two approaches shouldn't differ much in this respect.

Implementation

This is probably very subjective, but I think specific functions are easier to implement: generic functions need much stronger checking of whether arguments are valid and which parameter they belong to. Stricter languages like Java might have problems implementing such "dynamic" signatures; you would probably have to pass arguments as dictionaries there.
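As an illustration of the extra checking, here is a small sketch (in Python, with hypothetical names; a Java back-end would do the same with maps) of the runtime argument validation a generic filter process needs, since each filter kind allows a different argument set:

```python
# Hypothetical sketch: in a generic `filter` process, the allowed
# arguments depend on the value of the `filter` argument, so they
# arrive as a dict and must be validated at runtime.
ALLOWED_ARGS = {
    "daterange": {"from", "to"},
    "bbox": {"left", "right", "top", "bottom", "srs"},
}

def validate_filter(kind, args):
    if kind not in ALLOWED_ARGS:
        raise ValueError(f"unknown filter: {kind}")
    unknown = set(args) - ALLOWED_ARGS[kind]
    if unknown:
        raise ValueError(f"invalid args for {kind}: {sorted(unknown)}")
    return True

print(validate_filter("daterange", {"from": "2017-01-01", "to": "2017-01-31"}))  # -> True
```

With specific processes such as filter_daterange, the same constraints live in each process signature and can be checked statically.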

UDFs

Need to think more about this... and might depend on the concrete UDF implementation.

Comparison

A quick comparison based on the thoughts above (probably not very objective in many cases):

| Aspect | Specific | Generic |
|---|---|---|
| Complexity of process discovery / documentation | + | - |
| Number of processes | - | + |
| Process graph size | + | +/- |
| Learning curve / usability | 0 | - |
| Extensibility | 0 | + |
| Implementation | + | 0 |
| UDFs | ? | ? |

It seems no single approach really stands out. I personally prefer the specific functions, mostly because they seem a bit easier to use and process discovery is less complex. But that is probably mostly because I don't have a strong background in functional programming.

Examples for specific functions

Note: Examples are mostly based on API v0.0.2.

Process graph

{
  "process_graph":{
    "process_id":"min_time",
    "args":{
      "imagery":{
        "process_id":"filter_daterange",
        "args":{
          "imagery":{
            "process_id":"filter_bbox",
            "args":{
              "imagery":{
                "product_id":"S2_L2A_T32TPS_20M"
              },
              "left":652000,
              "right":672000,
              "top":5161000,
              "bottom":5181000,
              "srs":"EPSG:32632"
            }
          },
          "from":"2017-01-01",
          "to":"2017-01-31"
        }
      }
    }
  }
}

Process discovery

[
  {
    "process_id":"filter_daterange",
    "description":"Drops observations from a collection that have been captured before a given start date or after a given end date.",
    "args":{
      "imagery":{
        "description":"array of input collections with one element"
      },
      "from":{
        "description":"start date"
      },
      "to":{
        "description":"end date"
      }
    }
  },
  {
    "process_id":"filter_bbox",
    "description":"Drops observations from a collection that are located outside of a given bounding box.",
    "args":{
      "imagery":{
        "description":"array of input collections with one element"
      },
      "left":{
        "description":"left boundary (longitude / easting)"
      },
      "right":{
        "description":"right boundary (longitude / easting)"
      },
      "top":{
        "description":"top boundary (latitude / northing)"
      },
      "bottom":{
        "description":"bottom boundary (latitude / northing)"
      },
      "srs":{
        "description":"spatial reference system of boundaries as proj4 or EPSG:12345 like string"
      }
    }
  },
  {
    "process_id":"min_time",
    "description":"Finds the minimum value of time series for all bands of the input dataset.",
    "args":{
      "imagery":{
        "description":"array of input collections with one element"
      }
    }
  }
]

Examples for generic functions

Process graph

{
  "process_graph":{
    "process_id":"aggregate",
    "args":{
      "imagery":{
        "process_id":"filter",
        "args":{
          "imagery":{
            "process_id":"filter",
            "args":{
              "imagery":{
                "product_id":"S2_L2A_T32TPS_20M"
              },
              "filter":"bbox",
              "bbox:left":652000,
              "bbox:right":672000,
              "bbox:top":5161000,
              "bbox:bottom":5181000,
              "bbox:srs":"EPSG:32632"
            }
          },
          "filter":"daterange",
          "daterange:from":"2017-01-01",
          "daterange:to":"2017-01-31"
        }
      },
      "dimension":"time",
      "function":"min"
    }
  }
}

Process discovery

This is a first draft and probably needs much more work.

[
  {
    "process_id":"filter",
    "description":"Filters image collections using a specific filter.",
    "args":{
      "imagery":{
        "description":"array of input collections with one element"
      },
      "filter":[
        {
          "value":"daterange",
          "description":"Drops observations from a collection that have been captured before a given start date or after a given end date.",
          "args":{
            "from":{
              "description":"start date"
            },
            "to":{
              "description":"end date"
            }
          }
        },
        {
          "value":"bbox",
          "description":"Drops observations from a collection that are located outside of a given bounding box.",
          "args":{
            "left":{
              "description":"left boundary (longitude / easting)"
            },
            "right":{
              "description":"right boundary (longitude / easting)"
            },
            "top":{
              "description":"top boundary (latitude / northing)"
            },
            "bottom":{
              "description":"bottom boundary (latitude / northing)"
            },
            "srs":{
              "description":"spatial reference system of boundaries as proj4 or EPSG:12345 like string"
            }
          }
        }
      ]
    }
  },
  {
    "process_id":"aggregate",
    "description":"Aggregates values in an image collection.",
    "args":{
      "imagery":{
        "description":"array of input collections with one element"
      },
      "dimension":[
        {
          "value":"time",
          "description":"Time is used as dimension..."
        }
      ],
      "function":[
        {
          "value":"min",
          "description":"Calculates the minimum."
        }
      ]
    }
  }
]
edzer commented 6 years ago

Thanks, Matthias! @mkadunc came up with this idea during the workshop at VITO. I think functional programming is an extremely powerful concept. Having generics filter, aggregate and reduce (and maybe more?) will make everything easier, including implementing back-ends and doing UDFs. Let's take some time and see how these ideas are being put to work in GEE, databases, R, and so on.

m-mohr commented 6 years ago

@edzer I extended the post above a bit...

@edzer I disagree that it makes everything(!) easier. Process discovery in particular (and comparison between back-ends) will be more complex. Back-end implementation will certainly be easier in functional programming languages, but in others it might be harder to accomplish this structure. Overall, though, getting this structure running won't have a deep impact time-wise, I think: the scope of functionality doesn't change much, so the implementation work should be similar. Regarding the process graph, it is mostly just another way of expressing things / a different notation. So API-wise it is probably mostly a matter of personal taste. I can't really predict how much it influences the back-ends.

GEE implements filters, reducers etc. mostly as classes, which is more the object-oriented style. This would require a change in the process graph definition (currently every object has to be a process, but "daterange" or "bands" is not a process itself). Which databases could one look at?

mkadunc commented 6 years ago

I agree that not everything is easier, here are some of my thoughts on the subject:

Specifying the API of generic filter, aggregate, reduce processes (that can work on any combination of dimensions and with any function used inside the loop) is definitely more difficult than specifying the API of a small set of filter_by_<something> and min_by_<something> functions. Implementing the above generic processes on the back-end to work properly using any combination of user-provided dimensions and predicate processes is definitely a difficult task, much more so than having a small set of predefined specific processes.

The problem with a small set of predefined specific processes is that it won't remain small for long. The set will inevitably grow, because users will want to express more complex algorithms and will come up with new use cases all the time. As the number of such processes grows, so will the effort of specifying and implementing them: each on its own will still be easier to define and implement than the generic alternative, but the set will become hard to manage and maintain consistently.

If we start with generic processes, we put in extra effort at the beginning, but once that is implemented we have a vastly more flexible and powerful syntax than a couple of (most frequently used) filters and aggregators would give us. Implementing any subsequent wrappers then becomes trivial (I think we should still have wrappers for the most common cases, also to improve the learning curve for new users).

Regarding your proposal, I think we should strive to make the filters and aggregators processes as well, to get as much flexibility and reuse as possible (my understanding is that process graphs are basically functions; correct me if I'm wrong). The 'predicate' argument of filter is a function that takes a (subset of an) image collection and returns a single boolean. An aggregator (the function used in aggregate operations) is a process that takes an image collection that is an array (usually 1-dimensional) and returns a single value, usually a scalar such as the median (but it could also be a vector, e.g. the 5th, 50th and 95th percentiles). I believe the whole concept is pretty simple; the easiest way to understand it, IMO, is having a look at:

The difference between JavaScript's arrays and openEO is only in how a collection is split into the iterable items over which these functions operate (JS arrays always iterate over the first dimension; the openEO iteration dimension would be determined by the caller of the generic function).
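That difference can be sketched in a few lines (Python here, with a nested list standing in for an image collection; the dimension names are purely illustrative): a caller-chosen dimension generalizes the fixed first-axis iteration of JS array methods.

```python
# Hypothetical sketch: JS arrays always reduce over the first axis;
# an openEO-style generic reduce lets the caller pick the dimension.
data = [[1, 5], [3, 2], [4, 0]]  # dims: (time, band), for illustration

def reduce_dim(cube, dim, fn):
    if dim == 0:  # reduce over time: one result per band
        return [fn(col) for col in zip(*cube)]
    if dim == 1:  # reduce over bands: one result per time step
        return [fn(row) for row in cube]
    raise ValueError("unsupported dimension")

print(reduce_dim(data, 0, min))  # -> [1, 0]
print(reduce_dim(data, 1, min))  # -> [1, 2, 0]
```

The reducer (`min` here) stays generic; only the split into iterable items changes with the chosen dimension.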

The difficult bit in this whole design is defining 'process' in a wide enough way for it to be useful as a parameter of such higher-order functions (e.g. a process would need to support returning scalar values and booleans given a specific projection of a subset of an image collection).

jdries commented 6 years ago

To me it seems like we can continue using the already defined specific functions, which would not prevent us from adding generic filter/aggregate functions at a later time. For me, functional programming only becomes useful when you can pass truly custom filtering and aggregation functions (UDFs) in the process graph, but then we need a language-agnostic way to define these functions.

GreatEmerald commented 6 years ago

When it comes to the API, I'm in favour of specific functions. The back-ends, if they use functional programming, can easily use generic functions internally and export all the specific variants as wrappers.

The issue with generic functions was nicely demonstrated by @m-mohr in the process discovery part: if we have a generic filter function, it could take either a bounding box or a set of dates, and you can't typecheck that. The second argument becomes a generic "something that satisfies the type you gave in the first argument", but that type is undefined. If you nevertheless try to define the variable types for each particular argument combination (as demonstrated in the OP), you are back to defining specific functions anyway, in which case they ought to be separate functions so that the types can easily be checked and understood by users.

In R, we have nice ways of dealing with that via object types: we can have a function filter.Extent() and a function filter.Date(), and users can simply call filter(MyExtentObject); the correct version of the function is selected by checking the class of MyExtentObject. But the functions themselves are specific, because they have different arguments, and users can also opt to call filter.Extent(MyExtentObject) directly if they want to make sure the right specific function is used.
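The R-style dispatch described above has a rough analogue in Python's functools.singledispatch; a small sketch (all class and function names hypothetical, not openEO API):

```python
# Rough Python analogue of R's S3 dispatch: one generic `filter_`,
# with the specific implementation selected by the argument's type.
from dataclasses import dataclass
from functools import singledispatch

@dataclass
class Extent:
    left: float
    right: float
    top: float
    bottom: float

@dataclass
class DateRange:
    start: str
    end: str

@singledispatch
def filter_(arg):
    raise TypeError(f"no filter for {type(arg).__name__}")

@filter_.register
def _(arg: Extent):
    return f"spatial filter {arg.left}..{arg.right}"

@filter_.register
def _(arg: DateRange):
    return f"temporal filter {arg.start}..{arg.end}"

print(filter_(DateRange("2017-01-01", "2017-01-31")))  # -> temporal filter 2017-01-01..2017-01-31
```

As in R, the user-facing entry point is generic, but each registered implementation is a specific function with its own well-defined arguments.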

Another issue with generic functions in the API is what happens if a back-end does not implement every possible variant. If these are separate processes, the missing ones would simply not be listed during process discovery. But with a generic process, you can't tell which functionality it supports and which it doesn't.

Also, there are some cases where it makes sense for functions to be generic. The map case in GEE, or the calc/*apply case in R, is a good example. But in these cases the arguments are well defined; it's just that the second argument is a function that is run over the values of the first argument.

m-mohr commented 5 years ago

Seems to be solved by our approach that we discussed in the VITO sprint.