Open-EO / openeo-api

The openEO API specification
http://api.openeo.org
Apache License 2.0

UDF Usage #198

Closed: flahn closed this issue 4 years ago

flahn commented 5 years ago

I'm currently trying to catch up with the UDF implementations for R and Python in order to change/improve the R implementation.

While going through the issues on the UDFs and the related issues in openeo-processes, I think we have major problems with users getting into UDFs. As stated in Open-EO/openeo-processes#42 and Open-EO/openeo-udf#10, a user who wants to implement a successfully running UDF is lacking the following information:

  1. The dimension specification of the data / data cube at a given point during the execution of a process graph (see the sketch after this list)
  2. How the data has to be structured, in terms of that dimension specification, when the UDF result is injected back into the data stream of the process graph
  3. How they can test their UDF implementation scripts (where: endpoint; with what: sample data; what is expected to be returned)
  4. Whether the data is chunked (i.e. the UDF does not work on the whole dataset, but e.g. on a spatial tile as a raster time series)
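
To make point 1 more concrete, here is a purely hypothetical sketch of what such a dimension specification could look like at one point in a process graph, loosely modelled on the data cube metadata openEO exposes for collections; all names and values are illustrative, not an existing API structure:

```python
# Hypothetical dimension specification of a data cube at one point in a
# process graph (illustrative names and values only).
cube_dimensions = {
    "x": {"type": "spatial", "axis": "x", "extent": [6.5, 7.5], "reference_system": 4326},
    "y": {"type": "spatial", "axis": "y", "extent": [50.0, 51.0], "reference_system": 4326},
    "t": {"type": "temporal", "extent": ["2018-01-01", "2018-12-31"], "step": "P10D"},
    "bands": {"type": "bands", "values": ["B04", "B08"]},
}
```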

Therefore I would like to bring up the following ideas for discussion:

I have no possible solution for 4 yet, since this will mostly be an optimization within the back-ends when it comes to parallelization.

m-mohr commented 5 years ago

I have no solution yet, but this is something that would need to be tackled not only by the API, but also by the UDF API and probably also by the process specifications. Quite a big issue that we should tackle sooner rather than later. Maybe a good point to discuss in the 3rd-year planning. A somewhat related issue: https://github.com/Open-EO/openeo-udf/issues/4

pramitghosh commented 5 years ago

The 1st point raised by @flahn is indeed an important one in my opinion. Tracking the change in dimensions throughout the process graph would be great: the client could then query the back-end at any point to learn the dimensions, so that the UDF author could code his/her UDF easily. However, since a process graph could contain multiple UDF calls, the back-end has to somehow know what the dimensions would be after executing the UDF. Since there is no way to calculate this beforehand, it rests upon the user to tell the back-end. I had proposed annotations in the UDF script which could be parsed by the back-end before dispatching the data and code for execution in the UDF service. I believe this could also be useful for attempts to parallelize the execution of UDFs.

huhabla commented 5 years ago

To points 1 and 2: Within Python UDFs the libraries numpy, xarray and geopandas are used, so that the back-end is able to analyse the resulting data dimensions of the processing. In addition, the UDF developer can provide new spatio-temporal extents and resolutions in the resulting UDF data objects. This is useful, for example, for aggregation operations or if the back-end splits the data into tiles and the UDF changes the spatial resolution of the tiles. Hence, the back-end can always detect what kind of operation was performed by the UDF by analysing the metadata of the resulting numpy arrays, xarray and geopandas data frames and comparing it to the specification of the input data. It can detect dimension reduction, spatial resolution changes, temporal aggregation and spatial extent modification.
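
As an illustration of this metadata comparison, here is a minimal sketch of how a back-end could diff the dimensions of a UDF result against the input; it uses plain numpy/xarray and leaves the openeo-udf data structures out:

```python
# Minimal sketch: detect what a UDF did by comparing dimension metadata of the
# input cube and the returned cube (plain xarray, no openeo-udf classes).
import numpy as np
import xarray as xr

# Input handed to the UDF: a small (t, y, x) raster time series.
cube_in = xr.DataArray(np.random.rand(4, 8, 8), dims=("t", "y", "x"), name="B04")

# Example UDF result: the time dimension was reduced (e.g. a temporal median).
cube_out = cube_in.median(dim="t")

# Diff the dimension metadata of input and output.
reduced_dims = set(cube_in.dims) - set(cube_out.dims)
resolution_changed = any(cube_in.sizes[d] != cube_out.sizes[d] for d in cube_out.dims)
print("reduced dimensions:", reduced_dims)               # {'t'}
print("spatial resolution changed:", resolution_changed)  # False
```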

To point 3: A UDF Python developer should clone the UDF repository and implement his functions and tests based on the existing repository infrastructure. This can be done without a running dedicated back-end, since the UDF Python implementation is designed to run independently of any back-end. Using this approach, the developer has every possibility to check and debug his UDF code on a local machine or in a Docker image, and he has full control over the input data and the processing result.

To point 4: Chunking data is the job of the back-end. However, in case the back-end generates tiles and the UDF code is designed to require tile overlapping, the back-end must know how many pixels of the tiles should overlap and what tile sizes should be sent to the UDF processes. I am not sure if we are able to implement a good generic approach for this issue. However, the back-end is always able to parse the docstring of the Python code and detect specific keywords to set up the UDF tile size and overlap. Hence, the UDF developer must specify keywords to describe the requirements of his algorithm. But what keywords should be used? This completely depends on the implemented algorithm. A generic approach may be to specify the tile size limits, the overlap limits or the dimension reduction with keywords.

How about something like this in the Python code:

```python
# This is the definition of the expected sizes and dimensions of the implemented UDF
#! expected_data_object = RasterTileCollection
#! expected_dimensions = ['t', 'y', 'x']
#! tile_size_min = [1, 7, 9]        # in time units and pixels
#! tile_size_max = [365, 512, 512]  # in time units and pixels
#! min_overlap = [0, 3, 4]          # in time units and pixels
#! reduced_dimension = 't'
#! resolution_change = False

def rct_time_median(udf_data: UdfData):
    """Reduce the time dimension of each tile and compute the median for each
    pixel over time."""
    pass
```
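
To sketch the back-end side of this keyword idea, here is a minimal, hypothetical parser for such `#!` annotations. The function name and the exact comment convention are assumptions for illustration, not part of an existing openEO implementation:

```python
# Hypothetical sketch: extract "#!" keyword annotations from a UDF script
# before dispatching data to the UDF service.
import ast

def parse_udf_annotations(udf_source: str) -> dict:
    """Collect key/value pairs from comment lines starting with '#!'."""
    annotations = {}
    for line in udf_source.splitlines():
        line = line.strip()
        if not line.startswith("#!"):
            continue
        key, _, value = line[2:].partition("=")
        value = value.split("#", 1)[0].strip()  # drop trailing comments
        try:
            annotations[key.strip()] = ast.literal_eval(value)  # lists, bools, strings
        except (ValueError, SyntaxError):
            annotations[key.strip()] = value  # keep bare identifiers as plain strings
    return annotations

example = "#! reduced_dimension = 't'\n#! tile_size_max = [365, 512, 512]  # pixels\n"
print(parse_udf_annotations(example))
# {'reduced_dimension': 't', 'tile_size_max': [365, 512, 512]}
```
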
flahn commented 5 years ago

> To points 1 and 2: Within Python UDFs the libraries numpy, xarray and geopandas are used, so that the back-end is able to analyse the resulting data dimensions of the processing. In addition, the UDF developer can provide new spatio-temporal extents and resolutions in the resulting UDF data objects. This is useful, for example, for aggregation operations or if the back-end splits the data into tiles and the UDF changes the spatial resolution of the tiles. Hence, the back-end can always detect what kind of operation was performed by the UDF by analysing the metadata of the resulting numpy arrays, xarray and geopandas data frames and comparing it to the specification of the input data. It can detect dimension reduction, spatial resolution changes, temporal aggregation and spatial extent modification.

The issue I raised aims more towards a language-independent approach. I believe that Python's frameworks are able to detect dimensional changes, but how would you model this in the JSON serialization? Do we need that, or can we safely make dimensionality checks regardless of the back-ends' and UDF services' programming languages? Today I tried to come up with a new data model for the UDF data (Open-EO/openeo-udf#14) that uses the data cube model as its main assumption and relies on a better dimension description. Maybe this is useful in this regard as well.
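
Purely as an illustration of what such a language-independent, dimension-aware serialization could look like (this is not the actual proposal in Open-EO/openeo-udf#14; all field names are made up):

```python
# Illustrative only: a JSON-style payload, shown here as a Python dict, that
# carries an explicit dimension description next to the data, so dimensionality
# checks do not depend on the programming language of back-end or UDF service.
udf_payload = {
    "datacube": {
        "id": "example_cube",
        "dimensions": [
            {"name": "t", "values": ["2018-06-01", "2018-06-11", "2018-06-21"]},
            {"name": "y", "extent": [50.0, 51.0], "number_of_cells": 100},
            {"name": "x", "extent": [6.5, 7.5], "number_of_cells": 100},
        ],
        "data": "...",  # pixel values nested in (t, y, x) order, omitted here
    }
}
```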

> To point 3: A UDF Python developer should clone the UDF repository and implement his functions and tests based on the existing repository infrastructure. This can be done without a running dedicated back-end, since the UDF Python implementation is designed to run independently of any back-end. Using this approach, the developer has every possibility to check and debug his UDF code on a local machine or in a Docker image, and he has full control over the input data and the processing result.

I miswrote point 3. What I meant to say was that the UDF function or UDF code needs to be tested, and for this you also need input data to really test it. Then, of course, you can either use the UDF implementation or at least the framework that holds the data in your desired programming language.

> To point 4: Chunking data is the job of the back-end. However, in case the back-end generates tiles and the UDF code is designed to require tile overlapping, the back-end must know how many pixels of the tiles should overlap and what tile sizes should be sent to the UDF processes. I am not sure if we are able to implement a good generic approach for this issue.

I agree. Data preparation should be part of the back-ends. Maybe we have to be aware that the UDF implementations' goal is quite simple: they just run a particular script on the given data and that's all.

> However, the back-end is always able to parse the docstring of the Python code and detect specific keywords to set up the UDF tile size and overlap. Hence, the UDF developer must specify keywords to describe the requirements of his algorithm. But what keywords should be used? This completely depends on the implemented algorithm. A generic approach may be to specify the tile size limits, the overlap limits or the dimension reduction with keywords.

The back-end is definitely able to parse the code, but for my taste this is quite hard-coded. And what about code in R and potentially other programming languages? I assume that we might need additional parameters in the run_udf process, but maybe we already have a way to pass this information on with the context object described in the run_udf documentation. The script developer would then be aware of potential overlapping and chunking and would have a way to control it. Do we need to specify those parameters more explicitly? The back-ends need to deal with stitching the UDF results back together and with optimizing the use and integration of UDF services into their service infrastructure.

huhabla commented 5 years ago

> The issue I raised aims more towards a language-independent approach. I believe that Python's frameworks are able to detect dimensional changes, but how would you model this in the JSON serialization? Do we need that, or can we safely make dimensionality checks regardless of the back-ends' and UDF services' programming languages? Today I tried to come up with a new data model for the UDF data (Open-EO/openeo-udf#14) that uses the data cube model as its main assumption and relies on a better dimension description. Maybe this is useful in this regard as well.

The JSON definitions of RasterCollectionTiles and HyperCubes contain the spatial extent/dimension definitions and the temporal extents. Hence, the back-end can check the spatio-temporal extents of the result before it transforms the JSON response from the UDF REST service into its own data format. Using keywords in the UDF code allows the back-end to prepare the correct data for the request and to interpret the response data.

> To point 3: A UDF Python developer should clone the UDF repository and implement his functions and tests based on the existing repository infrastructure. This can be done without a running dedicated back-end, since the UDF Python implementation is designed to run independently of any back-end. Using this approach, the developer has every possibility to check and debug his UDF code on a local machine or in a Docker image, and he has full control over the input data and the processing result.

> I miswrote point 3. What I meant to say was that the UDF function or UDF code needs to be tested, and for this you also need input data to really test it. Then, of course, you can either use the UDF implementation or at least the framework that holds the data in your desired programming language.

We can define a test dataset that each back-end needs to provide for testing purposes. Or we can encourage UDF developers to implement unit tests and generate the data for their algorithms within the unit tests. There are some test examples available that use this approach. It allows testing the algorithms independently of the back-end.
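
A hedged sketch of that unit-test approach, using a stand-in temporal-median function on a plain xarray cube instead of the openeo-udf data classes:

```python
# The test generates its own small input cube, runs the UDF logic and checks
# the result; no back-end is involved.
import unittest
import numpy as np
import xarray as xr

def temporal_median(cube: xr.DataArray) -> xr.DataArray:
    """Reduce the time dimension by taking the per-pixel median."""
    return cube.median(dim="t")

class TemporalMedianTest(unittest.TestCase):
    def test_reduces_time_dimension(self):
        cube = xr.DataArray(np.ones((5, 4, 4)), dims=("t", "y", "x"))
        result = temporal_median(cube)
        self.assertNotIn("t", result.dims)
        self.assertEqual(result.shape, (4, 4))

if __name__ == "__main__":
    unittest.main()
```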

> To point 4: Chunking data is the job of the back-end. However, in case the back-end generates tiles and the UDF code is designed to require tile overlapping, the back-end must know how many pixels of the tiles should overlap and what tile sizes should be sent to the UDF processes. I am not sure if we are able to implement a good generic approach for this issue.

> I agree. Data preparation should be part of the back-ends. Maybe we have to be aware that the UDF implementations' goal is quite simple: they just run a particular script on the given data and that's all.

Yes, UDFs should only care about in-memory data crunching using well-established processing libraries. No I/O or parallelism in the code; I/O and parallelism are the responsibility of the back-end.

> However, the back-end is always able to parse the docstring of the Python code and detect specific keywords to set up the UDF tile size and overlap. Hence, the UDF developer must specify keywords to describe the requirements of his algorithm. But what keywords should be used? This completely depends on the implemented algorithm. A generic approach may be to specify the tile size limits, the overlap limits or the dimension reduction with keywords.

> The back-end is definitely able to parse the code, but for my taste this is quite hard-coded. And what about code in R and potentially other programming languages? I assume that we might need additional parameters in the run_udf process, but maybe we already have a way to pass this information on with the context object described in the run_udf documentation. The script developer would then be aware of potential overlapping and chunking and would have a way to control it. Do we need to specify those parameters more explicitly? The back-ends need to deal with stitching the UDF results back together and with optimizing the use and integration of UDF services into their service infrastructure.

Be aware that a UDF is code. Usually the developer of this code is aware of its requirements and can specify them within the code as comments with keywords. These keywords tell the back-end how to prepare the UDF request data and what to expect as a response. In case the algorithm is more generic, the UDF caller should be able to pass arguments to the UDF. These arguments should be specified in the run_udf() process as key-value pairs (as the context object, for example) and passed on by the back-end in the UDF request to the UDF REST server. I am implementing a specific data structure in the JSON request that will store the arguments for the UDF. A UDF can access the user-specific arguments in the provided data structure and dynamically modify the algorithm (kernel size, weights, ...).

A two-way communication is required:

- UDF code -> back-end: tells the back-end how to prepare the request and what to expect as a response
- user -> back-end -> UDF: tells the UDF code how to modify the algorithm based on user requirements
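
To illustrate the user -> back-end -> UDF direction, a hedged sketch of how arguments could travel through the context parameter of run_udf. The process-graph node below uses the documented run_udf parameters; how the context is handed to the UDF function (here as a plain dict) is an assumption for illustration:

```python
# Process-graph node invoking a UDF with user-supplied arguments in `context`.
run_udf_node = {
    "process_id": "run_udf",
    "arguments": {
        "data": {"from_node": "load_collection_1"},  # hypothetical upstream node
        "udf": "... UDF source code or URL ...",
        "runtime": "Python",
        "context": {"kernel_size": 5, "weights": [0.1, 0.2, 0.4, 0.2, 0.1]},
    },
}

# Inside the UDF, the back-end would expose those arguments; here they are
# assumed to arrive as a plain dict so the algorithm can adapt itself.
def apply_weighted_kernel(cube, context: dict):
    kernel_size = context.get("kernel_size", 3)
    weights = context.get("weights")
    # ... apply a smoothing kernel of the requested size/weights to `cube` ...
    return cube
```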

m-mohr commented 4 years ago

Closing, this seems stale and not part of the Core API. Probably something to consider either in processes or the UDF repos.