Open-EO / openeo-api

The openEO API specification
http://api.openeo.org
Apache License 2.0

Language-agnostic UDFs #87

Closed m-mohr closed 5 years ago

m-mohr commented 6 years ago

So we are working on different implementations for UDFs.

It would be good to exchange ideas in the next weeks and add them here.

I am currently stripping the API down to only include endpoints that are considered to be at least in a release-candidate state for the v0.3 release. UDFs still seem to be in an early alpha state, at least in the API definition. Therefore I removed the current specification:

openapi: 3.0.1
tags:
  - name: UDF Runtime Discovery
    description: >-
      Discovery of programming languages and their runtime environments to
      execute user-defined functions at the back-end.
paths:
  /udf_runtimes:
    get:
      summary: >-
        Returns the programming languages including their environments and UDF
        types supported.
      description: >-
        Describes how custom user-defined functions can be exposed to the data
        and which programming languages and environments are supported by the
        back-end.
      tags:
        - UDF Runtime Discovery
      security:
        - {}
        - Bearer: []
      responses:
        200:
          description: Description of UDF runtime support
          content:
            application/json:
              schema:
                type: array
                items:
                  type: object
                  description: >-
                    A map with language identifiers such as `R` as keys and an
                    object that defines available versions, extension packages,
                    and UDF schemas.
                  additionalProperties:
                    type: object
                    properties:
                      udf_types:
                        type: array
                        items:
                          $ref: '#/components/schemas/udf_type'
                      versions:
                        type: object
                        description: >-
                          A map with version identifiers as keys and an object
                          value that specifies which extension packages are
                          available for the particular version.
                        additionalProperties:
                          description: >-
                            Extension package identifiers that should include
                            their version number such as `'sf__0.5-4'`
                          properties:
                            packages:
                              type: array
                              items:
                                type: string
                          type: object
              examples:
                response:
                  value:
                    R:
                      udf_types:
                        - reduce_time
                        - reduce_space
                        - apply_pixel
                      versions:
                        3.1.0:
                          packages:
                            - Rcpp_0.12.10
                            - sp_1.2-5
                            - rmarkdown_1.6
                        3.3.3:
                          packages:
                            - Rcpp_0.12.10
                            - sf_0.5-4
                            - spacetime_1.2-0
        4XX:
          $ref: '#/components/responses/client_error_auth'
        5XX:
          $ref: '#/components/responses/server_error'
  '/udf_runtimes/{lang}/{udf_type}':
    parameters:
      - name: lang
        in: path
        description: Language identifier such as `R`
        required: true
        schema:
          type: string
          enum:
            - python
            - R
      - name: udf_type
        in: path
        description: >-
          The UDF types define how UDFs can be exposed to the data, how they can
          be parallelized, and how the result schema should be structured.
        required: true
        schema:
          type: string
          enum:
            - apply_pixel
            - apply_scene
            - reduce_time
            - reduce_space
            - window_time
            - window_space
            - window_spacetime
            - aggregate_time
            - aggregate_space
            - aggregate_spacetime
    get:
      summary: Returns the process description of UDF schemas.
      description: >-
        Returns the process description of UDF schemas, which offer different
        possibilities for how user-defined scripts can be applied to the data.
      tags:
        - UDF Runtime Discovery
      security:
        - {}
        - Bearer: []
      responses:
        200:
          description: Process description
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/udf_description'
              examples:
                response:
                  value:
                    process_id: /udf/R/reduce_time
                    description: >-
                      Applies the given R script on all time series of the input
                      imagery. The R script gets pixel values (all bands) of
                      complete time series as input and must result in a single
                      value or tuple for multiple bands.
                    args:
                      imagery:
                        description: input image time series
                      script:
                        description: R script that will be applied over time series
        4XX:
          $ref: '#/components/responses/client_error_auth'
        5XX:
          $ref: '#/components/responses/server_error'
components:
  schemas:
    udf_type:
      type: string
      description: >-
        The UDF types define how UDFs can be exposed to the data, how they can
        be parallelized, and how the result schema should be structured.
      enum:
        - apply_pixel
        - apply_scene
        - reduce_time
        - reduce_space
        - window_time
        - window_space
        - window_spacetime
        - aggregate_time
        - aggregate_space
        - aggregate_spacetime
    udf_description:
      description: >-
        Defines and describes a UDF using the same schema as the description of
        processes offered by the back-end.
      type: object
      required:
        - process_id
        - description
      properties:
        process_id:
          type: string
          description: The unique identifier of the process.
        description:
          type: string
          description: >-
            A short and concise description of what the process does and what
            the output looks like.
        link:
          type: string
          description: >-
            Reference to an external process definition if the process has been
            defined over different back ends within OpenEO
        args:
          type: object
          additionalProperties:
            type: object
            required:
              - description
            properties:
              description:
                type: string
                description: A short and concise description of the process argument.
              required:
                type: boolean
                default: true
                description: Defines whether an argument is required or optional.
            additionalProperties: true
      example:
        process_id: udf/R/reduce_time
        description: >-
          Applies an R function independently over all input time series that
          produces a zero-dimensional value (scalar or multi-band tuple) as
          output (per time series).
        args:
          imagery:
            description: input (image) time series
            required: true
          script:
            description: 'Script resource that has been uploaded to user space before.'
            required: true
jdries commented 6 years ago

Hi Matthias,

my current plan for UDFs is to start supporting the proposal of @huhabla in the GeoPySpark backend.

Hope to have feedback and something working by the hackathon.

best regards, Jeroen

m-mohr commented 6 years ago

@GreatEmerald shared in our chat:

During the Proba-V Symposium some of our team got to talk to Leslie Gale from Space Applications, who shared a bit about what they have achieved in the EOPEN project so far. There was a demonstrator of how they tackled the issue of UDFs: they have a web frontend for generating Docker definition files. The user selects which dependencies to deploy, and the site generates a boilerplate Docker definition that you can then edit, or you can upload your script to be included in it. Those files then get uploaded to a backend and the processing is done. Leslie said that we could just reuse the same solution in openEO as well; the code is open and out there. Perhaps there would also be a way to integrate EOPEN as a frontend or so in openEO.

GreatEmerald commented 6 years ago

Yes, I'll post more info on that once I get a reply.

Looking at the issue of language-specific UDF support vs something based on Docker, it feels to me that the former currently is aimed at relatively basic processing (e.g. computing vegetation indices), as opposed to something complex (e.g. running custom time series breakpoint analysis with specific R package versions).

On the one hand, as far as I can tell, the simpler language-specific approach is what was initially envisioned for openEO; on the other hand, a Docker-based solution could be quite a bit more flexible (if more difficult for the user to set up).

huhabla commented 6 years ago

Hi,

2018-06-07 17:23 GMT+02:00 Dainius Masiliūnas notifications@github.com:

Yes, I'll post more info on that once I get a reply.

Looking at the issue of language-specific UDF support vs something based on Docker, it feels to me that the former currently is aimed at relatively basic processing (e.g. computing vegetation indices), as opposed to something complex (e.g. running custom time series breakpoint analysis with specific R package versions).

The Python reference implementation [1] uses numpy, pandas, geopandas, shapely and pygdal for processing and therefore supports a wide range of applications. All these libraries can/must be used in Python UDF implementations. In addition, we want to support pre-trained pytorch, tensorflow and scikit-learn models to allow the application of machine-learning models to geographical data.

On one hand, as far as I can tell the simpler language-specific approach was what was initially envisioned for openEO, but then a Docker-based solution could be quite a bit more flexible (if more difficult for the user to set up).

The current UDF approach can be deployed using Docker as well; the reference implementation already supports Docker deployment. However, Docker is just a method to separate processing environments. Maybe the backend providers should decide whether they want to deploy a Docker Swarm environment for UDFs or not?

Indeed, the most flexible approach would be to support "user-defined docker containers" (u2dc) that mount the backend data for processing. In this case we would not need to provide an OpenEO UDF API; the user is free to deploy any environment in the container that suits them. We would just need to specify how the backend should provide readable and writable data to be mounted in the container.

Best regards Sören

[1] https://github.com/Open-EO/openeo-udf

pramitghosh commented 6 years ago

In order to maintain the interoperability of the UDFs with the backends, I believe some conventions have to be agreed upon. I was thinking more in terms of a file-based system for transferring data to and from the backends for executing the UDFs. So, for example, the I/O for rasters could be single-band GeoTIFFs (or Cloud Optimized GeoTIFFs, as suggested by @m-mohr) in a specific directory structure and/or naming convention, with some more generic (say ASCII) files for metadata. Something similar could be thought of for feature and time-series data too. A brief description of the strategy I am planning is here: https://github.com/pramitghosh/OpenEO.R.UDF#general-strategy. The backends need to provide the input in consistent formats, and the UDFs need to write their outputs to disk in consistent formats so that the backends can read them back again. Also, I think keeping the formats generic would help to ensure compatibility. I am already coordinating with @flahn on this for the R backend, and it would be good to discuss these issues with everyone for the other backends too. The issue of external dependencies could be solved, at least for the R implementation, if the user provides them as a comma-separated string in a text file along with the script, for example.

I was wondering whether @huhabla is thinking along similar lines for the UDF functionality using Python.

Some important points to ponder upon:

  • Whether to use a file-based system for interacting back and forth with the backends?
  • Directory structure and/or file-naming conventions for the data being written to disk. For UDFs from the perspective of R, my current draft implementation reads something like this (surely would change over the course of development)
  • A number of low-level decisions - e.g. handling of the dimensionality of the data (how many and what dimensions would be supported), their semantics (e.g. multi-band GeoTIFFs where the bands do not correspond to the usual spectral bands but, say, just min, median and max over a timeseries) etc.

Some personal opinions on using Docker and Language-agnostic UDFs:

  1. One thing I wanted to point out is the use of Docker for the UDFs. Since UDFs (whether using Docker or not) will involve reading and writing data to disk, this is bound to be time-consuming in my opinion. I'm not sure, but on top of this, introducing Docker might cause significant performance issues, which will in turn probably affect the user monetarily.
  2. Furthermore, Docker might be too fancy for the subset of potential OpenEO users without advanced programming knowledge who would prefer to keep their UDFs simple.
  3. Correct me if I'm wrong, but I believe making UDFs language-agnostic would imply that users could, in principle, structure their output in any format using any language of their choice, which then has to be read in by the backend. I think this could be a problem: if there is some inconsistency in the UDF output structure/format such that it is not parsable by the backend that called it, an error would only be thrown after the whole UDF has been processed, which would mean a loss of computing time.

I would love to have everyone's opinion on these, at least on the strategy for the UDFs for now, since the UDFs are intricately connected to a number of other components as evident from today's (7.6.18) telco. Thanks!

huhabla commented 6 years ago

Hi,

2018-06-07 19:30 GMT+02:00 Pramit Ghosh notifications@github.com:

In order to maintain the interoperability of the UDFs with the backends, I believe some conventions are to be agreed upon. I was thinking more in the terms of a file-based system for transferring data to and from the backends for executing the UDFs. So, for example, the I/O for rasters could be in single-band GeoTIFFs (or Cloud Optimized GeoTIFFs as suggested by @m-mohr https://github.com/m-mohr) in a specific directory structure and/or naming conventions with some more generic (say ASCII) files for metadata. Something similar could be thought of for feature and timeseries data too. A brief description of the strategy I am planning is here: https://github.com/pramitghosh/OpenEO.R.UDF#general-strategy. The backends need to provide the input in consistent formats and also the UDFs need to write the outputs to disk in consistent formats so that the backends could read it back again. Also, I think keeping the formats generic would help to ensure compatibility. I am already coordinating with @flahn https://github.com/flahn for the R backend regarding this and it would be to discuss these issues with everyone for the other backends too. The issue of external dependencies could be solved, at least for the implementation using R, if the user provides them as a comma separated string in a text file along with the script, for example.

I was wondering if @huhabla https://github.com/huhabla is thinking in similar directions as well for the UDF functionality using Python.

The approach that I have implemented includes a file-based system. However, the UDF API is designed to be independent of the data formats of the backends. GRASS GIS, for example, has its own raster and vector formats. A file-based approach using GeoTIFFs or other formats introduces costly I/O effort for data conversion/export/import.

I think that the backends know best how to handle their data efficiently, so I designed an abstract data representation (documented with Swagger 2.0) that is independent of the programming language and can be represented using JSON [1]. Any programming language that has basic datatypes like lists, arrays, strings, floats and maps can implement this API.

I implemented a Python UDF reference prototype [2], based on the abstract definition, that makes use of the most common and powerful geo-libraries available in Python: numpy, pandas, geopandas, shapely and pygdal, which all depend on GEOS, GDAL and OGR. For example, a raster time series is represented as a three-dimensional numpy array that can be used with many other Python libraries. Vector data is represented as a GeoDataFrame that provides many functions implemented in GEOS (buffer, overlay, area, topology check, ...). The Python UDF prototype allows back-and-forth conversion between the Python datatypes and their JSON representations.
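
To make the idea more concrete: the sketch below is a deliberately simplified, hypothetical stand-in for the tile concept described above (the actual classes live in openeo_udf/api/base.py [2] and differ in detail). It shows a single-band raster tile indexed as [time][y][x] together with a JSON round trip.

# Simplified, hypothetical sketch of the tile concept described above; the real
# implementation lives in openeo_udf/api/base.py [2] and differs in detail.
import json
import numpy as np

class SimpleRasterTile:
    """A single-band raster tile, indexed as [time][y][x]."""

    def __init__(self, id, data, start_times, end_times, extent):
        self.id = id                    # band or tile identifier, e.g. "temperature"
        self.data = np.asarray(data)    # 3D array: [time][y][x]
        self.start_times = start_times  # ISO 8601 strings, one per time slice
        self.end_times = end_times
        self.extent = extent            # dict: north, south, east, west, width, height

    def to_dict(self):
        # JSON-serializable representation, similar in spirit to the Swagger schema
        return {"id": self.id, "data": self.data.tolist(),
                "start_times": self.start_times, "end_times": self.end_times,
                "extent": self.extent}

    @classmethod
    def from_dict(cls, d):
        return cls(d["id"], d["data"], d["start_times"], d["end_times"], d["extent"])

# Round trip: Python object -> JSON string -> Python object
tile = SimpleRasterTile(
    id="temperature",
    data=np.zeros((2, 3, 3)),
    start_times=["2001-01-01T00:00:00", "2001-01-02T00:00:00"],
    end_times=["2001-01-02T00:00:00", "2001-01-03T00:00:00"],
    extent={"north": 53.0, "south": 50.0, "east": 10.0, "west": 7.0, "width": 1.0, "height": 1.0},
)
restored = SimpleRasterTile.from_dict(json.loads(json.dumps(tile.to_dict())))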

I implemented two approaches to run UDFs based on the Python prototype:

The executable [3] is designed to work with files of different formats and to apply the Python UDF to the data. It reads all data into memory at the moment. However, I will redesign it so that only tiles of raster and vector data are read into memory and distributed in parallel to the UDF for processing. This reduces memory issues and makes use of multi-core systems.

This executable can be used with file-based systems and can run in a Docker container.

The UDF REST service [4] can be used by any application (e.g. JavaScript-based) that manages its data as JSON, or can simply convert it into JSON, to execute Python UDFs on its data.

[1] https://github.com/Open-EO/openeo-udf/blob/master/src/openeo_udf/server/definitions.py [2] https://github.com/Open-EO/openeo-udf/blob/master/src/openeo_udf/api/base.py

[3] https://github.com/Open-EO/openeo-udf#using-the-udf-command-line-tool [4] https://github.com/Open-EO/openeo-udf#using-the-udf-server
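
Purely as an illustration of the interaction pattern with such a REST service: the endpoint URL and payload field names below are placeholders, not the actual API; the authoritative definition is the Swagger 2.0 description in [1].

# Rough sketch of a client talking to a UDF REST service; the URL, endpoint path
# and payload field names are placeholders, not the actual API definition.
import requests

udf_code = """
def reduce_time_median(udf_data):
    # toy example: reduce the time dimension of each raster tile to its median
    ...
"""

payload = {
    "code": {"language": "python", "source": udf_code},
    "data": {
        "raster_collection_tiles": [{
            "id": "temperature",
            "data": [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],  # [time][y][x]
            "start_times": ["2001-01-01T00:00:00", "2001-01-02T00:00:00"],
            "end_times": ["2001-01-02T00:00:00", "2001-01-03T00:00:00"],
            "extent": {"north": 53, "south": 51, "east": 10, "west": 8, "width": 1, "height": 1},
        }]
    },
}

response = requests.post("http://localhost:5000/udf", json=payload, timeout=60)
response.raise_for_status()
result = response.json()  # a UdfData-style JSON structure comes back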

Some important points to ponder upon:

  • Whether using files-based systems for interacting back and forth with the backends?
  • Directory structure and/or file-naming conventions of the data being written to disk. For UDFs from the perspective of R, my current draft implementation reads something like this (surely would change over the course of development)
  • A number of low-level decisions - e.g. handling of the dimensionality of the data (how many and what dimensions would be supported), their semantics (e.g. multi-band GeoTIFFs where the bands do not correspond to the usual spectral bands but, say, just min, median and max over a timeseries) etc.

I think the approach I have implemented takes care of these topics/issues. It supports file-based approaches, and it defines abstract datatypes that take care of the dimensionality and multi-band data. It supports raster, vector and general structured data, such as the statistical analysis results of multidimensional arrays [1]. I have implemented several UDFs to demonstrate its capabilities [2].

[1] https://github.com/Open-EO/openeo-udf/blob/master/src/openeo_udf/functions/raster_collections_statistics.py [2] https://github.com/Open-EO/openeo-udf/tree/master/src/openeo_udf/functions

Some personal opinions on using Docker and Language-agnostic UDFs:

  1. One thing I wanted to point out is the use of Docker for the UDFs. Since UDFs (whether using Docker or not) will involve reading and writing data to disk it is bound to be time-consuming in my opinion. I'm not sure but on top of this, introducing Docker might result in big performance issues which will in turn probably affect the user monetarily.

I think a file-based system will always involve reading data from and writing data to disk, independently of Docker.

  1. Furthermore, Docker might be too fancy for a small subset of potential OpenEO users without advanced programming knowledge who would prefer to keep their UDFs simple.
  2. Correct me if I'm wrong but I believe making UDFs language-agnostic would imply the users could, in principle, structure their output in any format using any language of their choice which will have to be then read in by the backend. I think this could be a problem since in case there is some inconsistency in the UDF output structure/format such that it is not parsable by the backend which called it, it would throw an error after processing the whole UDF which would mean loss of computing time.

In my opinion, the backend provider should decide whether to use Docker for process-environment separation. The UDF developer should not have to care about this.

Best regards Sören

pramitghosh commented 6 years ago

Dear Sören,

Thanks @huhabla for describing your approach for UDFs in Python. You are right that the backends usually have their own "native" file formats, which they are more comfortable with and which might not be GeoTIFFs. As long as the format is supported by GDAL/OGR, we are good to go. I will try to incorporate this point in the R implementation too.

Just to further clarify some points regarding your Python implementation, could you please comment on whether I got the following points right:

  1. You are also using a file-based system for I/O to and from the UDF environment, but the data formats are not necessarily GeoTIFFs and could be anything readable by GDAL/OGR. These binary files are accompanied by JSON files containing some metadata about them, which would be parsed by your UDF implementation.
  2. Internally you are converting these binary files into ADTs using some standard Python-specific implementations - such as NumPy arrays for rasters - on which the user's Python script containing the UDF would be run. (However, whatever comes in or goes out is binary.)
  3. I'm not sure about this, but if the user has his/her UDF in some other language, are you then exporting the NumPy arrays (or other Python-specific data structures) to some generic format on which the UDF is run?
  4. After the UDF has run, the output is again written as binary (along with a JSON file?) for the backends to read back.

If the above points are right, I would say I am thinking along the same lines too (except for point 3 above) when looking at the UDF implementations externally as a black box - apart from a few file-format differences, e.g. currently using specific generic formats like GeoTIFF for rasters (which I could change to GDAL/OGR-readable ones without too much hassle) and using CSV for storing the metadata instead of JSON (this could be changed too, to make the implementations conform more to each other, even if not exactly).

However, one thing I am a bit concerned about is the interfacing of the UDF implementations with the different backends. I think once we find common ground regarding the I/O formats and structure, it would be easier for the backend devs to make the backends communicate with both UDF implementations.

As a side note, I'm not sure but is converting multi-temporal multi-band GeoTIFFs to simple ADTs like arrays and lists a good idea? Will this not blow up already big data significantly?

Thanks!

jdries commented 6 years ago

Note that this is the first issue where we actually start to assume that backends write files to disk at some point, except perhaps for creating a result that can be downloaded. This has a pretty major impact on the ability to do synchronous calls and web services. In the proposal we had the concept of a 'file-based API'. Should we perhaps see this issue as a first part of that API? And should this also affect other parts of the API? For instance, when working file-based, it does make sense to have an OpenSearch catalogue allowing you to search for scenes that need to be provided as input. This is different from the current 'datacube' approach. Another option might be to find a way to stream tiles that are in memory in the backend into the Docker container directly.

huhabla commented 6 years ago

Hi Jeroen,

On Fri, 8 June 2018 at 12:27, Jeroen Dries <notifications@github.com> wrote:

Note that this is the first issue where we actually start to assume that backends write files to disk at some point, except perhaps for creating a result that can be downloaded. This has a pretty major impact on the ability to do synchronous calls and web services. In the proposal we had the concept of a 'file based API'. Should we perhaps see this issue as a first part of that API? And should this also affect other parts of the API? For instance, when working file based, it does make sense to have an OpenSearch catalogue allowing you to search for scenes that need to be provided as input. This is different from the current, 'datacube', approach.

I think the UDF API should be independent of the file format of the backend, file-based or not. File-based backends must make sure to provide the data to the UDF runtime environment in memory. Hence the UDF server and the executable, which show examples of how to implement the UDF Swagger 2.0 approach in a backend.

Another option might be to find a way to stream tiles that are in memory in the backend into the docker container directly.

The UDF REST test server supports this already. You can send a JSON representation of a 3D array and vector features as GeoJSON to the server and get a JSON representation back. JSON may not be the fastest and smallest choice, however; UBJSON [1] may be.

[1] http://ubjson.org/

Best regards Sören

huhabla commented 6 years ago

Dear Pramit,

On Fri, 8 June 2018 at 12:14, Pramit Ghosh <notifications@github.com> wrote:

Dear Sören,

Thanks @huhabla https://github.com/huhabla for describing your approach for UDFs in Python. You are right that the backends usually have their own "native" file formats which they are more comfortable in which might not be GeoTIFFs. If the format is supported by GDAL/OGR we are good to go as such. I will try to incorporate this point in the implementation using R too.

Just to further clarify some points regarding your Python implementation could you please comment if I got these following points right:

  1. You are also using a file-based system for I/O to and from the UDF environment but the data formats are not necessarily GeoTIFFs but could be anything readable by GDAL/OGR. These binary files are supported by JSON files containing some metadata on them which would be parsed by your UDF implementation.
  2. Internally you are converting these binary files into ADTs using some standard Python-specific implementations - such as NumPy arrays for rasters - on which the users' Python script containing the UDF would be run. (However, whatever is coming in or going out are binary.)
  3. I'm not sure about this but if the user has his/her UDF in some other language then are you are exporting the NumPy arrays (or other Python-specific data structures) to some generic formats on which the UDF is run?
  4. After the UDF has run, the output is again written in binary (along with a JSON?) for the backends to read back.

If the above points are right I would say I am thinking on the same lines too (except point 3 above) when looking at the UDF implementations externally as a blackbox - apart from a few file format differences e.g. using specific generic formats like GeoTIFF for rasters currently (which I could change to GDAL/OGR readable ones without too much hassle), using CSV for storing the metadata instead of JSON (this could be changed too to make them conform more to each other - even if not exactly).

The UDF API is designed to be independent of the file format and storage approach of the backends. The Python prototype of the UDF API implements the Swagger 2.0 schemas SpatialExtent, RasterCollectionTile, VectorCollectionTile, StructuredData and UdfData [1] as Python objects [2] (numpy, geopandas and shapely objects). Python UDFs work on these Python objects, which provide a wide range of manipulation functions based on numpy and GEOS.

The backends are responsible for creating the UDF environment and converting their data (GeoTIFF, GRASS raster, data cubes, ...) into the UDF Python objects, and they must write the result of the UDF processing back into their database.

I have implemented the "execute_udf" program to demonstrate this backend approach. The "execute_udf" program reads GDAL/OGR-supported data formats and converts them into the UDF Python objects (in memory) that the UDF code works on. After processing, it converts the resulting Python objects back into GeoTIFF or GeoPackage files in a specific directory. The UDF code doesn't care about the data handling; the backend does. Please have a look at the UDF example implementations [3].
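
As a conceptual sketch (not the actual execute_udf code) of the read-convert-apply-write cycle described above, assuming GDAL's Python bindings and using hypothetical file names:

# Conceptual sketch of the read -> convert -> apply UDF -> write cycle; this is
# NOT the actual execute_udf code, just an illustration using GDAL's Python
# bindings, with hypothetical file names.
import numpy as np
from osgeo import gdal

def apply_udf_to_geotiff(in_path, out_path, udf):
    src = gdal.Open(in_path)
    # Read all bands into a 3D numpy array [time][y][x]; here the band axis of
    # the input file stands in for the time dimension of a raster time series.
    cube = np.stack([src.GetRasterBand(i + 1).ReadAsArray()
                     for i in range(src.RasterCount)])

    reduced = udf(cube)  # e.g. a reduce_time UDF returning a 2D [y][x] array

    driver = gdal.GetDriverByName("GTiff")
    dst = driver.Create(out_path, src.RasterXSize, src.RasterYSize, 1, gdal.GDT_Float32)
    dst.SetGeoTransform(src.GetGeoTransform())
    dst.SetProjection(src.GetProjection())
    dst.GetRasterBand(1).WriteArray(reduced.astype(np.float32))
    dst.FlushCache()

# A trivial UDF that reduces the time axis to its median:
apply_udf_to_geotiff("input_stack.tif", "median.tif", lambda cube: np.median(cube, axis=0))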

The UDF Python objects have methods to represent themselves as JSON strings and to create themselves from JSON strings. This made it easy to implement the second approach: the "UDF REST test server". This server exposes an HTTP POST endpoint. The endpoint requires that the data is provided in JSON format following the UDF API Swagger 2.0 definition (UdfData with RasterCollectionTiles and VectorCollectionTiles).

The server converts the JSON representation into Python objects and runs the UDF Python code on them. After processing, it converts the computational result from Python objects into the JSON representation and returns the result as a JSON response.

Hence, JSON is not used for metadata storage but as data exchange format in the "UDF REST test server" and only there.

If you want to have UDFs in a different language, then you must implement the Swagger 2.0 UDF API description in this language. If you use an object-oriented language, the Swagger schemas will be implemented as classes that instantiate the objects the UDF works on. Depending on the programming language, common datatypes must be used to represent raster and vector data, like arrays for raster data and tables with geometry columns for vector data. It depends on the capabilities and available libraries of the chosen programming language.

[1] https://github.com/Open-EO/openeo-udf/blob/master/src/openeo_udf/server/definitions.py [2] https://github.com/Open-EO/openeo-udf/blob/master/src/openeo_udf/api/base.py [3] https://github.com/Open-EO/openeo-udf/tree/master/src/openeo_udf/functions

However, one thing that I am concerned a bit with is the interfacing of the UDF implementations with the different backends. I think once we come to a common ground regarding the I/O formats and structure it would be easier for the backend devs to make the backends communicate more easily with both the UDF implementations.

As a side note, I'm not sure but is converting multi-temporal multi-band GeoTIFFs to simple ADTs like arrays and lists a good idea? Will this not blow up already big data significantly?

The UDF API Swagger schema RasterCollectionTile includes a three-dimensional array with index schema [time][y][x]. The timestamps for each x,y slice are stored in start- and end-time arrays. In addition, the spatial extent with pixel width and height is stored in this schema. A single RasterCollectionTile represents a single band or a scalar field like temperature. A list of tiles represents multi-band data.
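
A minimal numpy sketch of the [time][y][x] indexing convention described above, with purely illustrative values:

import numpy as np

# A tile covering 2 time steps on a 3 x 4 pixel grid, indexed as [time][y][x].
data = np.arange(2 * 3 * 4).reshape((2, 3, 4))

pixel_time_series = data[:, 1, 2]  # all time steps of the pixel at row y=1, column x=2
first_slice = data[0]              # the complete 3 x 4 raster of the first time step

# Multi-band data is represented as a list of such single-band tiles:
red_tile, nir_tile = np.zeros((2, 3, 4)), np.ones((2, 3, 4))
bands = [red_tile, nir_tile]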

Best regards Sören

huhabla commented 6 years ago

Hi,

[snip]

As a side note, I'm not sure but is converting multi-temporal multi-band GeoTIFFs to simple ADTs like arrays and lists a good idea? Will this not blow up already big data significantly?

I forgot to mention that the backend is responsible for creating the raster and feature collection tiles (Python objects) with a specific size, so that huge time-series data will not fill up the main memory. If you think of a time series as a data cube, then a tile is a small 3D subset of the cube with a problem-specific spatio-temporal extent and overlaps that allow neighborhood, resampling and aggregation operations.
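
A minimal sketch of how a backend might slice a data cube into overlapping spatial tiles before handing them to a UDF; tile size and overlap are arbitrary example values, and stitching the overlapping results back together is not shown:

import numpy as np

def iter_tiles(cube, tile_size=256, overlap=16):
    """Yield overlapping [time][y][x] tiles from a larger data cube."""
    _, height, width = cube.shape
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            y0, x0 = max(0, y - overlap), max(0, x - overlap)
            y1 = min(height, y + tile_size + overlap)
            x1 = min(width, x + tile_size + overlap)
            yield cube[:, y0:y1, x0:x1]

cube = np.random.rand(10, 1000, 1000)  # 10 time steps on a 1000 x 1000 grid
results = [np.median(tile, axis=0) for tile in iter_tiles(cube)]  # UDF applied per tile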

Best regards Sören

pramitghosh commented 6 years ago

Dear Sören,

Thanks for the detailed explanation of the Python implementation. Yes, the user's UDF need not worry about file handling (I was a bit worried about the part concerning UDFs in other languages).

So, what I get now is that the data is converted to Python objects and the user's UDF runs on those objects. The results are converted back to GDAL-readable formats for the backend. This is done by the execute_udf program. Another point is that this all happens in memory and the data is converted to Python objects by the backend.

For the R UDFs, I seem to have a similar direction except for the following major differences:

  • The data from the backend are written to disk physically (as Jeroen @jdries mentioned) at a location where both the backend and the UDF server have read/write access.
  • The data here is also converted to objects in R (which would provide the users writing the UDFs a consistent structure to work upon), but this conversion is independent of the backend since it is being done by a separate R package which is being developed (https://github.com/pramitghosh/OpenEO.R.UDF). This has to be (pre-)installed on the servers executing R UDFs. The conversion of the results back to GDAL-readable formats for the backend would be taken care of by this package too.
  • As for the UDFs themselves, the user writes their function definition in a script file and calls their own function in the same file (just as in the Python UDF examples you provided here: https://github.com/Open-EO/openeo-udf/tree/master/src/openeo_udf/functions). The only difference in the R implementation is that the user calls his/her own function as an argument to a function run_UDF() defined in the package I mentioned above.

So, the actual UDF "function", for example, could look as simple as this:

my_func = function(obj) {median(obj)}

and this could be called as

run_UDF(legend_name = "legend.csv", function_name = my_func, drop_dim = 4)

The arguments legend_name and drop_dim could eventually be omitted so that the call looks like

run_UDF(my_func)

So from the R perspective, the exporting and importing is done by the backend (I am already coordinating with @flahn regarding this), but the conversion into objects is done by the server executing the R UDF. I think writing files to disk could have some advantages in the future, such as keeping the memory cleaner, applying other tools that do something with the files in place (some sort of pre-/post-processing, for example), processing the individual files in parallel, etc.

Looking forward to hearing everyone's opinions and/or suggestions on this approach.

Thanks!

huhabla commented 6 years ago

Dear Pramit, an important difference between Python and R UDFs is that several backends have a direct Python interface, for example GRASS GIS and GeoTrellis (via geopyspark). They can directly implement the OpenEO Python UDF interface and use the library I implemented to create the UDF Python objects from their database data.

Best regards Sören

edzer commented 6 years ago

Yes, of course, but at the cost of no longer being language-agnostic. It would be a pity if we dropped that as a goal now.

huhabla commented 6 years ago

I am not sure if I understand this correctly. Why would we drop this goal now? The OpenEO UDF Swagger 2.0 description is IMHO language-agnostic. The Python reference implementation is based on the Swagger description. It ensures that the same Python UDF code will run on different backends without any modification. The idea is that Python UDFs work on Python OpenEO UDF API objects (SpatialExtent, RasterCollectionTile, VectorCollectionTile, StructuredData and UdfData, which make use of numpy, geopandas and shapely), not on backend-specific datatypes. If a backend has a Python interface, then it will be easier to implement Python UDF support in the backend by using the Python OpenEO UDF API reference library.

edzer commented 6 years ago

OK, I probably misunderstood. Let's have a chat some time next week!

m-mohr commented 6 years ago

It seems like there is a lot of confusion regarding the UDFs. This should be presented and discussed with all interested parties in the next weeks, maybe during one of the next dev telcos?!

m-mohr commented 6 years ago

This will be presented and discussed in the next dev telco on 21/06/2018 at 2 pm.

m-mohr commented 5 years ago

After the discussions at VITO, we decided to add /udf_runtimes again with the following content:

m-mohr commented 5 years ago

In the last dev telco we discussed changing the approach a bit and grouping the runtimes by programming language. Each language can have multiple versions and a default version (usually the latest).
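
Purely as an illustration of that grouping (field names, versions and package lists are illustrative and may differ from the final specification), such a /udf_runtimes response could look roughly like this:

{
  "R": {
    "default": "3.5.1",
    "versions": {
      "3.3.3": {"packages": ["Rcpp_0.12.10", "sf_0.5-4"]},
      "3.5.1": {"packages": ["Rcpp_0.12.10", "sf_0.6-3"]}
    }
  },
  "Python": {
    "default": "3.6",
    "versions": {
      "3.6": {"packages": ["numpy_1.14.0", "pandas_0.22.0"]}
    }
  }
}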

m-mohr commented 5 years ago

UDF runtimes are implemented in the API. All further work will be tackled in openeo-udf, openeo-r-udf, openeo-processes etc.