Open-EO / openeo-udf

The OpenEO UDF Python reference implementation and interface description
https://open-eo.github.io/openeo-udf/
Apache License 2.0

UDFs accessing files in backend's workspace #9

Open pramitghosh opened 5 years ago

pramitghosh commented 5 years ago

As some UDFs might need additional files to support their execution, I think having a way to exchange auxiliary files between the backend and the UDF service would be useful.

In my current implementation using R (Open-EO/openeo-r-udf), files can be sent to the UDF service when it is initially invoked by the backend. These files are then made available locally to the UDF. This, of course, means that the user (operating the client) somehow needs to tell the backend which files are needed by their code in the UDF (or send the entire workspace for the process in question, which is probably a bad idea). Additionally, files could be imported from any source over the internet.
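
A rough sketch (in Python, purely for illustration) of how a backend could embed the selected workspace files in the invocation payload; the field names, the zip-as-base64 encoding and the helper name are assumptions, not the actual openeo-r-udf protocol:

```python
# Illustrative sketch only: bundle user-selected workspace files into a zip and
# embed it, base64-encoded, in the JSON request sent to the UDF service.
import base64
import io
import json
import zipfile

def build_udf_request(code, workspace_files):
    """code: the user's UDF code as a string;
    workspace_files: mapping of file name -> file content (bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in workspace_files.items():
            archive.writestr(name, content)
    payload = {
        "code": code,                                              # hypothetical field name
        "aux_files": base64.b64encode(buffer.getvalue()).decode("ascii"),
    }
    return json.dumps(payload)
```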

Another possibility (as discussed with @m-mohr) would be to make the backend's workspace available on demand to the UDFs at runtime. For example, the UDF service could provide a function (e.g. import_backend_data(<filename>)) which accesses the backend's workspace through an authentication mechanism (e.g. an access token passed along when the UDF service is initially invoked) and imports the desired file. In general, this would be possible if the backend exposes its workspace to the UDF service through an endpoint. IMHO this approach may be time-consuming, as connecting to the backend and transferring data multiple times incurs quite a lot of overhead. I believe caching could help, but only if the same set of files is required by multiple UDFs in the same process. (But then again, how do we know beforehand which files the UDF will need frequently, unless the user states it explicitly?)
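
To make the idea concrete, here is a hypothetical sketch of import_backend_data(); nothing like this exists in openeo-udf yet, and the endpoint URL, header and parameter names are all assumptions for illustration:

```python
# Hypothetical import_backend_data(): fetch one file from the backend's
# workspace, authenticated with the access token passed at invocation time.
import os
import requests

BACKEND_WORKSPACE_URL = "https://backend.example.org/workspace"  # assumed endpoint

def import_backend_data(filename, access_token, target_dir="."):
    response = requests.get(
        f"{BACKEND_WORKSPACE_URL}/{filename}",
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=60,
    )
    response.raise_for_status()
    local_path = os.path.join(target_dir, os.path.basename(filename))
    with open(local_path, "wb") as f:
        f.write(response.content)  # make the file available locally to the UDF
    return local_path
```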

Looking forward to hearing everyone's opinions on this. Thanks!

m-mohr commented 5 years ago

Another option that came to my mind: we previously thought about bundling UDFs in some kind of archive (e.g. a special ZIP file), which includes the code, a metadata file, and potentially also additional files. That would be very convenient, as users could simply share their UDF in a single file with all (non-runtime) dependencies. This wouldn't work for sending code directly as a string to the UDF server, but I see that more as a way for very simple UDFs anyway; more complex UDFs should use the bundle.
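
A quick sketch of how such a bundle could be assembled (the file names and metadata keys here are made up, not an agreed-upon format):

```python
# Illustrative only: pack the UDF code, a metadata file and auxiliary files
# into a single zip archive that users can share as one file.
import json
import zipfile
from pathlib import Path

def create_udf_bundle(code_path, metadata, extra_files, out_path):
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.write(code_path, arcname="udf.py")                  # the UDF code itself
        bundle.writestr("metadata.json", json.dumps(metadata))     # description, author, dimensions, ...
        for path in extra_files:
            bundle.write(path, arcname=f"data/{Path(path).name}")  # non-runtime dependencies
```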

(I have a very user-centric view; I don't really care how the files are actually transferred between back-end and UDF service. That's an implementation detail.)

pramitghosh commented 5 years ago

Yes, true. My current implementation uses a ZIP file (encoded as a string) that the backend passes over to the UDF service. It contains:

  1. the actual data in GeoTIFF format, organized in directories (this could be made optional, as the user might want to import the data from sources other than the backend)
  2. metadata file(s): containing a description of the UDF, author information, dimension info (which could help with scalability), etc.
  3. license file: containing licensing info
  4. a /data directory: containing all other auxiliary files and directories, e.g. additional R scripts, C++ code, or any other files. These are copied over to the UDF's working directory before the UDF is executed and are hence accessible to the UDF (see the sketch after this list).
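
On the service side this could look roughly like the following (assumed behaviour mirroring the layout above, not the actual openeo-r-udf code): unpack the received archive and copy the /data directory into the UDF's working directory before the code runs.

```python
# Sketch of the service side: extract the bundle and make the contents of
# data/ available in the UDF's working directory.
import shutil
import tempfile
import zipfile
from pathlib import Path

def prepare_udf_workspace(bundle_path, working_dir):
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(bundle_path) as bundle:
            bundle.extractall(tmp)                  # GeoTIFFs, metadata, license, data/
        data_dir = Path(tmp) / "data"
        if data_dir.is_dir():
            shutil.copytree(data_dir, Path(working_dir), dirs_exist_ok=True)
```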

Metadata and license files have not been used yet, to keep things simple for testing, but they could be present anyway without causing any harm. The only things that are not included are:

  1. a "legend" file containing info on the paths, band names, timestamps etc. of the GeoTIFF files: this is currently passed to the UDF service as a JSON array (an illustrative example follows after this list)
  2. the user's code: this is passed as a string embedded in the JSON. Of course, this could contain source() statements to run other R scripts in the /data directory.
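
For illustration, one entry of such a legend array could look like the following (the field names are made up; the actual structure used by openeo-r-udf may differ):

```python
# Hypothetical example of the "legend" passed alongside the GeoTIFF files.
legend = [
    {
        "path": "data/raster/t1_b1.tif",      # location of the GeoTIFF in the archive
        "band": "B04",                        # band name
        "timestamp": "2018-06-01T00:00:00Z",  # acquisition time
    },
    # ... one entry per GeoTIFF file
]
```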

But these could easily be put into the ZIP file as well, making the UDF's operation reproducible. As you said, this would also make it easy to share and store in a database (say, after indexing on some fields in the metadata file(s), such as dimension info), somewhat similar to R packages on CRAN. Of course, the actual data (GeoTIFF files) should be removed.