Open-EO / openeo-api

The openEO API specification
http://api.openeo.org
Apache License 2.0

Design best practice/approach to publish results into external catalog #450

Open jdries opened 2 years ago

jdries commented 2 years ago

For the EU27 croptype map, openEO generates 10000+ products. These are all separate batch jobs with their own metadata, that in fact make up a single STAC 'collection'. Instead of keeping this in openEO, I would prefer to publish these results immediately into an external STAC or opensearch catalog.

Requirements:

So I'd like to design an approach for this. One option is to add various arguments to save_result; alternatively, do we perhaps need a separate 'export_result' process?
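To make the two alternatives concrete, here is a minimal sketch of what the save_result variant could look like as a process graph. The "export" argument and its fields ("target", "catalog") are hypothetical and not part of the openEO specification; they only illustrate where such options would live.

```python
# Hypothetical process graph: save_result extended with an "export" argument
# that pushes results (data + STAC metadata) to an external catalog instead
# of back-end storage. All "export" fields are invented for illustration.
process_graph = {
    "load1": {
        "process_id": "load_collection",
        "arguments": {
            "id": "SENTINEL2_L2A",
            "spatial_extent": None,
            "temporal_extent": None,
        },
    },
    "save1": {
        "process_id": "save_result",
        "arguments": {
            "data": {"from_node": "load1"},
            "format": "GTiff",
            # hypothetical extension, not in the openEO spec:
            "export": {
                "target": "s3://my-bucket/croptype/",
                "catalog": "https://catalog.example/collections/eu27-croptype",
            },
        },
        "result": True,
    },
}
print(process_graph["save1"]["arguments"]["export"]["target"])
```

The 'export_result' alternative would move the same arguments into a dedicated process node placed after save_result, which keeps save_result itself unchanged.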

m-mohr commented 2 years ago

We have also discussed this in the EDC project. Our proposed solution was to simply pass the canonical link of the STAC metadata to the EDC service, which then ingests the STAC metadata from there. The initial request to EDC is made via the client (here: the Web Editor).

The main differences to your use case seem to be that:

  1. you don't want to store the results at your back-end at all, and
  2. you have 10000+ independent batch jobs (independent from a Core API perspective).

If you are only interested in point 2, the collection you are producing could be used to ingest the data into an external catalog. One question that arises is whether the back-end can easily access and store the data at various external sources, or whether this should reside at the client. We could integrate such functionality at the client level by reading the STAC metadata with e.g. PySTAC and then letting users move the data over to an arbitrary host of their choice. Or maybe these are two different use cases (external host under control of the back-end vs. under control of the user)?
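The client-level variant boils down to walking the job's STAC metadata and collecting the asset URLs to transfer. A sketch, using plain dicts instead of PySTAC to stay dependency-free; the structure ("assets" mapping to objects with an "href") follows the STAC Item specification, while the URLs are placeholders:

```python
# Client-side sketch: collect downloadable asset URLs from one STAC item,
# as a first step before copying them to a user-chosen host.
def asset_hrefs(stac_item: dict) -> list:
    """Return the asset URLs of a STAC item, in declaration order."""
    return [asset["href"] for asset in stac_item.get("assets", {}).values()]

# placeholder item, shaped like a batch-job result entry
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "result-0001",
    "assets": {
        "data": {"href": "https://backend.example/jobs/0001/result.tif"},
        "metadata": {"href": "https://backend.example/jobs/0001/result.json"},
    },
}

for href in asset_hrefs(item):
    # a real client would download each asset, re-upload it to the target
    # host, and rewrite the hrefs before ingesting the metadata elsewhere
    print(href)
```

With PySTAC one would instead load the canonical link via pystac.Item.from_file and iterate item.assets, but the logic is the same.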

Anyway, all solutions that are not directly part of the initial request would require temporary storage (point 1 above). So it may indeed be the cleanest solution to come up with a process, basically an equivalent to load_files but for storing, if temporary storage at the back-end is no option. In this case the question is how many storage options you'd have, and whether you'd want to add these details to the process or whether something like /storage (comparable to /file_formats) would be required. Or is this too much of a "proprietary" approach for a "niche" use case, so that it should be a simple parameter in save_result? That doesn't feel overly clean though, as (1) load_result and save_result are somewhat bound to the result endpoints, and (2) adding multiple storage options for all file formats seems repetitive and clunky.
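For comparison, a hypothetical /storage discovery document, modeled on the existing /file_formats endpoint, might look like the following. None of these fields exist in the openEO API; this is purely a sketch of the shape such a capability listing could take:

```python
# Hypothetical /storage response, analogous to /file_formats: each key is a
# storage type the back-end can export to, with the parameters a client
# would need to supply. All names are invented for illustration.
storage_capabilities = {
    "storage": {
        "s3": {
            "title": "S3-compatible object storage",
            "parameters": {
                "endpoint": {"type": "string"},     # e.g. a MinIO or AWS URL
                "bucket": {"type": "string"},
                "credentials": {"type": "object"},  # how auth is passed is
                                                    # exactly the open question
            },
        },
    },
}
print(sorted(storage_capabilities["storage"]["s3"]["parameters"]))
```

A client could inspect this document to decide which export targets a given back-end supports, much like it inspects /file_formats today.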

Thoughts?

jdries commented 1 year ago

We indeed want to avoid storing on the back-end, because files can be very large and we don't want to leave cleanup up to the user. The storage system we target is S3, which is quite prevalent. However, the fact that files can be large means we need to rely on multipart upload, which is somewhat more involved than the simple HTTP POST that can be used for smaller files.
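To illustrate why multipart matters: a single HTTP request would have to stream an entire multi-gigabyte result in one go, with no way to retry just the failed portion, whereas S3 multipart upload (CreateMultipartUpload, then UploadPart per chunk, then CompleteMultipartUpload) splits the transfer into independently retryable parts. A back-of-the-envelope sketch with placeholder sizes:

```python
import math

PART_SIZE = 64 * 1024 * 1024   # 64 MiB per part (S3's minimum part size is 5 MiB)
file_size = 10 * 1024 ** 3     # e.g. a 10 GiB result file (placeholder)

# Multipart upload splits the file into independently uploaded, independently
# retryable parts; a plain POST/PUT would have to resend everything on failure.
n_parts = math.ceil(file_size / PART_SIZE)
print(n_parts)  # → 160
```

In practice a back-end would not hand-roll this: boto3's high-level upload_file switches to multipart automatically once the file exceeds the configured threshold, so "somewhat more involved" mostly means extra configuration and error handling, not a custom protocol implementation.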

So I would avoid requiring a /storage endpoint, as S3 is supported by most storage systems nowadays. I'm leaning towards an export_files_s3 kind of option.

m-mohr commented 1 year ago

The difficulty is the authentication, I assume? If you want to move files between a back-end and an S3 provider, the back-end needs a uniform way to authenticate with the S3 service, or the transfer needs to go through the user's local system, which is not ideal. Thoughts?

If back-ends don't want to leave cleanup up to the users, would it make sense to clean up by default but let users explicitly extend the storage time, so that the results are stored for, say, another month? Fiddling with S3, or finding a provider for it, is not something an average user is necessarily aware of or wants to know about; it is somewhat against the simplicity of openEO. I myself wouldn't know right now what and where to host my data if I wanted to (and AWS is not an option for universities due to the credit card requirement).

Also, the Web Editor might soon get "app-like" behavior where you can load results and details about them from a STAC URL. But for this, people who moved their data over to S3 still need to figure out details like requester-pays and CORS issues, which is yet another difficulty that you usually want to take away from users.

jdries commented 1 year ago

@LukeWeidenwalker @christophreimer is this linked to the 'user collections' concept?

jdries commented 1 year ago

To answer the questions from @m-mohr: exporting to S3 would mostly target advanced users and projects that already fiddle with S3. You are correct in saying it is less simple compared to the other options we offer.

With the new 'user workspaces' option that is being proposed, we would probably end up with something similar, but with some options to make it simpler. For instance, if the user workspace supports OIDC authentication, the credentials issue may be solved?