Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0

Process to load a vector cube #322

Closed soxofaan closed 1 year ago

soxofaan commented 2 years ago

While there are various discussions about how to conceptually define and handle vector cubes, I don't think we already have a standardized solution to load the vector data in the first place (except for inline GeoJSON).

I'll first try to list a couple of vector loading scenarios (with varying degrees of practicality and usefulness) and an initial discussion of possible solutions (if any).

m-mohr commented 2 years ago

Have you had a look at #319 (WIP)?

none of the current back-ends (or clients) implemented the user file storage "microservice" of the openEO API

I'm confused. It has been present for years in JS, Web Editor and R and is also present in the GEE back-end. EODC also claims that it's available.

users often want to share files with other users

Indeed, sharing was never implemented or defined due to lack of time (for implementation). I'd be happy to work on a specification, but I don't see who would implement that anytime soon, as other long-existing features such as the pure existence of a user file storage are not even present yet.

load from URL

That's something that is also missing from raster cubes, although for both vector and raster you could likely re-use load_result with a URL. That requires STAC, though, which is not really embracing vector data yet.

load_collection originally supported vector cubes, but that was removed

Yes, but actually only because we had no definition for vector cubes yet and the behavior was undefined. The intention there was to re-add it once we have that defined, which we are getting closer to.

soxofaan commented 2 years ago

I'm confused. It has been present for years in JS, Web Editor

My bad indeed, too many VITO-oriented assumptions apparently, I should have explored this more.

load from URL That's something that is also missing from raster cubes,

Yes, but I think it's significantly more important for vector data. Handling raster data is usually a big data problem, which is not ideal to solve with user-facing URLs. (Input) vector data will typically be small data (e.g. just one relatively small file), and handling it through URLs is straightforward, on both the client and the back-end side.

Have you had a look at https://github.com/Open-EO/openeo-processes/pull/319 (WIP)?

I apparently missed that when searching for related GitHub tickets. Some quick notes:

soxofaan commented 2 years ago

In short, I think the (technically) simplest and most versatile way to have vector cube loading functionality in the geopyspark driver and aggregator is loading from URL. In VITO and related backends we already support that in the read_vector process, but we want to standardize this and make sure it fits nicely with the other load_ processes.

Several options:

m-mohr commented 2 years ago

  • load_collection: this is about loading predefined vector cubes, which, as noted above, I don't expect to be a very practical solution for vector data

Why? Backends could surely provide some larger commonly used datasets through that, e.g. https://developers.google.com/earth-engine/datasets/tags/boundaries

  • load_result: this assumes the user already got a vector cube in the system somehow (chicken and egg) so it doesn't fundamentally solve the problem

In principle, you could load any URL, but this highly depends on the implementation. But we can surely add a "load_external" or so, which only allows loading external files from a URL. Shall I propose something?

soxofaan commented 2 years ago

load_external sounds like a good solution, but I'm wondering if it would differ enough (in terms of parameters and description) from the current load_uploaded_files to justify a new/other process. I think the modalities of loading the resources can be encoded with some kind of scheme (like done in other software):

m-mohr commented 2 years ago

I think they are distinct enough and while you can implement load_external always, load_uploaded_files only works if a workspace is present. So keeping them separate is better to avoid two versions of load_external in case no user workspace is present. I'd restrict load_uploaded_files to relative paths (user workspace access, should be status quo) and load_external to (absolute) URLs, i.e. http(s), maybe ftp or so. Where I see more commonalities are load_result (with a URL) and load_external. Should load_external also be able to accept extents, bands etc?
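The split proposed above (relative paths for the user workspace, absolute URLs for external files) can be sketched as a small check. This is only an illustration: the function name `classify_source` and the set of accepted schemes are assumptions, not part of any openEO specification.

```python
from urllib.parse import urlsplit

# Schemes treated as "external" here; this set is an assumption for illustration.
EXTERNAL_SCHEMES = {"http", "https"}


def classify_source(source: str) -> str:
    """Return "url" for absolute http(s) URLs and "workspace" for relative paths."""
    parts = urlsplit(source)
    if parts.scheme in EXTERNAL_SCHEMES and parts.netloc:
        return "url"
    # Anything else with a scheme, or an absolute file path, is rejected.
    if parts.scheme or source.startswith("/"):
        raise ValueError(f"Unsupported source: {source!r}")
    return "workspace"


print(classify_source("https://example.com/fields.geojson"))
print(classify_source("folder/fields.geojson"))
```

Under this sketch a back-end without a user workspace would simply reject everything classified as "workspace".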

soxofaan commented 2 years ago

(FYI: @jdries pointed me to https://github.com/Open-EO/openeo-api/issues/135 as well, which is closely related to this discussion)

soxofaan commented 2 years ago

Should load_external also be able to accept extents, bands etc?

At the moment I'm focused on loading vector data/cubes, so additional load_collection-style filters are not high prio for me.

Moreover, the reason to have filter options inside load_collection/load_result is to allow back-ends to have load-time optimizations when loading from big data sets, based on metadata that is available at the back-end. For load_external this argument is probably not valid anymore: e.g. we're loading a relatively small number of predefined (relatively small) files, possibly without any metadata that can be leveraged to avoid loading unnecessary data. So in a first iteration I don't think full alignment with load_collection/load_result is useful.

m-mohr commented 2 years ago

Hmm, I don't really understand how https://github.com/Open-EO/openeo-api/issues/135 is closely related? It's just another way of providing a user workspace for files, right?

Anyway, I'll post a PR for load_external later. Seems pretty straightforward.

jdries commented 2 years ago

I would really prefer to have the same process for loading files both externally or from a workspace. This would again lead to the problem where process graphs would need to adapt depending on where the file is coming from. If a user tries to load a file from an internal workspace on a backend that does not at all support uploading files, then where would the file path come from?

m-mohr commented 2 years ago

It's not a good idea that we design a process that supports two options, but one may not be supported. So back-ends need to change the process definition based on the API capabilities, which is prone to issues.

As the UDP is coming to the table again and again, we should probably focus on getting a solution for calling processes by parameter: https://github.com/Open-EO/openeo-api/issues/413 / https://github.com/Open-EO/openeo-processes/pull/307 - Eventually, this issue will always arise and you can't just reduce everything into one big process, e.g. all reducers. I mean the issue already exists if you want to load from various sources, e.g. from results or from a user-uploaded file.

If a user tries to load a file from an internal workspace on a backend that does not at all support uploading files, then where would the file path come from?

I don't understand the question. If you don't implement/support uploading files, you don't implement load_uploaded_files and as such don't need a path?!

jdries commented 2 years ago

The workaround where we call processes by parameter is still more complex than simply having one process. openEO is about solving complex things in the backend, so that users don't have to deal with it. With the parameterized process proposal, both the UDP implementor and user are facing more complexity. Generalizing the problem (like the reducers analogy) doesn't really help with the discussion, we need to evaluate pros and cons on a case by case basis. Sometimes there will be good reasons to split processes, in other cases not.

My question was actually trying to reply to:

So keeping them separate is better to avoid two versions of load_external in case no user workspace is present.

I don't get the problem with loading files when no user workspace is present?

m-mohr commented 2 years ago

But if this is now coming up on a nearly weekly basis, then it seems that we should look into a solution. We have the issue here for the reducers, it's not a generalization. It's an actual use case (outside of openEO Platform). How shall I solve it? It's simply not possible except I do a lot of if/else, but that would also be possible for loading data, of course.

The issue is that you need to adapt the process specification and remove one of the schemas if user uploads are not supported. That's all I meant. So it's not ideal, seeing that many providers just copy and paste the processes without adapting them and as such expose something that is not actually there. But I could say that's an issue for them, indeed.
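A minimal sketch of what "removing one of the schemas" could look like on the back-end side. The process name `load_files`, the parameter layout and the helper `adapt_spec` are hypothetical; the `uri` and `file-path` subtypes are assumed here to mark the two alternatives.

```python
import copy

# Hypothetical process spec: one parameter whose schema lists two alternatives.
LOAD_FILES_SPEC = {
    "id": "load_files",  # hypothetical process name
    "parameters": [
        {
            "name": "paths",
            "schema": [
                {"type": "string", "subtype": "uri"},
                {"type": "string", "subtype": "file-path"},
            ],
        }
    ],
}


def adapt_spec(spec: dict, supports_user_files: bool) -> dict:
    """Return a copy of the spec without the file-path schema if uploads are unsupported."""
    adapted = copy.deepcopy(spec)
    if not supports_user_files:
        for param in adapted["parameters"]:
            schemas = param["schema"]
            if isinstance(schemas, list):
                param["schema"] = [
                    s for s in schemas if s.get("subtype") != "file-path"
                ]
    return adapted


print(adapt_spec(LOAD_FILES_SPEC, supports_user_files=False)["parameters"][0]["schema"])
```

A back-end that copies the spec verbatim skips this step, which is exactly the situation described above: the published schema advertises workspace paths that cannot be resolved.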

jdries commented 2 years ago

No problem to work on parameterization of process names, it is a useful thing to have for certain cases.

It's indeed not ideal if backends claim support for something because of copy pasting, but it's already very good that a correct solution is in fact available through the schemas. If we then ever have a very advanced aggregator that automatically adapts strategy based on where a backend can load data from, then that would work. But this is all fairly futuristic, for now, users will indeed simply get an error if an unsupported feature is used, and then they can contact the backend.

soxofaan commented 2 years ago

It's not a good idea that we design a process that supports two options, but one may not be supported.

We already do that with the format parameter: back-ends have the liberty to support different input/output formats. Supporting different storage solutions feels like the same thing, which is often implemented in other software through a protocol/scheme prefix. This is a well-established concept and it won't make the process description that much harder.

As a user I appreciate it when software or a library takes care of all the annoying file format details and storage details, and I just can use the same read or load function to load, for example, a local geotiff file or a remote netcdf file. I think openEO should cater for users that expect this level of simplicity.

clausmichele commented 2 years ago

I agree with @soxofaan and I would go with load_external where it's possible to load from an URL or (if available) from the user workspace. Allowing the loading from an external URL would also allow to use vector processes on one back-end and re-use the result on another one that does not support them yet.

m-mohr commented 2 years ago

I just can use the same read or load function to load

Means eventually combining all load_* processes into a single one?

I agree with @soxofaan and I would go with load_external where it's possible to load from an URL or (if available) from the user workspace.

Okay, but user workspace is not external. So I'd propose renaming load_uploaded_files to load_file(s?) and add URL support.

Allowing the loading from an external URL would also allow to use vector processes on one back-end and re-use the result on another one that does not support them yet.

That's already captured by load_results?!

clausmichele commented 2 years ago

Okay, but user workspace is not external. So I'd propose renaming load_uploaded_files to load_file(s?) and add URL support.

yes, maybe load_file(s) fits better both scenarios.

That's already captured by load_results?!

from my point of view no. Currently load_result doesn't support vector_cubes and it has been updated to support the same parameters as load_collection (spatial_extent, temporal_extent and so on). So it has been moved to be more raster specific than general purpose.

Anyway, load_result across back-ends would work only within the federation (openEO Platform) and not with other back-ends (EURAC, GEE).

m-mohr commented 2 years ago

Currently load_result doesn't support vector_cubes

No process really supports vector-cubes yet, we are just adding it in right now. See #319.

and it has been updated to support the same parameters as load_collection (spatial_extent, temporal_extent and so on).

spatial/temporal can be used with vector, only bands is raster specific.

Anyway, load_result across back-ends would work only within the federation (openEO Platform) and not with other back-ends (EURAC, GEE).

That's maybe a restriction of openEO Platform, but the process doesn't specify such a restriction. It allows retrieving results by URL. So if you have published your result, then you could in principle load it from everywhere, even from GEE.

soxofaan commented 2 years ago

Means eventually combining all load_* processes into a single one?

Ultimately that could be an option, but I don't think we should aim for that at this point.

load_collection and load_result are a bit special in the sense that the storage details are not really standardized and mostly hidden from the user. It's probably feasible to replace load_result with loading from URLs, but loading just by job_id is a nicer UI for the user I think.

So I'd propose renaming load_uploaded_files to load_file(s?) and add URL support.

:+1: load_data could also be an option (which is a bit more generic than _files)

m-mohr commented 2 years ago

Ultimately that could be an option

But a complicated one: How do you decide between collections, files, results? They are all just strings and you eventually will run into conflicts.

It's probably feasible to replace load_result with loading from URLs

Basically, the load_* processes are now structured as such:

load_data

The load* processes are named in a way that they reflect the "source". Data is such an overloaded and general term that it could basically include all load* processes.

soxofaan commented 2 years ago

But a complicated one: How do you decide between collections, files, results? They are all just strings and you eventually will run into conflicts.

Not necessarily, if you use a scheme prefix, e.g. collection:SENTINEL1_GRD, collection://openeo.vito.be/SENTINEL1_GRD, job:12312-43223-2343, job://openeo.cloud/23432543-234-23423. But then again: I don't think merging load_collection or load_result into a generic load_ process should be high prio at the moment. It could be something to consider for a v2 or v3 of the API.
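To illustrate how such prefixes would remove the ambiguity between plain strings, here is a rough sketch of a parser for the reference forms listed above. The names `SourceRef` and `parse_source` are made up for this example, and rejecting references without a prefix is an assumption, not something the thread decides.

```python
from typing import NamedTuple, Optional


class SourceRef(NamedTuple):
    kind: str            # prefix, e.g. "collection" or "job"
    host: Optional[str]  # back-end host, if the "//host/" form is used
    name: str            # collection id, job id, or remaining path


def parse_source(ref: str) -> SourceRef:
    """Split a prefixed reference into its kind, optional host and name."""
    kind, sep, rest = ref.partition(":")
    if not sep or not kind:
        # Without a prefix the string is ambiguous (collection? file? job?).
        raise ValueError(f"Missing scheme prefix: {ref!r}")
    if rest.startswith("//"):
        host, _, name = rest[2:].partition("/")
        return SourceRef(kind, host or None, name)
    return SourceRef(kind, None, rest)


for ref in ["collection:SENTINEL1_GRD", "job://openeo.cloud/23432543-234-23423"]:
    print(parse_source(ref))
```

A generic load process could then dispatch on `kind`, while today's separate load_* processes encode the same information in the process name.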

load_files: Read from exactly one source file directly (i.e. required metadata needs to be included) by URL or path (uploaded by user) - no filtering due to single file and usually smaller size

FYI: load_uploaded_files currently expects an array of paths, not a single file: https://github.com/Open-EO/openeo-processes/blob/d0ce91fcd347360b907ea2d9589d7564a2c1e1e3/proposals/load_uploaded_files.json#L12-L16

Going for single file is fine for loading vector cubes I guess, but for loading raster data, it's probably best to have an option to specify multiple files.

m-mohr commented 2 years ago

Good point regarding the array of paths. I guess then we should allow arrays of URLs too. So, replace all "single" with "multiple". So then the difference is that load_result points to STAC catalogs and load_files points to the data files (in STAC terminology: assets) directly.
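As a rough illustration of that distinction, the two process graph nodes below contrast loading assets directly with loading via a STAC document. `load_files` is only the name floated in this thread, its `paths` and `format` arguments are assumed to mirror load_uploaded_files, the `id` argument of load_result is assumed to accept a URL, and all URLs are placeholders.

```python
import json

# Hypothetical node: points at the data files (STAC assets) directly.
load_files_node = {
    "process_id": "load_files",
    "arguments": {
        "paths": [
            "https://example.com/fields_2021.geojson",
            "https://example.com/fields_2022.geojson",
        ],
        "format": "GeoJSON",
    },
}

# Node pointing at STAC metadata that describes a published result.
load_result_node = {
    "process_id": "load_result",
    "arguments": {"id": "https://example.com/results/collection.json"},
}

print(json.dumps({"files": load_files_node, "result": load_result_node}, indent=2))
```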

soxofaan commented 2 years ago

The load* processes are named in a way that they reflect the "source". Data is such an overloaded and general term that it could basically include all load* processes.

"data" is indeed very generic. My concern is mainly that "file" has a very "static" sound to it, while URLs can be more dynamic, e.g. it can be an on-the-fly query. But that's probably just a matter of POV, the back-end usually does not have to be aware of that and can consider it to be a "file".

m-mohr commented 1 year ago
soxofaan commented 1 year ago

Update to listing of https://github.com/Open-EO/openeo-processes/issues/322#issuecomment-1463846950: